Kubernetes

Purpose

Lumie runs on K3s rather than a managed Kubernetes service. This page is a reference document for the Kubernetes layer itself: cluster topology, K3s runtime contracts, addon boundaries, and the handoff from Ansible bootstrap to Argo CD steady state.

For the higher-level cluster map, see Cluster Overview. For ingress and DNS, see Networking Overview.

Source Paths

| Path | Role |
| --- | --- |
| lumie-infra/provision/ansible/group_vars/all.yml | K3s version, CIDRs, registry mirror, master API target, kubeconfig paths |
| lumie-infra/provision/ansible/group_vars/masters.yml | Server installation flags and TLS SANs |
| lumie-infra/provision/ansible/roles/k3s-master/tasks/main.yml | K3s server install, token creation, registry mirror, and API readiness checks |
| lumie-infra/provision/ansible/roles/k3s-worker/tasks/main.yml | Worker token fetch, join flow, and registration verification |
| lumie-infra/provision/ansible/playbooks/fetch-kubeconfig.yml | Public and private kubeconfig export for operators |
| lumie-infra/platform/traefik-config/helmchartconfig.yaml | Patch for the K3s-bundled Traefik addon |
| lumie-infra/platform/coredns-config/daemonset.yaml | Cluster DNS workload shape override |

Cluster Contract

The K3s baseline comes from Ansible group vars:

# lumie-infra/provision/ansible/group_vars/all.yml
k3s_version: "v1.34.3+k3s1"
k3s_cluster_cidr: "10.42.0.0/16"
k3s_service_cidr: "10.43.0.0/16"
zot_mirror_host: "zot.lumie-infra.com"
zot_mirror_endpoint: "http://zot.zot.svc.cluster.local:5000"
k3s_master_private_ip: "10.0.0.241"

That contract drives:

  • K3s server and agent version pinning.
  • Pod and Service CIDRs used by OCI security lists and worker routing.
  • Containerd registry mirror behavior on every node.
  • Worker join target for every agent rollout.
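One way to check this contract against a running cluster is to read a pinned value out of the group vars and compare it with what each node reports. The sketch below is illustrative and not part of the repo: `read_group_var` and `check_version_drift` are hypothetical helper names, and the parsing assumes flat top-level `key: "value"` lines like the excerpt above.

```shell
# Hypothetical helper: print the value of a top-level `key: "value"` line.
# Assumes flat YAML like the excerpt above; this is not a general YAML parser.
read_group_var() {
  key="$1"
  file="$2"
  sed -n "s/^${key}:[[:space:]]*//p" "$file" | head -n 1 | tr -d '"'
}

# Hypothetical check: report nodes whose kubelet version differs from the pin.
# Needs working kubectl access to the cluster.
check_version_drift() {
  vars_file="$1"
  pinned="$(read_group_var k3s_version "$vars_file")"
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' |
    while read -r node version; do
      [ "$version" = "$pinned" ] || echo "drift: $node runs $version, pinned $pinned"
    done
}
```

Under those assumptions, `check_version_drift lumie-infra/provision/ansible/group_vars/all.yml` prints nothing when every node matches the pin.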

Topology

The intended topology in code is one server plus multiple agents across two OCI VCNs. The live cluster inspected on June 14, 2026 showed:

| Node | Role | Internal IP | Version |
| --- | --- | --- | --- |
| k3s-master | control-plane | 10.0.0.241 | v1.34.3+k3s1 |
| k3s-worker-2 | worker | 10.0.0.2 | v1.34.3+k3s1 |
| k3s-worker-3 | worker | 10.1.0.148 | v1.34.3+k3s1 |
| k3s-worker-4 | worker | 10.1.0.9 | v1.34.3+k3s1 |

Treat that table as observed runtime state, not the desired-state schema. The desired-state schema is the Terraform node maps plus the Ansible inventory builder.

Runtime Flow

Entrypoints And Public Surface

| Surface | Source of truth | Notes |
| --- | --- | --- |
| Kubernetes API | k3s-master on https://10.0.0.241:6443 | Workers join over the private address |
| Operator kubeconfig | playbooks/fetch-kubeconfig.yml | Exports k3s-public.yaml, k3s-private.yaml, and config symlink |
| Built-in Traefik addon | K3s addon plus platform/traefik-config/helmchartconfig.yaml | Active ingress controller in kube-system |
| CoreDNS | K3s addon config plus repo-managed DaemonSet override | One DNS pod per node |
| Container registry mirror | /etc/rancher/k3s/registries.yaml | Points pulls at in-cluster Zot |

K3s Install And Join Behavior

The master role installs K3s only when the binary is missing, writes the registry mirror config before service start, then waits for /var/lib/rancher/k3s/server/node-token and the local API /healthz.
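For orientation, a K3s mirror file has roughly the shape printed by the sketch below. The role's actual template is not reproduced on this page, so `render_registries_yaml` is a hypothetical helper, and which registry name gets mapped to which endpoint is an assumption.

```shell
# Hypothetical helper: print a minimal K3s registries.yaml mirror entry.
# $1 = registry name containerd is asked for, $2 = mirror endpoint URL.
# The real template in the Ansible roles may carry more fields than this.
render_registries_yaml() {
  registry="$1"
  endpoint="$2"
  cat <<EOF
mirrors:
  "${registry}":
    endpoint:
      - "${endpoint}"
EOF
}
```

Calling `render_registries_yaml "zot.lumie-infra.com" "http://zot.zot.svc.cluster.local:5000"` would pair `zot_mirror_host` with `zot_mirror_endpoint`. That pairing is a guess at how the two variables relate, so check it against the role template before relying on it.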

Workers do not carry static join tokens in Git. They:

  1. Slurp the join token from the master at run time.
  2. Wait for 10.0.0.241:6443 to accept TCP connections.
  3. Install k3s-agent with K3S_URL and K3S_TOKEN.
  4. Poll kubectl get nodes on the master until the node registers.

That token flow is the main idempotency guard for worker rollouts: the token is read live from the cluster rather than copied into inventory.
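The four worker steps can be sketched in plain shell for readers who want to reproduce them by hand. This is a translation, not the Ansible role: `retry`, `join_worker`, the `MASTER_SSH` variable, passwordless sudo over SSH, and the retry budgets are all assumptions, and the installer call uses the upstream `get.k3s.io` script with its `INSTALL_K3S_VERSION`, `K3S_URL`, and `K3S_TOKEN` variables.

```shell
# Retry a command until it succeeds or the attempt budget runs out.
# Usage: retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while ! "$@"; do
    [ "$n" -lt "$attempts" ] || return 1
    n=$((n + 1))
    sleep "$delay"
  done
  return 0
}

# Hypothetical shell rendering of the worker join steps. Run on the worker.
# MASTER_SSH (for example user@10.0.0.241) is an assumed variable.
join_worker() {
  master_ip="10.0.0.241"
  k3s_version="v1.34.3+k3s1"
  # 1. Read the join token live from the master instead of storing it in Git.
  token="$(ssh "$MASTER_SSH" sudo cat /var/lib/rancher/k3s/server/node-token)" || return 1
  # 2. Wait for the API port to accept TCP connections.
  retry 30 5 nc -z "$master_ip" 6443 || return 1
  # 3. Install the agent, pinned to the contract version.
  curl -sfL https://get.k3s.io |
    INSTALL_K3S_VERSION="$k3s_version" K3S_URL="https://${master_ip}:6443" K3S_TOKEN="$token" sh - || return 1
  # 4. Poll the master until this node is registered (node name defaults to the hostname).
  retry 30 5 ssh "$MASTER_SSH" sudo k3s kubectl get node "$(hostname)"
}
```

Because the token is fetched in step 1 on every run, rotating it on the master needs no inventory change in this sketch either.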

Addon Boundaries

K3s ships some cluster services, but Lumie overrides pieces of them:

  • Traefik is not installed as a standalone Helm release in this repo. The active object is the K3s addon patched by HelmChartConfig.
  • CoreDNS still uses the kube-system/coredns ConfigMap supplied by the cluster, but the workload shape is replaced with a repo-managed DaemonSet.
  • local-path-provisioner and metrics-server remain K3s-managed runtime components; they are visible in the live cluster but not versioned under lumie-infra.

Operational Notes

  • Both master and worker roles template /etc/rancher/k3s/registries.yaml so containerd prefers the in-cluster Zot mirror. A broken Zot or Traefik path can therefore surface as node-level image pull failures.
  • The master role adds both public and private IPs as TLS SANs, and fetch-kubeconfig.yml rewrites 127.0.0.1 to each address for operator access.
  • Worker installation runs serially within each tenancy group, which reduces join races and isolates a failure to a single node.
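The kubeconfig rewrite mentioned above amounts to swapping the loopback server URL for an address covered by the TLS SANs. A minimal sketch follows, assuming the K3s default server URL of `https://127.0.0.1:6443`; `rewrite_kubeconfig_server` is a hypothetical helper, not the playbook's own task.

```shell
# Hypothetical helper: copy a server-local kubeconfig and point it at ADDRESS.
# Usage: rewrite_kubeconfig_server SRC ADDRESS DEST
# ADDRESS must be listed in the server's TLS SANs or certificate checks fail.
rewrite_kubeconfig_server() {
  src="$1"
  address="$2"
  dest="$3"
  sed "s#https://127\.0\.0\.1:6443#https://${address}:6443#" "$src" > "$dest"
}
```

For example, `rewrite_kubeconfig_server k3s.yaml 10.0.0.241 k3s-private.yaml` would produce a private-address variant from a fetched copy of the server file, leaving the source untouched.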

Contract Drift

Two inspected sources disagree with the active cluster behavior:

  • lumie-infra/provision/ansible/README.md still says Traefik | Disabled, but masters.yml no longer passes --disable traefik, platform/traefik-config/helmchartconfig.yaml patches the bundled addon, and the live cluster has a running traefik pod in kube-system.
  • terraform.tfvars.example still includes worker-1, but the live cluster inspected on June 14, 2026 does not.

When documenting or debugging Kubernetes behavior, trust the installed roles and live cluster over the stale README summary.

Failure Modes

| Failure point | Behavior |
| --- | --- |
| k3s_master_private_ip drift | Worker joins stall even if SSH to nodes still works |
| Zot mirror path unavailable | Pods can fail image pulls across the cluster because containerd prefers the mirror |
| Traefik/CoreDNS addon assumptions wrong | Operators can chase the wrong manifests if they assume these are fully standalone repo-managed installs |
| kubeconfig fetch skipped | Local kubectl access remains pointed at 127.0.0.1 from the server-local file |

Verification

kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get helmchartconfig -n kube-system traefik -o yaml
rg -n "k3s_version|k3s_cluster_cidr|k3s_service_cidr|zot_mirror|tls-san" \
lumie-infra/provision/ansible/group_vars \
lumie-infra/provision/ansible/roles