Kubernetes
Purpose
Lumie runs on K3s rather than a managed Kubernetes service. This page is a reference document for the Kubernetes layer itself: cluster topology, K3s runtime contracts, addon boundaries, and the handoff from Ansible bootstrap to Argo CD steady state.
For the higher-level cluster map, see Cluster Overview. For ingress and DNS, see Networking Overview.
Source Paths
| Path | Role |
|---|---|
| `lumie-infra/provision/ansible/group_vars/all.yml` | K3s version, CIDRs, registry mirror, master API target, kubeconfig paths |
| `lumie-infra/provision/ansible/group_vars/masters.yml` | Server installation flags and TLS SANs |
| `lumie-infra/provision/ansible/roles/k3s-master/tasks/main.yml` | K3s server install, token creation, registry mirror, and API readiness checks |
| `lumie-infra/provision/ansible/roles/k3s-worker/tasks/main.yml` | Worker token fetch, join flow, and registration verification |
| `lumie-infra/provision/ansible/playbooks/fetch-kubeconfig.yml` | Public and private kubeconfig export for operators |
| `lumie-infra/platform/traefik-config/helmchartconfig.yaml` | Patch for the k3s-bundled Traefik addon |
| `lumie-infra/platform/coredns-config/daemonset.yaml` | Cluster DNS workload shape override |
Cluster Contract
The K3s baseline comes from Ansible group vars:
```yaml
# lumie-infra/provision/ansible/group_vars/all.yml
k3s_version: "v1.34.3+k3s1"
k3s_cluster_cidr: "10.42.0.0/16"
k3s_service_cidr: "10.43.0.0/16"
zot_mirror_host: "zot.lumie-infra.com"
zot_mirror_endpoint: "http://zot.zot.svc.cluster.local:5000"
k3s_master_private_ip: "10.0.0.241"
```
That contract drives:
- K3s server and agent version pinning.
- Pod and Service CIDRs used by OCI security lists and worker routing.
- Containerd registry mirror behavior on every node.
- Worker join target for every agent rollout.
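When checking whether a node or security list still matches this contract, it can help to read the pinned values straight from the group vars file. The following is a minimal sketch, assuming the file keeps these entries as flat `key: "value"` scalars; `contract_value` is a hypothetical helper, not something shipped in the repo, and it is not a general YAML parser.

```shell
# Print the value of a top-level "key: value" entry from a flat YAML file.
# Assumes simple quoted or unquoted scalars on a single line.
contract_value() {
  key="$1"
  file="$2"
  sed -n "s/^${key}:[[:space:]]*//p" "$file" | tr -d '"' | head -n 1
}

# Hypothetical usage from the repository root:
# contract_value k3s_version lumie-infra/provision/ansible/group_vars/all.yml
# contract_value k3s_cluster_cidr lumie-infra/provision/ansible/group_vars/all.yml
```

The output can then be compared by hand against `kubectl get nodes -o wide` or the OCI security list rules.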
Topology
The intended topology in code is one server plus multiple agents across two OCI VCNs. The live cluster inspected on June 14, 2026 showed:
| Node | Role | Internal IP | Version |
|---|---|---|---|
| `k3s-master` | control-plane | 10.0.0.241 | v1.34.3+k3s1 |
| `k3s-worker-2` | worker | 10.0.0.2 | v1.34.3+k3s1 |
| `k3s-worker-3` | worker | 10.1.0.148 | v1.34.3+k3s1 |
| `k3s-worker-4` | worker | 10.1.0.9 | v1.34.3+k3s1 |
Treat that table as observed runtime state, not the desired-state schema. The desired-state schema is the Terraform node maps plus the Ansible inventory builder.
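One way to make the gap between desired and observed state visible is to diff two node-name lists. The sketch below assumes you can produce the expected list yourself (for example, by reading the Terraform node maps); `node_drift` is a hypothetical helper and the way the expected list is obtained is left open.

```shell
# Print node names that appear in exactly one of two newline-separated lists.
# "<" marks names only in the expected list, ">" names only in the observed list.
node_drift() {
  expected_file=$(mktemp)
  observed_file=$(mktemp)
  printf '%s\n' "$1" | sort -u > "$expected_file"
  printf '%s\n' "$2" | sort -u > "$observed_file"
  comm -23 "$expected_file" "$observed_file" | sed 's/^/< /'
  comm -13 "$expected_file" "$observed_file" | sed 's/^/> /'
  rm -f "$expected_file" "$observed_file"
}

# Hypothetical usage against the live cluster:
# observed=$(kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')
# node_drift "$expected" "$observed"
```

An empty result means both lists agree; any `<` or `>` line is a node to investigate.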
Runtime Flow
Entrypoints And Public Surface
| Surface | Source of truth | Notes |
|---|---|---|
| Kubernetes API | k3s-master on https://10.0.0.241:6443 | Workers join over the private address |
| Operator kubeconfig | playbooks/fetch-kubeconfig.yml | Exports k3s-public.yaml, k3s-private.yaml, and config symlink |
| Built-in Traefik addon | K3s addon plus platform/traefik-config/helmchartconfig.yaml | Active ingress controller in kube-system |
| CoreDNS | k3s addon config plus repo-managed DaemonSet override | One DNS pod per node |
| Container registry mirror | /etc/rancher/k3s/registries.yaml | Points pulls at in-cluster Zot |
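For orientation, the registry mirror file follows the K3s `registries.yaml` format, where each key under `mirrors` is a registry host and `endpoint` lists where containerd should pull from instead. The sketch below renders such a file; pairing `zot_mirror_host` with `zot_mirror_endpoint` in this way is an assumption, and `render_registries` is a hypothetical helper rather than the template the roles actually use.

```shell
# Render a containerd mirror config in the K3s registries.yaml format.
# $1 = registry host to mirror (assumed: zot_mirror_host)
# $2 = mirror endpoint URL     (assumed: zot_mirror_endpoint)
render_registries() {
  cat <<EOF
mirrors:
  "$1":
    endpoint:
      - "$2"
EOF
}

# Hypothetical usage:
# render_registries zot.lumie-infra.com http://zot.zot.svc.cluster.local:5000
```

Comparing this shape with the file on a node is a quick way to confirm which host containerd is redirecting.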
K3s Install And Join Behavior
The master role installs K3s only when the binary is missing, writes the registry mirror config before service start, then waits for /var/lib/rancher/k3s/server/node-token and the local API /healthz.
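The two waits described above amount to retrying a check until it passes. A minimal sketch of that pattern follows; `wait_until` is a hypothetical helper, the retry counts are illustrative, and the role's real tasks may implement the checks differently.

```shell
# Retry a command until it succeeds or the attempt budget is exhausted.
# Usage: wait_until <attempts> <delay_seconds> <command> [args...]
wait_until() {
  attempts="$1"
  delay="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Hypothetical readiness gate, run on the server node:
# wait_until 30 5 test -s /var/lib/rancher/k3s/server/node-token
# wait_until 30 5 k3s kubectl get --raw /healthz
```

Bounding the attempts keeps a failed install from hanging the play indefinitely.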
Workers do not carry static join tokens in Git. They:
- `slurp` the token from the master.
- Wait for `10.0.0.241:6443` to accept TCP connections.
- Install `k3s-agent` with `K3S_URL` and `K3S_TOKEN`.
- Poll `kubectl get nodes` on the master until the node registers.
That token flow is the main idempotency guard for worker rollouts: the token is read live from the cluster rather than copied into inventory.
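The join steps can be sketched outside Ansible as follows. This assumes the agent is installed with the upstream `get.k3s.io` script, which may not be how the role does it; `k3s_url`, `tcp_open`, and `join_agent` are hypothetical helpers, and the token is expected to be supplied at run time.

```shell
# Build the join URL from the master's private address.
k3s_url() {
  printf 'https://%s:6443\n' "$1"
}

# Succeed once a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp and coreutils timeout.
tcp_open() {
  timeout 2 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Join this node as an agent. The token is read from the master at run time
# and passed in as an argument, never stored in Git.
join_agent() {
  master_ip="$1"
  token="$2"
  version="$3"
  until tcp_open "$master_ip" 6443; do
    sleep 5
  done
  curl -sfL https://get.k3s.io |
    INSTALL_K3S_VERSION="$version" K3S_URL="$(k3s_url "$master_ip")" K3S_TOKEN="$token" sh -
}

# Hypothetical usage:
# join_agent 10.0.0.241 "$TOKEN" "v1.34.3+k3s1"
```

Registration would still be confirmed from the master with `kubectl get nodes`.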
Addon Boundaries
K3s ships some cluster services, but Lumie overrides pieces of them:
- Traefik is not installed as a standalone Helm release in this repo. The active object is the k3s addon patched by `HelmChartConfig`.
- CoreDNS still uses the `kube-system/coredns` ConfigMap supplied by the cluster, but the workload shape is replaced with a repo-managed DaemonSet.
- `local-path-provisioner` and `metrics-server` remain k3s-managed runtime components; they are visible in the live cluster but not versioned under `lumie-infra`.
Operational Notes
- Both master and worker roles template `/etc/rancher/k3s/registries.yaml` so containerd prefers the in-cluster Zot mirror. A broken Zot or Traefik path can therefore surface as node-level image pull failures.
- The master role adds both public and private IPs as TLS SANs, and `fetch-kubeconfig.yml` rewrites `127.0.0.1` to each address for operator access.
- Worker installation is serial per tenancy group to reduce join races and to isolate failures to a single node at a time.
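The kubeconfig rewrite is essentially a substitution on the `server:` line of the server-local file. A minimal sketch, assuming the stock K3s kubeconfig points at `https://127.0.0.1:6443`; `retarget_kubeconfig` is a hypothetical helper and not the playbook's own implementation.

```shell
# Rewrite the loopback API address in a kubeconfig to a reachable address.
# Reads kubeconfig YAML on stdin and writes the rewritten copy to stdout.
retarget_kubeconfig() {
  sed "s#https://127\.0\.0\.1:6443#https://$1:6443#"
}

# Hypothetical usage on the server node:
# sudo cat /etc/rancher/k3s/k3s.yaml | retarget_kubeconfig 10.0.0.241 > k3s-private.yaml
```

The target address must be one of the TLS SANs on the server certificate, otherwise `kubectl` will reject the connection.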
Contract Drift
Two inspected sources disagree with the active cluster behavior:
- `lumie-infra/provision/ansible/README.md` still says `Traefik | Disabled`, but `masters.yml` no longer passes `--disable traefik`, `platform/traefik-config/helmchartconfig.yaml` patches the bundled addon, and the live cluster has a running `traefik` pod in `kube-system`.
- `terraform.tfvars.example` still includes `worker-1`, but the live cluster inspected on June 14, 2026 does not.
When documenting or debugging Kubernetes behavior, trust the installed roles and live cluster over the stale README summary.
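The Traefik half of that drift can be rechecked mechanically. The sketch below assumes the flag would appear literally as `--disable traefik` or `--disable=traefik` in the file being checked; `disables_traefik` is a hypothetical helper, and other ways of disabling the addon would not be caught.

```shell
# Return success if the given file still passes a Traefik-disabling flag to K3s.
disables_traefik() {
  grep -Eq -e '--disable[ =]traefik' "$1"
}

# Hypothetical usage from the repository root:
# if disables_traefik lumie-infra/provision/ansible/group_vars/masters.yml; then
#   echo "masters.yml disables the bundled Traefik addon"
# else
#   echo "masters.yml leaves the bundled Traefik addon enabled"
# fi
```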
Failure Modes
| Failure point | Behavior |
|---|---|
| `k3s_master_private_ip` drift | Worker joins stall even if SSH to nodes still works |
| Zot mirror path unavailable | Pods can fail image pulls across the cluster because containerd prefers the mirror |
| Traefik/CoreDNS addon assumptions wrong | Operators can chase the wrong manifests if they assume these are fully standalone repo-managed installs |
| kubeconfig fetch skipped | Local kubectl access remains pointed at 127.0.0.1 from the server-local file |
Verification
```shell
kubectl get nodes -o wide
kubectl get pods -n kube-system
kubectl get helmchartconfig -n kube-system traefik -o yaml
rg -n "k3s_version|k3s_cluster_cidr|k3s_service_cidr|zot_mirror|tls-san" \
  lumie-infra/provision/ansible/group_vars \
  lumie-infra/provision/ansible/roles
```