Cluster Overview
Purpose
Lumie's infrastructure cluster is an OCI-hosted K3s environment bootstrapped by Terraform and Ansible, then operated through Argo CD App-of-Apps. This page is an overview for engineers who change cluster provisioning, bootstrap order, or shared platform infrastructure.
Use this page to orient before going deeper into Kubernetes, Terraform, Ansible, Services, or Networking Overview.
Source Paths
| Path | Role |
|---|---|
| lumie-infra/provision/terraform/ | OCI networking, compute, block volume, and load balancer provisioning |
| lumie-infra/provision/ansible/ | OS preparation, K3s installation, storage mount, kubeconfig fetch, and Argo CD bootstrap |
| lumie-infra/applications/kustomization.yaml | Top-level Argo CD application catalog, including cluster-bootstrap |
| lumie-infra/bootstrap/kustomization.yaml | Wave -1 foundation apps such as MinIO, Zot, Vault, and Gitea |
| lumie-infra/platform/kustomization.yaml | Cluster-wide platform apps such as CoreDNS config, Traefik config, cert-manager, RabbitMQ, and KEDA |
| lumie-infra/applications/cluster-bootstrap/ | Wave -2 cluster-scoped resources such as ClusterIssuer and StorageClass |
Ownership Boundaries
| Layer | Owner in repo | What it controls |
|---|---|---|
| OCI infrastructure | provision/terraform | VCNs, subnets, peering, instances, MinIO block volumes, and public NLBs |
| Node bootstrap | provision/ansible | Ubuntu prep, K3s install, registry mirror, MinIO disk mount, bootstrap secrets, and initial Argo CD install |
| Desired state after bootstrap | Argo CD applications in bootstrap/, platform/, storage/, security/, observability/, and applications/ | Namespaces, Helm releases, raw manifests, and ongoing self-heal |
| Product workloads | lumie-infra/applications/lumie/** plus app repos | Workload deployment shape, ingress, secrets wiring, scaling, and service exposure |
| Day-two verification | read-only kubectl against the live cluster | Confirms that Git state and live state still match |
The intended control flow is one-way: Terraform creates hosts and network, Ansible turns those hosts into a K3s cluster and seeds Argo CD, then Argo CD owns nearly every steady-state Kubernetes object.
Runtime Flow
Cluster Shape
The current design in code is:
- OCI across two tenancies, `0214` and `0213`, connected with local peering gateways.
- One K3s server in tenancy `0214` and workers spread across both tenancies.
- Public HTTPS ingress through an OCI layer-4 NLB that forwards `:443` to worker nodes.
- Cluster-scoped bootstrap resources applied before normal platform applications.
- Shared platform services such as Vault, RabbitMQ, observability, and CI/CD managed as Argo CD applications.
The live cluster inspected on June 14, 2026 reported four ready nodes: k3s-master, k3s-worker-2, k3s-worker-3, and k3s-worker-4. Treat the checked-in terraform.tfvars.example as illustrative, not authoritative, because it still includes worker-1.
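Because the example tfvars and the live cluster can disagree, it can help to compare an expected node list against what the API reports. The following is a minimal sketch, not a checked-in script: the function name `missing_nodes` and the variable `expected_nodes` are hypothetical, and the node names are copied from the June 14, 2026 inspection above.

```shell
# Expected node names, taken from the June 14, 2026 inspection (assumption:
# update this list whenever the intended cluster shape changes).
expected_nodes="k3s-master k3s-worker-2 k3s-worker-3 k3s-worker-4"

# missing_nodes reads live node names (one per line) on stdin and prints
# every expected node that does not appear in that input.
missing_nodes() {
  local live node
  live="$(cat)"
  for node in $expected_nodes; do
    # -x matches the whole line, -F treats the name as a literal string.
    if ! printf '%s\n' "$live" | grep -qxF "$node"; then
      printf '%s\n' "$node"
    fi
  done
}

# Read-only usage against a live cluster:
#   kubectl get nodes -o name | sed 's|^node/||' | missing_nodes
```

An empty result means every expected node is registered. It does not confirm readiness, so `kubectl get nodes -o wide` is still worth reading.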
Operational Notes
- The Kubernetes API join target is hard-coded in Ansible defaults as `10.0.0.241:6443`, so master private-IP changes must be reflected in inventory and group vars before worker rejoin operations.
- The local frontend development workflow does not run the Next.js app in-cluster. Only dev API ingress lives on `dev.lumie-infra.com`; the frontend itself runs locally through Tilt and HMR. See Tilt.
- Cluster bootstrap is split into two waves: `cluster-bootstrap` at sync wave `-2` for cluster-scoped prerequisites, then `bootstrap` at sync wave `-1` for MinIO, Zot, Vault, and Gitea.
- Argo CD self-heal means ad hoc `kubectl apply` changes to managed resources are expected to drift back to Git.
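The two-wave split can be checked from the repo by reading the standard Argo CD annotation `argocd.argoproj.io/sync-wave` out of the manifests. This is a sketch under assumptions: the helper name `list_sync_waves` is hypothetical, and which files actually carry the annotation in this repo has not been verified here.

```shell
# list_sync_waves prints "<wave> <file>" for each given manifest that carries
# the Argo CD sync-wave annotation, lowest wave (applied first) at the top.
# Files without the annotation are skipped.
list_sync_waves() {
  local file wave
  for file in "$@"; do
    # Take the first sync-wave annotation in the file and strip its quotes.
    wave="$(sed -n 's|^[[:space:]]*argocd\.argoproj\.io/sync-wave:[[:space:]]*||p' "$file" \
      | head -n 1 | tr -d "\"'")"
    if [ -n "$wave" ]; then
      printf '%s %s\n' "$wave" "$file"
    fi
  done | sort -n
}

# Example call (the glob is an assumption about where Application manifests live):
#   list_sync_waves lumie-infra/bootstrap/*/argocd.yaml
```

If the output does not show wave `-2` ahead of wave `-1`, the bootstrap order described above is not what Git currently declares.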
Contract Drift To Know About
Inspected sources disagree in a few places:
- `lumie-infra/README.md` still describes Kong as the ingress layer, but `lumie-infra/AGENTS.md`, the platform manifests, and the live cluster all show Traefik as the active ingress controller.
- `lumie-infra/README.md` still describes a five-node cluster and an older service inventory. The live cluster inspection on June 14, 2026 showed four ready nodes.
- The repo still contains `bootstrap/kong/`, but `lumie-infra/bootstrap/kustomization.yaml` does not register `bootstrap/kong/argocd.yaml`, and no live `argocd` Application or `kong` namespace exists.
These mismatches matter for operators because the repo contains both active desired state and retained legacy artifacts.
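Retained legacy directories like the Kong one can be found mechanically by listing every `argocd.yaml` that its sibling `kustomization.yaml` never mentions. The sketch below assumes that registrations appear as relative paths such as `kong/argocd.yaml`; the function name `unregistered_apps` is hypothetical.

```shell
# unregistered_apps prints each <app>/argocd.yaml under <dir> that is not
# referenced by any non-comment line of <dir>/kustomization.yaml.
unregistered_apps() {
  local dir="$1" manifest rel
  for manifest in "$dir"/*/argocd.yaml; do
    [ -e "$manifest" ] || continue   # the glob matched nothing
    rel="${manifest#"$dir"/}"        # e.g. kong/argocd.yaml
    # Ignore commented-out entries, then look for the literal relative path.
    if ! grep -v '^[[:space:]]*#' "$dir/kustomization.yaml" | grep -qF "$rel"; then
      printf '%s\n' "$rel"
    fi
  done
}

# Example call:
#   unregistered_apps lumie-infra/bootstrap
```

Anything it prints exists in Git but is never handed to Argo CD, so it is either legacy material or a missing registration.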
Failure Modes
| Failure point | Impact |
|---|---|
| Terraform networking drift | Workers may lose reachability to the master API or public ingress |
| Ansible bootstrap drift | New nodes may not join, MinIO disks may not mount, or Argo CD may not seed correctly |
| Vault unavailable during bootstrap | Downstream VaultStaticSecret consumers stay blocked even when their apps sync |
| Missing Argo CD app registration | A directory can exist in Git without ever reaching the cluster |
| Live hotfixes outside Git | Argo CD can overwrite them on the next sync |
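Several of these failure modes show up as Applications that are not in the `Synced` state. The filter below is a rough sketch that assumes the default table layout of `kubectl get applications` (name in the first column, sync status in the second); the function name `not_synced_apps` is hypothetical.

```shell
# not_synced_apps reads `kubectl get applications -n argocd` table output on
# stdin, skips the header row, and prints the name of every Application whose
# sync status column is anything other than "Synced".
not_synced_apps() {
  awk 'NR > 1 && $2 != "Synced" { print $1 }'
}

# Read-only usage against a live cluster:
#   kubectl get applications -n argocd | not_synced_apps
```

An empty result only shows that Argo CD considers its registered apps in sync. An app that was never registered will not appear in this output at all.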
Verification
Use both repo and cluster checks:
kubectl get nodes -o wide
kubectl get applications -n argocd
kubectl get ns
rg -n "cluster-bootstrap|bootstrap|platform|applications" \
  lumie-infra/applications/kustomization.yaml \
  lumie-infra/bootstrap/kustomization.yaml \
  lumie-infra/platform/kustomization.yaml
For provisioning changes, continue with the layer-specific pages: