Cluster Overview

Purpose

Lumie's infrastructure cluster is an OCI-hosted K3s environment bootstrapped by Terraform and Ansible, then operated through Argo CD App-of-Apps. This overview is written for engineers who change cluster provisioning, bootstrap order, or shared platform infrastructure.

Use this page to orient before going deeper into Kubernetes, Terraform, Ansible, Services, or Networking Overview.

Source Paths

| Path | Role |
| --- | --- |
| lumie-infra/provision/terraform/ | OCI networking, compute, block volume, and load balancer provisioning |
| lumie-infra/provision/ansible/ | OS preparation, K3s installation, storage mount, kubeconfig fetch, and Argo CD bootstrap |
| lumie-infra/applications/kustomization.yaml | Top-level Argo CD application catalog, including cluster-bootstrap |
| lumie-infra/bootstrap/kustomization.yaml | Wave -1 foundation apps such as MinIO, Zot, Vault, and Gitea |
| lumie-infra/platform/kustomization.yaml | Cluster-wide platform apps such as CoreDNS config, Traefik config, cert-manager, RabbitMQ, and KEDA |
| lumie-infra/applications/cluster-bootstrap/ | Wave -2 cluster-scoped resources such as ClusterIssuer and StorageClass |

Ownership Boundaries

| Layer | Owner in repo | What it controls |
| --- | --- | --- |
| OCI infrastructure | provision/terraform | VCNs, subnets, peering, instances, MinIO block volumes, and public NLBs |
| Node bootstrap | provision/ansible | Ubuntu prep, K3s install, registry mirror, MinIO disk mount, bootstrap secrets, and initial Argo CD install |
| Desired state after bootstrap | Argo CD applications in bootstrap/, platform/, storage/, security/, observability/, and applications/ | Namespaces, Helm releases, raw manifests, and ongoing self-heal |
| Product workloads | lumie-infra/applications/lumie/** plus app repos | Workload deployment shape, ingress, secrets wiring, scaling, and service exposure |
| Day-two verification | read-only kubectl against the live cluster | Confirms that Git state and live state still match |

The intended control flow is one-way: Terraform creates hosts and network, Ansible turns those hosts into a K3s cluster and seeds Argo CD, then Argo CD owns nearly every steady-state Kubernetes object.

Runtime Flow

Cluster Shape

The current design in code is:

  • OCI across two tenancies, 0214 and 0213, connected with local peering gateways.
  • One K3s server in tenancy 0214 and workers spread across both tenancies.
  • Public HTTPS ingress through an OCI layer-4 NLB that forwards :443 to worker nodes.
  • Cluster-scoped bootstrap resources applied before normal platform applications.
  • Shared platform services such as Vault, RabbitMQ, observability, and CI/CD managed as Argo CD applications.

The live cluster, inspected on June 14, 2026, reported four ready nodes: k3s-master, k3s-worker-2, k3s-worker-3, and k3s-worker-4. Treat the checked-in terraform.tfvars.example as illustrative, not authoritative, because it still lists worker-1, which is not among the ready nodes.
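A gap like the worker-1 mismatch can be spotted by comparing an expected node list with the live one. The following is a minimal sketch, not an existing repo script: the function name compare_node_lists and both input files are hypothetical, and the expected list is assumed to be written by hand, one node name per line.

```shell
# Compare two node lists (one name per line) and report names that appear
# in only one of them. Both file arguments are hypothetical inputs.
compare_node_lists() {
  expected_sorted=$(mktemp)
  live_sorted=$(mktemp)
  sort -u "$1" > "$expected_sorted"
  sort -u "$2" > "$live_sorted"
  # comm -23: lines unique to the first file; comm -13: unique to the second.
  comm -23 "$expected_sorted" "$live_sorted" | sed 's/^/only-expected: /'
  comm -13 "$expected_sorted" "$live_sorted" | sed 's/^/only-live: /'
  rm -f "$expected_sorted" "$live_sorted"
}

# Usage against a live cluster (read-only):
#   kubectl get nodes -o name | sed 's|^node/||' > live-nodes.txt
#   compare_node_lists expected-nodes.txt live-nodes.txt
```

An empty result means the two lists agree. Any only-expected line points at a node that is documented but not present in the cluster.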

Operational Notes

  • The Kubernetes API join target is hard-coded in Ansible defaults as 10.0.0.241:6443, so master private-IP changes must be reflected in inventory and group vars before worker rejoin operations.
  • The local frontend development workflow does not run the Next.js app in-cluster. Only dev API ingress lives on dev.lumie-infra.com; the frontend itself runs locally through Tilt and HMR. See Tilt.
  • Cluster bootstrap is split into two waves: cluster-bootstrap at sync wave -2 for cluster-scoped prerequisites, then bootstrap at sync wave -1 for MinIO, Zot, Vault, and Gitea.
  • Argo CD self-heal means ad hoc kubectl apply changes to managed resources are expected to drift back to Git.
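To see the two-wave bootstrap order in the repo, it can help to list every manifest that sets a sync wave, lowest wave first. This sketch assumes waves are declared with the standard Argo CD annotation argocd.argoproj.io/sync-wave and that the value is a bare or double-quoted integer. The function name list_sync_waves is hypothetical.

```shell
# Print "<wave> <file>" for every file under a directory that carries the
# Argo CD sync-wave annotation, sorted numerically by wave.
# Assumption: the value is an integer, optionally in double quotes.
list_sync_waves() {
  dir="$1"
  grep -rE 'argocd\.argoproj\.io/sync-wave:' "$dir" 2>/dev/null |
    sed -E 's/^([^:]+):.*sync-wave:[[:space:]]*"?(-?[0-9]+)"?.*$/\2 \1/' |
    sort -k1,1n -k2,2
}

# Usage from a checkout of the repo:
#   list_sync_waves lumie-infra/applications
```

If the layout matches the description above, cluster-bootstrap manifests should appear at wave -2 before the bootstrap apps at wave -1.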

Contract Drift To Know About

Inspected sources disagree in a few places:

  • lumie-infra/README.md still describes Kong as the ingress layer, but lumie-infra/AGENTS.md, the platform manifests, and the live cluster all show Traefik as the active ingress controller.
  • lumie-infra/README.md still describes a five-node cluster and an older service inventory. The live cluster inspection on June 14, 2026 showed four ready nodes.
  • The repo still contains bootstrap/kong/, but lumie-infra/bootstrap/kustomization.yaml does not register bootstrap/kong/argocd.yaml, and the live cluster has neither a Kong Application in Argo CD nor a kong namespace.

These mismatches matter for operators because the repo contains both active desired state and retained legacy artifacts.
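A retained directory such as bootstrap/kong/ can be detected mechanically. The sketch below assumes each app is registered in the parent kustomization.yaml by a path of the form <dir>/argocd.yaml, as the Kong example suggests. The function name find_unregistered_apps is hypothetical, and the match is a plain substring check, so treat the result as a hint rather than a verdict.

```shell
# Print "<dir>/argocd.yaml" for every app directory under a root whose
# manifest is not referenced from the root's kustomization.yaml.
# Commented-out entries in kustomization.yaml count as unregistered.
find_unregistered_apps() {
  root="$1"
  for manifest in "$root"/*/argocd.yaml; do
    [ -f "$manifest" ] || continue
    rel="${manifest#"$root"/}"
    if ! grep -v '^[[:space:]]*#' "$root/kustomization.yaml" | grep -qF -- "$rel"; then
      echo "$rel"
    fi
  done
}

# Usage from a checkout of the repo:
#   find_unregistered_apps lumie-infra/bootstrap
```

Any path it prints exists in Git but is not part of the catalog that Argo CD syncs from that kustomization.yaml.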

Failure Modes

| Failure point | Impact |
| --- | --- |
| Terraform networking drift | Workers may lose reachability to the master API or public ingress |
| Ansible bootstrap drift | New nodes may not join, MinIO disks may not mount, or Argo CD may not seed correctly |
| Vault unavailable during bootstrap | Downstream VaultStaticSecret consumers stay blocked even when their apps sync |
| Missing Argo CD app registration | A directory can exist in Git without ever reaching the cluster |
| Live hotfixes outside Git | Argo CD can overwrite them on the next sync |

Verification

Use both repo and cluster checks:

kubectl get nodes -o wide
kubectl get applications -n argocd
kubectl get ns
rg -n "cluster-bootstrap|bootstrap|platform|applications" \
    lumie-infra/applications/kustomization.yaml \
    lumie-infra/bootstrap/kustomization.yaml \
    lumie-infra/platform/kustomization.yaml
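
When the application list is long, a small filter can reduce it to the entries that need attention. This is a sketch: the function name filter_unhealthy_apps is hypothetical, and it assumes the Application objects expose .status.sync.status and .status.health.status, as standard Argo CD Applications do.

```shell
# Read "NAME SYNC HEALTH" rows on stdin and print only the applications
# that are not both Synced and Healthy.
filter_unhealthy_apps() {
  awk '$2 != "Synced" || $3 != "Healthy" { print $1 ": sync=" $2 ", health=" $3 }'
}

# Usage against a live cluster (read-only):
#   kubectl get applications -n argocd --no-headers \
#     -o custom-columns=NAME:.metadata.name,SYNC:.status.sync.status,HEALTH:.status.health.status \
#     | filter_unhealthy_apps
```

No output means every listed application reported Synced and Healthy at the time of the query.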

For provisioning changes, continue with the layer-specific pages: