Keycloak

Keycloak provides infrastructure SSO for internal operator tools. In the current repo state it primarily serves the infra realm, which backs applications such as Coder, Vault, Zot, Gitea, and RabbitMQ.

Responsibility

  • Run Keycloak in the keycloak namespace.
  • Serve OIDC endpoints directly at auth.lumie-edu.com for browser-based flows.
  • Store realm state in the shared infra-db PostgreSQL cluster.
  • Converge declarative realm configuration through a post-sync job instead of kc.sh import.

Source paths

| Path | Role |
|---|---|
| lumie-infra/security/keycloak/argocd.yaml | ArgoCD Application with chart values, common chart values, and extra manifests |
| lumie-infra/security/keycloak/helm-values.yaml | Server startup, DB, metrics, and health settings |
| lumie-infra/security/keycloak/common-values.yaml | External ingress, realm import ConfigMap, and Vault secret rendering |
| lumie-infra/security/keycloak/manifests/realm-sync-job.yaml | PostSync realm convergence job using keycloak-config-cli |
| lumie-infra/charts/common/templates/additional-ingresses.yaml | Template that renders auth.lumie-edu.com |
| lumie-infra/charts/common/templates/vault-static-secrets.yaml | Template used by the local Vault secret declarations |
| lumie-infra/security/teleport/agent/helm-values.yaml | Separate Teleport app entry for the Keycloak UI |

Public surface and contracts

| Surface | Contract |
|---|---|
| Direct OIDC ingress | https://auth.lumie-edu.com |
| In-cluster HTTP service | keycloak-keycloakx-http.keycloak.svc.cluster.local |
| Database | Shared infra-db, database keycloak, user keycloak |
| Admin bootstrap secret | keycloak-secrets |
| Realm sync secret | keycloak-oauth-secrets |
| Teleport app | keycloak |

The direct ingress matters because browser OIDC flows for clients such as Coder and Vault cannot terminate through the Teleport app proxy alone.
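One way to check that clients see the direct ingress rather than a proxied route is to look at the issuer that the discovery document advertises. The sketch below is only an illustration: the helper name discovery_issuer is hypothetical, it assumes the discovery JSON arrives on a single line (as Keycloak normally returns it), and it assumes the realm is configured so that its issuer is the public https://auth.lumie-edu.com URL.

```shell
# Hypothetical helper: print the "issuer" value from an OIDC discovery
# document read on stdin. Assumes the JSON is on one line.
discovery_issuer() {
  sed -n 's/.*"issuer"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# Usage (requires network access to the public ingress):
#   curl -sS https://auth.lumie-edu.com/realms/infra/.well-known/openid-configuration \
#     | discovery_issuer
```

If the printed issuer does not start with https://auth.lumie-edu.com, browser redirects for clients such as Coder and Vault are likely to point at a host they cannot reach.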

Runtime flow

Declared realm contract

The checked-in infra-realm.json defines:

  • realm infra;
  • realm roles admin and developer;
  • client scopes rabbitmq-admin and groups;
  • OIDC clients:
    • coder
    • vault
    • zot
    • gitea
    • rabbitmq
  • user bluemayne with realm roles admin and developer.

Client secrets are not hardcoded in Git. The keycloak-realm-sync job reads them from keycloak-oauth-secrets, substitutes them into the realm JSON, and applies the result through the live admin REST API.
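The substitution step can be pictured with a small shell sketch. This is not the job's actual implementation: the function name substitute_secret, the __CODER_CLIENT_SECRET__ placeholder syntax, and the CODER_CLIENT_SECRET variable name are all assumptions made for illustration, and the real job may rely on a different placeholder format.

```shell
# Hypothetical sketch: replace every literal occurrence of a placeholder
# with the value of an exported environment variable.
# Usage: substitute_secret PLACEHOLDER ENV_VAR_NAME < template > rendered
substitute_secret() {
  if [ -z "$1" ] || [ -z "$2" ]; then
    echo "usage: substitute_secret PLACEHOLDER ENV_VAR_NAME" >&2
    return 2
  fi
  # The value is read through ENVIRON so that characters such as "&" or "/"
  # are copied literally instead of being treated as replacement syntax.
  PLACEHOLDER="$1" VALUE_VAR="$2" awk '
    BEGIN {
      ph = ENVIRON["PLACEHOLDER"]
      name = ENVIRON["VALUE_VAR"]
      if (!(name in ENVIRON)) {
        # Fail loudly instead of rendering an empty client secret.
        print "missing environment variable: " name > "/dev/stderr"
        exit 3
      }
      val = ENVIRON[name]
    }
    {
      out = ""
      line = $0
      while ((i = index(line, ph)) > 0) {
        out = out substr(line, 1, i - 1) val
        line = substr(line, i + length(ph))
      }
      print out line
    }'
}

# Usage (hypothetical names):
#   export CODER_CLIENT_SECRET=...   # e.g. taken from keycloak-oauth-secrets
#   substitute_secret __CODER_CLIENT_SECRET__ CODER_CLIENT_SECRET \
#     < infra-realm.json > rendered-realm.json
```

The sketch does not JSON-escape the value, so a secret containing a double quote or a backslash would need extra handling. Failing on a missing variable is a deliberate choice: an empty secret would only surface later as invalid_client during login.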

Why realm sync is a Job

The repo intentionally avoids kc.sh import --override for realm management. The post-sync job in realm-sync-job.yaml exists because:

  • REST-based creation preserves Keycloak's standard client-scope initialization;
  • secret placeholders are substituted client-side before the realm JSON is written to the database;
  • reruns are idempotent and safe after partial failure.
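Because reruns are idempotent, retrying the job is a routine operation. A minimal sketch of one way to do it follows. The function name rerun_realm_sync and the KUBECTL/ARGOCD override variables are hypothetical, and the sketch assumes that a fresh sync of the keycloak application recreates the PostSync Job. If the hook already carries a delete policy that removes the old Job, the first step is simply a no-op.

```shell
# Hypothetical helper: re-run the realm sync job.
# KUBECTL and ARGOCD can be overridden (for example with "echo") for a dry run.
rerun_realm_sync() {
  kubectl_bin="${KUBECTL:-kubectl}"
  argocd_bin="${ARGOCD:-argocd}"

  # A finished Job cannot be restarted in place, so remove it first.
  "$kubectl_bin" delete job keycloak-realm-sync -n keycloak --ignore-not-found || return 1

  # Assumption: syncing the application re-triggers the PostSync hook.
  "$argocd_bin" app sync keycloak || return 1

  # Block until the new Job reports completion or the timeout expires.
  "$kubectl_bin" wait --for=condition=complete job/keycloak-realm-sync \
    -n keycloak --timeout=300s
}

# Dry run that only prints the commands' arguments:
#   (KUBECTL=echo; ARGOCD=echo; rerun_realm_sync)
```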

Failure behavior and operational risks

  • Missing OAuth secrets produce invalid_client failures during login even when the Keycloak pod itself is healthy.
  • If the PostSync job fails midway, the realm can be partially updated; the documented recovery path is to rerun the job, not to roll back the database manually.
  • Database bootstrap or admin-password secret failures block startup before realm sync runs.
  • Because direct ingress is separate from the Teleport app route, ingress or certificate failures can break OIDC while the Teleport-proxied UI still works.
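The failure modes above can be told apart with two HTTP status codes: one from GET /health on the in-cluster service and one from the public discovery endpoint (both are collected in the Verification section). The sketch below encodes that triage. The function name and the message labels are made up for illustration and are not part of the repo.

```shell
# Hypothetical triage helper.
# $1: HTTP status of GET /health on the in-cluster service (via port-forward)
# $2: HTTP status of the public OIDC discovery endpoint for realm infra
classify_keycloak_status() {
  health="$1"
  discovery="$2"
  if [ "$health" != "200" ]; then
    # Startup is blocked before realm sync can run.
    echo "keycloak-unhealthy: check database bootstrap and the admin secret"
  elif [ "$discovery" != "200" ]; then
    # The pod is fine, so the direct ingress, its certificate, or the realm is suspect.
    echo "ingress-or-realm: check the auth.lumie-edu.com ingress and the realm sync job"
  else
    # Both endpoints answer; remaining login failures point at client secrets.
    echo "ok: for invalid_client errors check keycloak-oauth-secrets"
  fi
}

# Usage:
#   classify_keycloak_status 200 503
```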

Contract drift

The checked-in files currently disagree about which realms exist:

  • security/keycloak/helm-values.yaml comments describe realms infra, lumie, and lumie-dev.
  • security/keycloak/manifests/realm-sync-job.yaml comments also say the ConfigMap mounts infra-realm.json, lumie-realm.json, and lumie-dev-realm.json.
  • The current security/keycloak/common-values.yaml only renders infra-realm.json, and the current Vault secret template only exposes the infra client secrets used by that file.

Treat the current repo state as infra-realm-only unless the additional realm JSON files are added back.
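This kind of drift can be spotted mechanically by comparing the realm file names one file mentions with those another file actually renders. The helpers below are a sketch with hypothetical names. They only match names of the form <something>-realm.json and know nothing about YAML structure, so comments and real keys are treated alike.

```shell
# Hypothetical helper: print the distinct *-realm.json names mentioned
# anywhere in the given files.
referenced_realm_files() {
  grep -oh '[a-z][a-z-]*-realm\.json' "$@" | LC_ALL=C sort -u
}

# Hypothetical helper: print names mentioned in file $1 that never appear in file $2.
unrendered_realm_files() {
  for name in $(referenced_realm_files "$1"); do
    referenced_realm_files "$2" | grep -qxF "$name" || echo "$name"
  done
}

# Usage, from the lumie-infra checkout:
#   unrendered_realm_files security/keycloak/manifests/realm-sync-job.yaml \
#                          security/keycloak/common-values.yaml
```

Given the mismatch described above, the expected output would be lumie-dev-realm.json and lumie-realm.json. Empty output would mean the two files agree again.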

Observability

  • serviceMonitor.enabled: true exposes Keycloak metrics to Prometheus.
  • health.enabled: true and the startup probe against /health cover boot readiness.
  • The keycloak-realm-sync job logs are the source of truth for realm convergence failures.

Verification

kubectl get applications.argoproj.io -n argocd keycloak
kubectl get statefulset,pods,svc,ingress,secrets -n keycloak
kubectl get jobs -n keycloak -l app.kubernetes.io/component=realm-sync
kubectl logs -n keycloak job/keycloak-realm-sync
# Keep the port-forward running in a second terminal while issuing the /health request.
kubectl port-forward -n keycloak svc/keycloak-keycloakx-http 8080:80
curl -sS -o /dev/null -w "%{http_code}\n" http://127.0.0.1:8080/health
curl -sS -o /dev/null -w "%{http_code}\n" \
  https://auth.lumie-edu.com/realms/infra/.well-known/openid-configuration

Success signals:

  • The keycloak Argo CD application is Healthy and Synced.
  • StatefulSet keycloak is ready, and the direct-ingress Ingress for auth.lumie-edu.com exists.
  • Job keycloak-realm-sync completes successfully and does not repeat invalid_client or import-substitution errors in its logs.
  • GET /health on the in-cluster service returns HTTP 200.
  • The public OIDC discovery endpoint for realm infra returns HTTP 200.