Observability Overview
Lumie's observability stack is GitOps-managed from lumie-infra/observability/**, with UI access mostly exposed through Teleport rather than public ingress.
Boundaries
| Surface | Primary source paths | Namespace | Notes |
|---|---|---|---|
| Prometheus | lumie-infra/observability/prometheus/** | prometheus | Rule evaluation, short local retention, OTLP receiver enabled. |
| Grafana | lumie-infra/observability/grafana/** | grafana | UI and datasource hub, backed by infra-db. |
| Loki | lumie-infra/observability/loki/** | loki | Short-retention log store; local filesystem storage on emptyDir. |
| Tempo | lumie-infra/observability/tempo/** | tempo | Short-retention trace store on emptyDir. |
| OpenTelemetry operator | lumie-infra/observability/opentelemetry-operator/** | opentelemetry-operator | Installs CRDs, webhooks, target allocator support. |
| OpenTelemetry collector | lumie-infra/observability/opentelemetry/** | opentelemetry | DaemonSet that scrapes, collects logs, and exports telemetry. |
| Alertmanager and Karma | lumie-infra/observability/alertmanager/** | alertmanager | Alert routing and UI. |
| Blackbox, Goldilocks, KSM, node-exporter, VPA, Thanos | lumie-infra/observability/* | per app | Auxiliary probes, exporters, and recommendation services. |
Runtime flow
Non-obvious platform decisions
- OpenTelemetry, not Prometheus, performs most workload scraping. The collector's Prometheus receiver uses the Target Allocator to read ServiceMonitor and PodMonitor resources and then exports metrics into Prometheus over OTLP HTTP.
- Prometheus keeps only three days of local retention and does not currently upload blocks to object storage.
- Thanos is therefore a query and deduplication layer today, not a long-term metrics archive.
- Loki and Tempo both run with emptyDir storage and about three days of retention, so pod replacement removes historical data.
- Teleport publishes operator UIs for Grafana, Prometheus, Alertmanager, Karma, Goldilocks, and other tools from lumie-infra/security/teleport/agent/helm-values.yaml.
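Because scrape targets come from ServiceMonitor and PodMonitor resources, a per-namespace count of those objects gives a quick picture of what the Target Allocator has to work with. The sketch below is illustrative only: the function name monitors_per_namespace is hypothetical, and it assumes the monitoring.coreos.com CRDs are installed and that kubectl points at the intended cluster.

```shell
# Hypothetical helper: count ServiceMonitor and PodMonitor objects per namespace.
# Assumes the monitoring.coreos.com CRDs exist in the cluster.
monitors_per_namespace() {
  # With -A and --no-headers, the first column of every row is the namespace.
  kubectl get servicemonitors.monitoring.coreos.com,podmonitors.monitoring.coreos.com \
    -A --no-headers |
    awk 'NF { count[$1]++ } END { for (ns in count) print ns, count[ns] }' |
    sort
}
```

Calling monitors_per_namespace prints one line per namespace with its monitor count. A namespace that is expected to be scraped but is missing from the output is a hint that its monitor resource was never created.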
Common failure patterns
- If the OpenTelemetry collector is unhealthy, metrics, logs, and traces all degrade at once because it is the fan-out hub.
- If Prometheus is healthy but rules fire unexpectedly or stop firing, verify that the collector is still scraping the expected ServiceMonitors, because rule inputs reach Prometheus through the collector.
- If Grafana dashboards load but show missing history, check whether the requested time range exceeds the local retention of Prometheus, Loki, or Tempo.
- If an operator UI is unreachable, verify Teleport app registration before assuming the underlying service is down.
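The retention check in the third bullet is simple arithmetic. A minimal sketch, assuming the roughly three-day figure from this page applies to all three stores; the variable and function names are hypothetical, and the actual configured retention should be read from the values under lumie-infra/observability/**.

```shell
# Assumed local retention in hours (about three days, per this page).
RETENTION_HOURS=$((3 * 24))

# Hypothetical helper: report whether a dashboard lookback (in hours)
# can be served from local Prometheus, Loki, or Tempo data.
check_lookback() {
  lookback_hours=$1
  if [ "$lookback_hours" -le "$RETENTION_HOURS" ]; then
    echo "within retention"
  else
    echo "exceeds retention by $((lookback_hours - RETENTION_HOURS))h"
  fi
}
```

For a seven-day dashboard range, check_lookback 168 prints "exceeds retention by 96h", which points at retention rather than a broken datasource.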
Verification
kubectl get applications.argoproj.io -n argocd | rg 'observability|prometheus|grafana|loki|tempo|opentelemetry|alertmanager|thanos'
kubectl get pods -n prometheus
kubectl get pods -n opentelemetry
kubectl get pods -n grafana
kubectl get pods -n loki
kubectl get pods -n tempo
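The per-namespace checks above can be folded into one pass that prints only the pods that need attention. This is a sketch, not an established script: unhealthy_pods and OBS_NAMESPACES are hypothetical names, the namespace list is taken from the single-namespace rows of the Boundaries table, and the filter flags any STATUS other than Running or Completed.

```shell
# Namespaces from the Boundaries table on this page.
OBS_NAMESPACES="prometheus grafana loki tempo opentelemetry-operator opentelemetry alertmanager"

# Hypothetical helper: print "<namespace>/<pod> <status>" for every pod whose
# STATUS column is neither Running nor Completed.
unhealthy_pods() {
  for ns in $OBS_NAMESPACES; do
    # Columns without headers: NAME READY STATUS RESTARTS AGE
    kubectl get pods -n "$ns" --no-headers |
      awk -v ns="$ns" 'NF && $3 != "Running" && $3 != "Completed" { print ns "/" $1, $3 }'
  done
}
```

Empty output from unhealthy_pods means no pod is in a state such as CrashLoopBackOff or Pending. It does not inspect the READY column, so a Running pod that fails its readiness probe still needs a separate look.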