Observability Overview

Lumie's observability stack is GitOps-managed from lumie-infra/observability/**, with UI access mostly exposed through Teleport rather than public ingress.

Boundaries

| Surface | Primary source paths | Namespace | Notes |
| --- | --- | --- | --- |
| Prometheus | `lumie-infra/observability/prometheus/**` | `prometheus` | Rule evaluation, short local retention, OTLP receiver enabled. |
| Grafana | `lumie-infra/observability/grafana/**` | `grafana` | UI and datasource hub, backed by `infra-db`. |
| Loki | `lumie-infra/observability/loki/**` | `loki` | Short-retention log store on local filesystem in `emptyDir`. |
| Tempo | `lumie-infra/observability/tempo/**` | `tempo` | Short-retention trace store on `emptyDir`. |
| OpenTelemetry operator | `lumie-infra/observability/opentelemetry-operator/**` | `opentelemetry-operator` | Installs CRDs, webhooks, target allocator support. |
| OpenTelemetry collector | `lumie-infra/observability/opentelemetry/**` | `opentelemetry` | DaemonSet that scrapes, collects logs, and exports telemetry. |
| Alertmanager and Karma | `lumie-infra/observability/alertmanager/**` | `alertmanager` | Alert routing and UI. |
| kube-state-metrics and node-exporter | `lumie-infra/observability/*` | per app | Kubernetes object and node metrics exporters. |

Runtime flow

Non-obvious platform decisions

OpenTelemetry, not Prometheus, performs most workload scraping. The collector's Prometheus receiver uses the Target Allocator to read ServiceMonitor and PodMonitor resources and then exports metrics into Prometheus over OTLP HTTP.
Prometheus keeps only three days of local retention and does not currently upload blocks to object storage.
Thanos is disabled for the 6Gi node footprint plan, so Grafana reads Prometheus directly.
Loki and Tempo both run with emptyDir storage and about three days of retention, so pod replacement removes historical data.
Teleport publishes the operator UIs for Grafana, Prometheus, Alertmanager, Karma, and other active tools, as configured in lumie-infra/security/teleport/agent/helm-values.yaml.
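
Because Loki and Tempo keep data on emptyDir, the history actually available is bounded by the age of the current pod as well as by retention. A minimal sketch of that reasoning, assuming retention of roughly 72 hours as described above; the function name `effective_history_hours` is hypothetical and not part of the repository:

```shell
# effective_history_hours RETENTION_HOURS POD_AGE_HOURS
# Prints the smaller of the two values. With emptyDir storage nothing older
# than the current pod survives, even if retention would still allow it.
effective_history_hours() {
  local retention="$1" pod_age="$2"
  if [ "$pod_age" -lt "$retention" ]; then
    echo "$pod_age"
  else
    echo "$retention"
  fi
}
```

For example, `effective_history_hours 72 10` describes a pod that was replaced ten hours ago. The pod start time can be read with `kubectl get pod <pod> -n loki -o jsonpath='{.status.startTime}'`, where `<pod>` is a placeholder for the actual pod name.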

Common failure patterns

If the OpenTelemetry collector is unhealthy, metrics, logs, and traces all degrade at once because it is the fan-out hub.
If Prometheus is healthy but alert rules fire unexpectedly or stay silent, verify whether the collector is still scraping the right ServiceMonitors.
If Grafana dashboards load but show missing history, check whether the requested time range exceeds the local retention of Prometheus, Loki, or Tempo.
If an operator UI is unreachable, verify Teleport app registration before assuming the underlying service is down.
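
The missing-history case can be ruled in or out before looking at any component. The sketch below compares a relative time range against local retention. It assumes about 72 hours of retention, per the notes above, and a hypothetical input format of `<n>h` or `<n>d`; both function names are illustrative only.

```shell
# Assumption: about three days of local retention. Adjust to the deployed value.
RETENTION_HOURS=72

# range_to_hours RANGE
# Converts a relative range such as "12h" or "7d" to whole hours.
# Only the h and d suffixes are handled in this sketch.
range_to_hours() {
  local range="$1"
  local value="${range%[hd]}"
  case "$range" in
    *h) echo "$value" ;;
    *d) echo $((value * 24)) ;;
    *)  echo "unsupported range: $range" >&2; return 1 ;;
  esac
}

# exceeds_retention RANGE
# Exits 0 when RANGE is longer than RETENTION_HOURS, 1 when it is not,
# and 2 when RANGE cannot be parsed.
exceeds_retention() {
  local hours
  hours="$(range_to_hours "$1")" || return 2
  [ "$hours" -gt "$RETENTION_HOURS" ]
}
```

If `exceeds_retention 7d` succeeds, gaps at the start of the dashboard range are expected and do not by themselves indicate a fault.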

Verification

```shell
kubectl get applications.argoproj.io -n argocd | rg 'observability|prometheus|grafana|loki|tempo|opentelemetry|alertmanager'
kubectl get pods -n prometheus
kubectl get pods -n opentelemetry
kubectl get pods -n grafana
kubectl get pods -n loki
kubectl get pods -n tempo
```
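
The per-namespace pod checks can be condensed into a summary. This is a rough sketch rather than an existing script: `count_unhealthy` and `check_namespaces` are hypothetical helpers, and the check only looks at the STATUS column, so a pod that is Running but not Ready is not flagged.

```shell
# count_unhealthy
# Reads `kubectl get pods --no-headers` output on stdin and prints how many
# pods have a STATUS (third column) other than Running or Completed.
count_unhealthy() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# check_namespaces NS...
# Prints one "namespace: N unhealthy" line per namespace.
# Requires kubectl access to the cluster.
check_namespaces() {
  local ns
  for ns in "$@"; do
    printf '%s: %s unhealthy\n' "$ns" \
      "$(kubectl get pods -n "$ns" --no-headers 2>/dev/null | count_unhealthy)"
  done
}
```

A typical call would be `check_namespaces prometheus opentelemetry grafana loki tempo alertmanager`, using the namespaces listed under Boundaries.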
