Observability Overview

Lumie's observability stack is GitOps-managed from lumie-infra/observability/**, with UI access mostly exposed through Teleport rather than public ingress.

Boundaries

| Surface | Primary source paths | Namespace | Notes |
| --- | --- | --- | --- |
| Prometheus | lumie-infra/observability/prometheus/** | prometheus | Rule evaluation, short local retention, OTLP receiver enabled. |
| Grafana | lumie-infra/observability/grafana/** | grafana | UI and datasource hub, backed by infra-db. |
| Loki | lumie-infra/observability/loki/** | loki | Short-retention log store on local filesystem in emptyDir. |
| Tempo | lumie-infra/observability/tempo/** | tempo | Short-retention trace store on emptyDir. |
| OpenTelemetry operator | lumie-infra/observability/opentelemetry-operator/** | opentelemetry-operator | Installs CRDs, webhooks, target allocator support. |
| OpenTelemetry collector | lumie-infra/observability/opentelemetry/** | opentelemetry | DaemonSet that scrapes, collects logs, and exports telemetry. |
| Alertmanager and Karma | lumie-infra/observability/alertmanager/** | alertmanager | Alert routing and UI. |
| Blackbox, Goldilocks, KSM, node-exporter, VPA, Thanos | lumie-infra/observability/* | per app | Auxiliary probes, exporters, and recommendation services. |

Runtime flow

Non-obvious platform decisions

  • OpenTelemetry, not Prometheus, performs most workload scraping. The collector's Prometheus receiver uses the Target Allocator to read ServiceMonitor and PodMonitor resources and then exports metrics into Prometheus over OTLP HTTP.
  • Prometheus keeps only three days of local retention and does not currently upload blocks to object storage.
  • Thanos is therefore a query and deduplication layer today, not a long-term metrics archive.
  • Loki and Tempo both run with emptyDir storage and about three days of retention, so pod replacement removes historical data.
  • Teleport publishes operator UIs for Grafana, Prometheus, Alertmanager, Karma, Goldilocks, and other tools from lumie-infra/security/teleport/agent/helm-values.yaml.
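
To look at the scrape path from the first bullet on the cluster side, a sketch like the following lists the monitor resources the Target Allocator would read, along with the collector custom resources. It assumes the Prometheus Operator CRDs (monitoring.coreos.com) and the OpenTelemetry operator CRD (opentelemetry.io) are installed, and that the collector runs in the opentelemetry namespace named in the Boundaries table. The function name is hypothetical and not part of the repository.

```shell
# Illustrative helper (hypothetical name): list the resources involved in
# the collector-driven scrape path. Assumes the Prometheus Operator and
# OpenTelemetry operator CRDs are installed in the cluster.
list_scrape_sources() {
  # Default namespace is taken from the Boundaries table; override via $1.
  local collector_ns="${1:-opentelemetry}"

  echo "== ServiceMonitors (all namespaces) =="
  kubectl get servicemonitors.monitoring.coreos.com --all-namespaces

  echo "== PodMonitors (all namespaces) =="
  kubectl get podmonitors.monitoring.coreos.com --all-namespaces

  echo "== OpenTelemetryCollector resources in ${collector_ns} =="
  kubectl get opentelemetrycollectors.opentelemetry.io -n "${collector_ns}"
}
```

Calling `list_scrape_sources` with no argument uses the default namespace; pass a different namespace as the first argument if the collector lives elsewhere.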

Common failure patterns

  • If the OpenTelemetry collector is unhealthy, metrics, logs, and traces all degrade at once because it is the fan-out hub.
  • If Prometheus is healthy but alerts fire unexpectedly or stay silent, verify that the collector is still scraping the expected ServiceMonitors, since missing series change what the rules evaluate.
  • If Grafana dashboards load but show missing history, check whether the requested time range exceeds the local retention of Prometheus, Loki, or Tempo.
  • If an operator UI is unreachable, verify Teleport app registration before assuming the underlying service is down.
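
The missing-history check can be reduced to simple arithmetic. The sketch below compares a requested dashboard range against a retention window. The 72-hour default is an assumption derived from the "about three days" figure stated earlier, not a value read from the cluster, and the function name is hypothetical.

```shell
# Illustrative helper (hypothetical name): report whether a requested time
# range reaches past local retention. The 72-hour default is an assumption
# based on the "about three days" retention described in this page.
# Returns 0 when the range exceeds retention, 1 otherwise.
exceeds_retention() {
  local range_hours="$1"
  local retention_hours="${2:-72}"

  if [ "${range_hours}" -gt "${retention_hours}" ]; then
    echo "range ${range_hours}h exceeds retention ${retention_hours}h: expect missing history"
    return 0
  fi
  echo "range ${range_hours}h is within retention ${retention_hours}h"
  return 1
}
```

For example, `exceeds_retention 168` (a seven-day dashboard range) reports that the range exceeds the assumed 72-hour window, while `exceeds_retention 24` reports that it is within it.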

Verification

kubectl get applications.argoproj.io -n argocd | rg 'observability|prometheus|grafana|loki|tempo|opentelemetry|alertmanager|thanos'
kubectl get pods -n prometheus
kubectl get pods -n opentelemetry
kubectl get pods -n grafana
kubectl get pods -n loki
kubectl get pods -n tempo
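
As a complement to the per-namespace listings above, a loop such as the following narrows the output to pods that are not in the Running phase. This is a sketch: the namespace list is copied from the Boundaries table, the function name is hypothetical, and the selector also matches completed (Succeeded) pods.

```shell
# Illustrative helper (hypothetical name): show pods that are not in the
# Running phase across the observability namespaces from the Boundaries
# table. Completed (Succeeded) pods also match this field selector.
not_running_pods() {
  local ns
  for ns in prometheus opentelemetry opentelemetry-operator grafana loki tempo alertmanager; do
    echo "== ${ns} =="
    kubectl get pods -n "${ns}" --field-selector=status.phase!=Running
  done
}
```

A pod whose containers are crash-looping can still report the Running phase, so this filter does not replace reading the full pod listing.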