Prometheus

Prometheus is deployed from the kube-prometheus-stack chart, but Lumie uses it differently from a default installation: most workload scraping is delegated to the OpenTelemetry collector, and Prometheus mainly stores, evaluates, and serves metrics.

Source paths

  • lumie-infra/observability/prometheus/argocd.yaml
  • lumie-infra/observability/prometheus/helm-values.yaml
  • lumie-infra/observability/prometheus/common-values.yaml

Runtime role

  • Runs one Prometheus replica in the prometheus namespace.
  • Enables the native OTLP receiver with enableOTLPReceiver: true.
  • Keeps only 3d of local retention.
  • Enables the Thanos sidecar service, but does not configure object-store upload.
  • Sends alerts to the separate Alertmanager deployment in the alertmanager namespace.

Key contract

These values define the most important behavior:

enableOTLPReceiver: true
retention: 3d
serviceMonitorSelector:
  matchLabels:
    scrape-by: prometheus-only

Source path: lumie-infra/observability/prometheus/helm-values.yaml

With this selector, Prometheus itself scrapes only ServiceMonitors that carry the label scrape-by: prometheus-only. Broad ServiceMonitor discovery happens in the OpenTelemetry collector, not here.
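
As an illustration of the selector contract above, here is a minimal sketch of a ServiceMonitor that Prometheus would select directly. Only the scrape-by: prometheus-only label comes from the values shown above. The resource name, namespace, Service label, and port name are hypothetical. Whether Prometheus looks in that namespace also depends on the namespace selector, which is not shown here.

```yaml
# Hypothetical ServiceMonitor: only the scrape-by label is taken from helm-values.yaml.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-exporter          # hypothetical name
  namespace: example              # hypothetical namespace
  labels:
    scrape-by: prometheus-only    # matches serviceMonitorSelector.matchLabels
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-exporter   # hypothetical Service label
  endpoints:
    - port: metrics               # hypothetical named port on the Service
      interval: 30s
```

A ServiceMonitor without that label is ignored by Prometheus and left to the collector's discovery.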

What it stores and serves

  • OTLP-exported metrics from the OpenTelemetry collector
  • default kube-prometheus alerting and recording rules, with some noisy rule groups disabled
  • custom Lumie rule groups for:
    • OOM events
    • grading queues and latency
    • report generation queues and latency
    • analysis worker errors and latency
    • chatbot stream health
    • CNPG backup health
    • MinIO replication lag or failure

Dependencies

  • the OpenTelemetry collector for most target scraping
  • Alertmanager for notification delivery
  • Teleport for operator UI access
  • VaultStaticSecret rendering for registry credentials and some secret-backed config

Current limitations

  • Local retention is only three days.
  • Thanos object-store upload is intentionally disabled, so the repo currently configures no long-term archive of Prometheus blocks.
  • Some control-plane scrape components of the chart, such as kubeApiServer and kubeEtcd, are explicitly disabled to reduce series count and memory pressure.
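
For reference, disabling those control-plane components in a kube-prometheus-stack values file usually looks like the fragment below. This is a sketch based on the upstream chart's top-level keys, not a copy of the Lumie values, and the actual file may nest them under an umbrella key.

```yaml
# Sketch using upstream kube-prometheus-stack value keys; placement in the Lumie values files is assumed.
kubeApiServer:
  enabled: false   # no API server ServiceMonitor or related series
kubeEtcd:
  enabled: false   # no etcd ServiceMonitor or related series
```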

Failure modes

  • If the OpenTelemetry exporter to Prometheus breaks, Prometheus stays healthy while workload metrics silently stop arriving.
  • If a team assumes default kube-prometheus ServiceMonitor behavior, they will expect Prometheus to scrape their targets directly and miss that Lumie has effectively inverted the scrape model: the collector scrapes, and Prometheus receives.
  • If memory pressure grows, first check series-cardinality changes in ServiceMonitor targets or alert-rule growth before raising retention.
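
The silent-failure case in the first bullet can be caught with a freshness alert on a metric that only arrives over OTLP. The following is a minimal sketch, not an existing Lumie rule. The rule name, alert name, and the metric example_workload_requests_total are hypothetical placeholders, and the rule only loads if it matches the rule selector configured for this Prometheus.

```yaml
# Hypothetical PrometheusRule: all names and the metric are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: otlp-ingest-freshness     # hypothetical name
  namespace: prometheus
spec:
  groups:
    - name: otlp-ingest.rules
      rules:
        - alert: OtlpWorkloadMetricsAbsent
          # Returns a result only when no sample of the metric exists in the last 10 minutes.
          expr: absent_over_time(example_workload_requests_total[10m])
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: No OTLP-delivered workload samples received for 10 minutes
```

Choose a metric that every healthy environment emits continuously. A metric that is legitimately idle at night would make this alert noisy.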

Verification

kubectl get applications.argoproj.io -n argocd prometheus
kubectl get pods -n prometheus
kubectl get prometheusrules -n prometheus
kubectl get servicemonitors -A
kubectl describe pod -n prometheus prometheus-prometheus-kube-prometheus-prometheus-0

Observability

  • Grafana uses Thanos as the default Prometheus-compatible datasource and keeps direct Prometheus as a secondary datasource.
  • Alerting routes are defined in Alertmanager.
  • The OTLP ingest path and target-discovery path are described in OpenTelemetry.