Prometheus

Prometheus is deployed from the kube-prometheus-stack chart, but Lumie uses it differently from a default installation: most workload scraping is delegated to the OpenTelemetry collector, and Prometheus mainly stores, evaluates, and serves metrics.

Source paths

lumie-infra/observability/prometheus/argocd.yaml
lumie-infra/observability/prometheus/helm-values.yaml
lumie-infra/observability/prometheus/common-values.yaml

Runtime role

Runs one Prometheus replica in the prometheus namespace.
Enables the native OTLP receiver with enableOTLPReceiver: true.
Keeps only 3d of local retention.
Serves Grafana directly; Thanos is disabled.
Sends alerts to the separate Alertmanager deployment in the alertmanager namespace.
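
For orientation, the collector side of the OTLP path could look roughly like the fragment below. This is a sketch, not the Lumie collector configuration: the exporter name, Service hostname, and port are assumptions. Only the /api/v1/otlp base path reflects how Prometheus exposes its native OTLP receiver.

```yaml
# OpenTelemetry Collector config fragment (illustrative only).
exporters:
  otlphttp/prometheus:                # hypothetical exporter name
    # Assumed in-cluster Service name and port for Prometheus.
    # The otlphttp exporter appends /v1/metrics to this base URL,
    # which yields the Prometheus OTLP path /api/v1/otlp/v1/metrics.
    endpoint: http://prometheus-kube-prometheus-prometheus.prometheus.svc:9090/api/v1/otlp
# The exporter only takes effect once a metrics pipeline lists it
# under service.pipelines.metrics.exporters.
```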

Key contract

These values define the most important behavior:

enableOTLPReceiver: true
retention: 3d
serviceMonitorSelector:
  matchLabels:
    scrape-by: prometheus-only

Source path: lumie-infra/observability/prometheus/helm-values.yaml

This means Prometheus itself scrapes only ServiceMonitors that carry the scrape-by: prometheus-only label. Broad ServiceMonitor discovery happens in the OpenTelemetry collector, not here.
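
As an illustration of the selector above, a ServiceMonitor that Prometheus would pick up directly might look like the following. The name, namespace, Service label, and port name are hypothetical. The sketch also assumes that no namespace selector excludes the chosen namespace.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-direct-scrape            # hypothetical name
  namespace: example                     # hypothetical namespace
  labels:
    scrape-by: prometheus-only           # matches serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example    # hypothetical Service label
  endpoints:
    - port: metrics                      # hypothetical named Service port
      interval: 30s
```

A ServiceMonitor without the scrape-by: prometheus-only label is ignored by Prometheus and left to the collector's discovery.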

What it stores and serves

OTLP-exported metrics from the OpenTelemetry collector
default kube-prometheus alerting and recording rules, with some noisy rule groups disabled
custom Lumie rule groups for:
- OOM events
- grading queues and latency
- report generation queues and latency
- analysis worker errors and latency
- chatbot stream health
- CNPG backup health
- MinIO replication lag or failure

Dependencies

the OpenTelemetry collector for most target scraping
Alertmanager for notification delivery
Teleport for operator UI access
VaultStaticSecret rendering for registry credentials and some secret-backed config

Current limitations

Local retention is only three days.
Thanos is intentionally disabled, so there is no repo-backed long-term Prometheus block archive today.
Some Kubernetes control-plane scrape targets, such as kubeApiServer and kubeEtcd, are explicitly disabled to reduce series count and memory pressure.
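
In kube-prometheus-stack values, disabling those two components is typically expressed as shown below. The exact nesting inside Lumie's helm-values.yaml is an assumption; a wrapper chart would place these keys under the subchart name.

```yaml
# kube-prometheus-stack values fragment (illustrative only).
kubeApiServer:
  enabled: false    # no API server scrape target or ServiceMonitor
kubeEtcd:
  enabled: false    # no etcd scrape target or ServiceMonitor
```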

Failure modes

If the OpenTelemetry exporter to Prometheus breaks, Prometheus stays healthy while workload metrics silently stop arriving.
If a team assumes default kube-prometheus ServiceMonitor behavior, they will miss that Lumie has effectively inverted the scrape model through the collector.
If memory pressure grows, first check series-cardinality changes in ServiceMonitor targets or alert-rule growth before raising retention.
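
The first and third failure modes can be surfaced with alert rules. The PrometheusRule below is a minimal sketch under stated assumptions: lumie_example_requests_total is a placeholder for any metric that normally arrives over OTLP, the series threshold is arbitrary, and the second rule assumes Prometheus scrapes its own metrics. The object may also need whatever labels the ruleSelector in this deployment expects.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: otlp-ingest-watchdog             # hypothetical name
  namespace: prometheus
spec:
  groups:
    - name: otlp-ingest.rules            # hypothetical group name
      rules:
        # Fires when a metric expected from the collector has no samples.
        - alert: WorkloadMetricsAbsent
          expr: absent(lumie_example_requests_total)   # placeholder metric
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "No samples for a metric that normally arrives over OTLP"
        # Fires when the number of in-memory series exceeds a chosen ceiling.
        - alert: PrometheusHeadSeriesHigh
          expr: prometheus_tsdb_head_series > 2000000  # arbitrary threshold
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Head series count is above the expected ceiling"
```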

Verification

kubectl get applications.argoproj.io -n argocd prometheus
kubectl get pods -n prometheus
kubectl get prometheusrules -n prometheus
kubectl get servicemonitors -A
kubectl describe pod -n prometheus prometheus-prometheus-kube-prometheus-prometheus-0

Observability

Grafana queries Prometheus directly as its default metrics datasource.
Alerting routes are defined in Alertmanager.
The OTLP ingest path and target-discovery path are described in OpenTelemetry.
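
For reference, a direct Grafana datasource of the kind described above has roughly this shape in Grafana's provisioning format. The datasource name and URL are assumptions, and the chart may generate an equivalent definition automatically.

```yaml
# Grafana datasource provisioning fragment (illustrative only).
apiVersion: 1
datasources:
  - name: Prometheus                     # hypothetical display name
    type: prometheus
    access: proxy
    # Assumed in-cluster Service name and port for Prometheus.
    url: http://prometheus-kube-prometheus-prometheus.prometheus.svc:9090
    isDefault: true
```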
