Prometheus
Prometheus is deployed from the kube-prometheus-stack chart, but Lumie uses it differently from a default installation: most workload scraping is delegated to the OpenTelemetry collector, and Prometheus mainly stores, evaluates, and serves metrics.
Source paths
- lumie-infra/observability/prometheus/argocd.yaml
- lumie-infra/observability/prometheus/helm-values.yaml
- lumie-infra/observability/prometheus/common-values.yaml
Runtime role
- Runs one Prometheus replica in the prometheus namespace.
- Enables the native OTLP receiver with enableOTLPReceiver: true.
- Keeps only 3d of local retention.
- Enables the Thanos sidecar service, but does not configure object-store upload.
- Sends alerts to the separate Alertmanager deployment in the alertmanager namespace.
Key contract
These values define the most important behavior:
enableOTLPReceiver: true
retention: 3d
serviceMonitorSelector:
  matchLabels:
    scrape-by: prometheus-only
Source path: lumie-infra/observability/prometheus/helm-values.yaml
That means Prometheus itself only scrapes a narrow set of explicitly labeled targets. Broad ServiceMonitor discovery happens in OpenTelemetry, not here.
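For a target that Prometheus should scrape directly, the ServiceMonitor has to carry the label that the selector matches. The manifest below is a minimal sketch, not a copy of anything in the repo: the name example-exporter, the namespace, the Service selector, and the port name are all hypothetical placeholders. Only the scrape-by: prometheus-only label comes from the values shown above.

```yaml
# Hypothetical ServiceMonitor; every name here is a placeholder
# except the scrape-by label, which comes from serviceMonitorSelector.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-exporter          # hypothetical
  namespace: prometheus           # assumed; use the namespace your setup watches
  labels:
    scrape-by: prometheus-only    # required for Prometheus to select this monitor
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-exporter   # hypothetical Service label
  endpoints:
    - port: metrics               # hypothetical named port on the Service
      interval: 30s
```

A ServiceMonitor without this label is ignored by Prometheus and is left to the OpenTelemetry collector's discovery path. If a namespace selector is also configured, the monitor must satisfy that as well.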
What it stores and serves
- OTLP-exported metrics from the OpenTelemetry collector
- default kube-prometheus rules and recordings, with some noisy rule groups disabled
- custom Lumie rule groups for:
- OOM events
- grading queues and latency
- report generation queues and latency
- analysis worker errors and latency
- chatbot stream health
- CNPG backup health
- MinIO replication lag or failure
Dependencies
- the OpenTelemetry collector for most target scraping
- Alertmanager for notification delivery
- Teleport for operator UI access
- VaultStaticSecret rendering for registry credentials and some secret-backed config
Current limitations
- Local retention is only three days.
- Thanos object-store upload is intentionally disabled, so there is no repo-backed long-term Prometheus block archive today.
- Some kube control-plane metrics such as kubeApiServer and kubeEtcd are explicitly disabled to reduce series count and memory pressure.
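In kube-prometheus-stack, control-plane components such as kubeApiServer and kubeEtcd are toggled with top-level enabled flags in the chart values. The fragment below is a sketch of that pattern; the exact placement and any sibling settings in Lumie's values files are assumptions, so check the source paths listed above before relying on it.

```yaml
# Sketch of the chart-level toggles; the real values file may set more keys.
kubeApiServer:
  enabled: false   # skips the API server ServiceMonitor and its series
kubeEtcd:
  enabled: false   # skips the etcd ServiceMonitor and its series
```

Re-enabling either component adds series, so memory headroom should be reviewed first.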
Failure modes
- If the OpenTelemetry exporter to Prometheus breaks, Prometheus stays healthy while workload metrics silently stop arriving.
- If a team assumes default kube-prometheus ServiceMonitor behavior, they will miss that Lumie has effectively inverted the scrape model through the collector.
- If memory pressure grows, first check series-cardinality changes in ServiceMonitor targets or alert-rule growth before raising retention.
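The silent-failure case in the first item can be guarded with an absence alert on a metric that only arrives through the OTLP path. The rule below is a minimal sketch under stated assumptions: example_workload_requests_total is a hypothetical metric name, the rule and alert names are invented for illustration, and the 10m window and severity are arbitrary. The PrometheusRule may also need whatever labels the configured rule selector expects.

```yaml
# Sketch only: metric name, rule name, and thresholds are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: otlp-ingest-staleness     # hypothetical
  namespace: prometheus
spec:
  groups:
    - name: otlp-ingest
      rules:
        - alert: OtlpWorkloadMetricsMissing
          # absent() returns a result only when no series with this name exists,
          # which happens once the last ingested samples have gone stale.
          expr: absent(example_workload_requests_total)
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: No OTLP workload metrics received for 10 minutes
```

Choosing a metric that every healthy environment always emits keeps this rule from firing during normal quiet periods.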
Verification
kubectl get applications.argoproj.io -n argocd prometheus
kubectl get pods -n prometheus
kubectl get prometheusrules -n prometheus
kubectl get servicemonitors -A
kubectl describe pod -n prometheus prometheus-prometheus-kube-prometheus-prometheus-0
Observability
- Grafana uses Thanos as the default Prometheus-compatible datasource and keeps direct Prometheus as a secondary datasource.
- Alerting routes are defined in Alertmanager.
- The OTLP ingest path and target-discovery path are described in OpenTelemetry.