Thanos

Thanos is deployed in Lumie, but only the Query component of the usual Thanos stack is active. Today it serves as a query layer in front of the Prometheus sidecars, not as a long-term metrics archive.

Source paths

  • lumie-infra/observability/thanos/argocd.yaml
  • lumie-infra/observability/thanos/helm-values.yaml
  • lumie-infra/observability/thanos/common-values.yaml
  • lumie-infra/observability/prometheus/helm-values.yaml

Active components

  • query.enabled: true
  • queryFrontend.enabled: false
  • storegateway.enabled: false
  • compactor.enabled: false
  • receive.enabled: false
  • ruler.enabled: false
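
To confirm which components are switched on without reading the values file by hand, something like the following could be used. This is a minimal sketch: the helper name enabled_components is hypothetical, and it assumes helm-values.yaml uses block-style YAML with two-space indentation and an enabled flag directly under each component key, which may not match the real file.

```shell
# Hypothetical helper: print every top-level key whose nested "enabled" flag is true.
# Assumes two-space indentation and block-style YAML, which may not match the real file.
enabled_components() {
  awk '
    # A top-level key with no inline value opens a new section, e.g. "query:".
    /^[A-Za-z][A-Za-z0-9_]*:[ \t]*$/ { sub(/:.*/, ""); section = $0; next }
    # Any other top-level line (scalar key, document marker) closes the section.
    /^[^ \t#]/ { section = ""; next }
    # A direct child "enabled: true" marks the section as active.
    section != "" && /^  enabled:[ \t]*true[ \t]*$/ { print section }
  '
}

# Usage (path taken from the source list above):
# enabled_components < lumie-infra/observability/thanos/helm-values.yaml
```

If the file matches the flags listed above and the layout assumption holds, the only component printed should be query.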

What it really does today

  • Grafana uses Thanos as its default Prometheus-compatible datasource.
  • Query uses DNS discovery against prometheus-kube-prometheus-thanos-discovery in the prometheus namespace.
  • Deduplication is enabled through --query.replica-label=prometheus_replica. Lumie currently runs a single Prometheus replica, so there are no duplicate series to merge yet.
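
The discovery target and replica label described above can be read back from the running Deployment. The sketch below is illustrative: query_discovery_flags is a hypothetical helper, and whether the DNS target appears under --endpoint or the older --store flag depends on the Thanos version in use, so the filter accepts both.

```shell
# Hypothetical helper: keep only the discovery and deduplication flags
# from a list of container arguments supplied one per line on stdin.
query_discovery_flags() {
  grep -E -- '^--(query\.replica-label|endpoint|store)='
}

# Usage (assumes the Query container is the first container in the Deployment):
# kubectl get deploy -n thanos thanos-query \
#   -o jsonpath='{range .spec.template.spec.containers[0].args[*]}{@}{"\n"}{end}' \
#   | query_discovery_flags
```

The output should include the prometheus_replica label and a DNS entry that points at prometheus-kube-prometheus-thanos-discovery in the prometheus namespace.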

Current limitation

existingObjstoreSecret: thanos-objstore-secret is still rendered in the Thanos values, but the Prometheus values no longer set objectStorageConfig, and both Thanos Store Gateway and Compactor are disabled. In practice that means:

  • no long-term block upload from Prometheus
  • no historical object-store reads from Store Gateway
  • no compaction or downsampling jobs

Treat the deployment as query-only unless the repo re-enables those pieces.
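
One way to check the upload side of this on the live cluster is to look at the Prometheus custom resource. This is a sketch under assumptions: it presumes Prometheus is managed by the Prometheus Operator (so that spec.thanos.objectStorageConfig is where upload would be configured), and upload_status is a hypothetical helper name.

```shell
# Hypothetical helper: read jsonpath output on stdin and report whether
# an object storage configuration is present. Empty output means the field is unset.
upload_status() {
  if [ -z "$(tr -d '[:space:]')" ]; then
    echo "no-upload"
  else
    echo "upload-configured"
  fi
}

# Usage (assumes Prometheus Operator custom resources in the prometheus namespace):
# kubectl get prometheus -n prometheus \
#   -o jsonpath='{.items[*].spec.thanos.objectStorageConfig}' | upload_status
```

A result of no-upload would be consistent with the query-only state described here. Anything else suggests the sidecar has been given an object store again and this page needs updating.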

Failure modes

  • Teams may assume that running Thanos implies long retention; the current configuration does not provide it.
  • If Prometheus or its sidecar service is unavailable, Thanos Query immediately has nothing to query, because there is no Store Gateway to fall back on.
  • Because Grafana defaults to Thanos, datasource errors there can look like broad metrics outages even when Prometheus itself is healthy.
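
When dashboards go blank, probing Prometheus and Thanos Query separately helps tell the last two cases apart. The following is a rough sketch: diagnose is a hypothetical helper, PROM_URL and THANOS_URL are placeholders (for example, local kubectl port-forward targets), and it relies only on the /-/ready endpoint that both Prometheus and Thanos Query expose.

```shell
# Hypothetical helper: interpret two HTTP status codes from readiness probes.
# $1 is the Prometheus status code, $2 is the Thanos Query status code.
diagnose() {
  prom_code=$1
  thanos_code=$2
  if [ "$prom_code" != "200" ]; then
    echo "prometheus-not-ready"
  elif [ "$thanos_code" != "200" ]; then
    echo "thanos-query-not-ready"
  else
    echo "both-ready"
  fi
}

# Usage (PROM_URL and THANOS_URL are placeholders, not values from the repo):
# diagnose \
#   "$(curl -s -o /dev/null -w '%{http_code}' "$PROM_URL/-/ready")" \
#   "$(curl -s -o /dev/null -w '%{http_code}' "$THANOS_URL/-/ready")"
```

Prometheus is checked first because Thanos Query has no other data source in this setup; a thanos-query-not-ready result with Prometheus ready points at the datasource path rather than at metrics collection.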

Verification

kubectl get applications.argoproj.io -n argocd thanos
kubectl get pods -n thanos
kubectl describe deploy -n thanos thanos-query
kubectl get svc -n prometheus | rg thanos