Kube State Metrics

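Kube State Metrics (KSM) is a service that continuously watches the Kubernetes API server and exposes the state of cluster objects (pods, deployments, services, and so on) as metrics. Unlike Node Exporter, which provides node-level performance metrics, KSM focuses on the state and metadata of Kubernetes objects.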

Deployment configuration

Namespace: kube-state-metrics
Chart: kube-state-metrics v5.25.1 (prometheus-community)
Image: zot.lumie-infra.com/kube-state-metrics/kube-state-metrics:v2.13.0
Replicas: 1 (KSM is stateless, but a single instance is recommended to avoid duplicate metrics)

Architecture

Service configuration

The Service is headless, so its DNS name resolves directly to the pod IP instead of a virtual cluster IP:

service:
  type: ClusterIP
  clusterIP: None

Prometheus integration

ServiceMonitor configuration

prometheus:
  monitor:
    enabled: true
    additionalLabels:
      release: prometheus
    namespace: prometheus
    relabelings:
      - targetLabel: cluster
        replacement: "mayne-cluster"

Every KSM metric receives the label cluster: mayne-cluster, so series remain distinguishable in a multi-cluster environment.

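Metric label remapping

KSM reports each object's own namespace, pod, and container as labels. Those names collide with the target labels that Prometheus attaches to the KSM scrape target, so unless honorLabels is enabled Prometheus stores the KSM values as exported_namespace, exported_pod, and exported_container. For compatibility with Grafana dashboards, the rules below copy those values back into the standard labels: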

metricRelabelings:
  - sourceLabels: [exported_namespace]
    targetLabel: namespace
    regex: (.+)
    replacement: ${1}
  - sourceLabels: [exported_pod]
    targetLabel: pod
    regex: (.+)
    replacement: ${1}
  - sourceLabels: [exported_container]
    targetLabel: container
    regex: (.+)
    replacement: ${1}
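
To make the effect of these rules concrete, here is a rough shell sketch that mimics them on plain series text. It is an illustration only and not how Prometheus applies relabeling internally. The function name copy_exported_labels and the sample series are hypothetical, and label values are assumed to contain no escaped quotes.

```shell
# Hypothetical helper: imitates the three metricRelabelings rules on
# series lines read from stdin. Assumes label values have no escaped quotes.
copy_exported_labels() {
  awk '
    {
      n = split("namespace pod container", names, " ")
      for (i = 1; i <= n; i++) {
        name = names[i]
        # Mirrors regex (.+): act only when exported_<name> is non-empty.
        if (!match($0, "[{,]exported_" name "=\"[^\"]+\"")) continue
        value = substr($0, RSTART + length(name) + 12, RLENGTH - length(name) - 13)
        if (match($0, "[{,]" name "=\"[^\"]*\"")) {
          # Target label already exists: overwrite its value.
          $0 = substr($0, 1, RSTART) name "=\"" value "\"" substr($0, RSTART + RLENGTH)
        } else {
          # Target label is missing: add it right after the opening brace.
          brace = index($0, "{")
          $0 = substr($0, 1, brace) name "=\"" value "\"," substr($0, brace + 1)
        }
      }
      print
    }
  '
}
```

Feeding it a series whose namespace label points at the KSM pod and whose exported_namespace label points at the workload should print the same series with namespace set to the workload's namespace. The exported_* labels are left in place, as the rules above do not drop them.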

Key metrics

Pod status

Metric	Description
`kube_pod_status_phase`	Current pod phase (Pending/Running/Succeeded/Failed/Unknown)
`kube_pod_container_status_restarts_total`	Number of container restarts
`kube_pod_container_status_ready`	Whether the container is Ready
`kube_pod_container_status_last_terminated_reason`	Reason for the last termination (OOMKilled, etc.)

Deployment status

Metric	Description
`kube_deployment_status_replicas_available`	Number of available replicas
`kube_deployment_status_replicas_ready`	Number of Ready replicas
`kube_deployment_spec_replicas`	Desired number of replicas

Node status

Metric	Description
`kube_node_status_condition`	Node condition status (Ready/MemoryPressure, etc.)
`kube_node_status_allocatable`	Allocatable resources on the node
`kube_node_spec_unschedulable`	Whether the node is marked unschedulable

PVC/storage

Metric	Description
`kube_persistentvolumeclaim_status_phase`	PVC phase (Bound/Pending/Lost)
`kube_persistentvolume_status_phase`	PV phase

Usage examples

OOM kill detection

The ContainerOOMKilled alert rule in Prometheus uses KSM metrics:

increase(kube_pod_container_status_restarts_total[10m]) > 0
and on (namespace, pod, container)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
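
When the alert does not fire as expected, it can help to look at the second half of the expression in isolation. The sketch below filters raw KSM exposition text for containers whose last termination reason is OOMKilled. The function name oom_killed_containers is hypothetical, and it reads the namespace, pod, and container labels as KSM itself emits them (before any Prometheus relabeling), assuming label values contain no escaped quotes.

```shell
# Hypothetical helper: reads KSM exposition text on stdin and prints
# namespace/pod/container for every container last terminated by the OOM killer.
oom_killed_containers() {
  awk '
    # Return the value of one label on the current line, or "" if absent.
    function label(name) {
      if (match($0, "[{,]" name "=\"[^\"]*\""))
        return substr($0, RSTART + length(name) + 3, RLENGTH - length(name) - 4)
      return ""
    }
    /^kube_pod_container_status_last_terminated_reason[{]/ && /reason="OOMKilled"/ && $NF == 1 {
      print label("namespace") "/" label("pod") "/" label("container")
    }
  '
}
```

Assuming the KSM endpoint is reachable on localhost:8080 through a port-forward, it could be used as `curl -s http://localhost:8080/metrics | oom_killed_containers`.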

Checking deployment availability

# Ratio of available replicas to desired replicas
kube_deployment_status_replicas_available
  / kube_deployment_spec_replicas
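
One caveat with this ratio is a deployment scaled to zero, where PromQL evaluates 0 / 0 to NaN. The sketch below computes the same ratio from raw KSM exposition text and prints n/a for that case, which can be handy for a quick check without Prometheus. The function name deployment_availability is hypothetical, and label values are assumed to contain no escaped quotes.

```shell
# Hypothetical helper: reads KSM exposition text on stdin and prints
# "<namespace>/<deployment> <available/desired>" for each deployment.
deployment_availability() {
  awk '
    # Return the value of one label on the current line, or "" if absent.
    function label(name) {
      if (match($0, "[{,]" name "=\"[^\"]*\""))
        return substr($0, RSTART + length(name) + 3, RLENGTH - length(name) - 4)
      return ""
    }
    /^kube_deployment_spec_replicas[{]/ {
      desired[label("namespace") "/" label("deployment")] = $NF + 0
    }
    /^kube_deployment_status_replicas_available[{]/ {
      available[label("namespace") "/" label("deployment")] = $NF + 0
    }
    END {
      for (key in desired) {
        if (desired[key] > 0) {
          printf "%s %.2f\n", key, available[key] / desired[key]
        } else {
          # Scaled to zero: the ratio is undefined, so report n/a.
          printf "%s n/a\n", key
        }
      }
    }
  ' | sort
}
```

As with the query above, a value below 1.00 means fewer replicas are available than the spec asks for.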

Pod restart frequency

# Top 10 containers by restart count over the past hour
topk(10, increase(kube_pod_container_status_restarts_total[1h]))

Resource settings

resources:
  requests:
    cpu: 15m
    memory: 100Mi
  limits:
    memory: 100Mi  # no CPU limit (stability first)

Troubleshooting

When metrics are not being collected

# Check pod status
kubectl get pods -n kube-state-metrics

# Check the ServiceMonitor
kubectl get servicemonitor -n prometheus kube-state-metrics

# Check the metrics endpoint directly
# (port-forward blocks, so run curl from a second terminal)
kubectl port-forward -n kube-state-metrics svc/kube-state-metrics 8080:8080
curl -s http://localhost:8080/metrics | grep kube_pod_status_phase
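
The grep output above lists one line per pod and phase, which gets long on a busy cluster. As a rough sketch, the helper below condenses it into a count of pods per current phase. The function name pod_phase_counts is hypothetical; it relies on KSM setting the series for a pod's current phase to 1 and the other phases to 0.

```shell
# Hypothetical helper: reads KSM exposition text on stdin and prints
# "<phase> <number of pods currently in that phase>", sorted by phase.
pod_phase_counts() {
  awk '
    /^kube_pod_status_phase[{]/ && $NF == 1 {
      if (match($0, "[{,]phase=\"[^\"]*\"")) {
        # Skip the delimiter plus phase=" (8 characters) and the closing quote.
        phase = substr($0, RSTART + 8, RLENGTH - 9)
        count[phase]++
      }
    }
    END { for (p in count) print p, count[p] }
  ' | sort
}
```

With the port-forward running, `curl -s http://localhost:8080/metrics | pod_phase_counts` would print one line per phase that has at least one pod. An empty result suggests KSM is not exporting pod metrics at all.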

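Label conflicts

# Check whether exported_* labels and the standard labels both appear
# (promtool needs the server URL; http://localhost:9090 assumes the default Prometheus port)
kubectl exec -n prometheus prometheus-kube-prometheus-prometheus-0 -- \
  promtool query instant http://localhost:9090 'kube_pod_info{pod="<pod-name>"}'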
