KEDA

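KEDA (Kubernetes Event-driven Autoscaling) handles autoscaling of event-driven workloads on the Lumie platform. It adjusts the number of pods automatically, from zero up to n, based on triggers such as message queue length and external metrics.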

Architecture

Components

KEDA consists of three main components:

Component	Role
`keda-operator`	Watches ScaledObject/ScaledJob resources and manages the corresponding HPAs
`keda-metrics-apiserver`	Exposes external metrics through the Kubernetes External Metrics API
`keda-admission-webhooks`	Validates ScaledObjects on creation and update

Deployment details

Namespace: keda-system
Chart version: 2.16.1 (kedacore/keda)
Image registry: zot.lumie-infra.com (internal mirror)

Configuration

Helm Values

image:
  keda:
    registry: zot.lumie-infra.com
    repository: kedacore/keda
    tag: "2.16.1"
  metricsApiServer:
    registry: zot.lumie-infra.com
    repository: kedacore/keda-metrics-apiserver
    tag: "2.16.1"
  webhooks:
    registry: zot.lumie-infra.com
    repository: kedacore/keda-admission-webhooks
    tag: "2.16.1"

Resource allocation

Requests are based on the VPA lowerBound recommendation, and no CPU limits are set:

resources:
  operator:
    requests:
      cpu: 15m
      memory: 100Mi
    limits:
      memory: 100Mi
  metricServer:
    requests:
      cpu: 15m
      memory: 100Mi
    limits:
      memory: 138Mi
  webhooks:
    requests:
      cpu: 15m
      memory: 100Mi
    limits:
      memory: 100Mi

Metrics API Server probes

metricsServer:
  livenessProbe:
    initialDelaySeconds: 25
    periodSeconds: 10
    timeoutSeconds: 5
    failureThreshold: 3
  readinessProbe:
    initialDelaySeconds: 20
    periodSeconds: 3
    timeoutSeconds: 5
    failureThreshold: 3

ArgoCD deployment

KEDA is deployed with the ArgoCD multi-source pattern. The Helm chart comes from the official chart repository, and the values file is referenced from the lumie-infra repository:

sources:
  - repoURL: https://kedacore.github.io/charts
    chart: keda
    targetRevision: 2.16.1
    helm:
      valueFiles:
        - $values/platform/keda/helm-values.yaml
  - repoURL: https://github.com/Lumie-Edu/lumie-infra.git
    targetRevision: main
    ref: values

Sync policy

syncPolicy:
  automated:
    prune: true
    selfHeal: true
  syncOptions:
    - CreateNamespace=true
    - ServerSideApply=true
  retry:
    limit: 5
    backoff:
      duration: 5s
      factor: 2
      maxDuration: 3m
  managedNamespaceMetadata:
    labels:
      goldilocks.fairwinds.com/enabled: "true"

The ServerSideApply=true option is required to avoid field-manager conflicts on the KEDA CRDs.
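The two fragments above are pieces of a single ArgoCD Application. As a rough sketch of how they might fit together, the manifest below wraps them in an Application envelope. The Application name, the `argocd` namespace, the `default` project, and the in-cluster destination server are assumptions for illustration and are not taken from the lumie-infra repository.

```yaml
# Sketch only: metadata, project, and destination.server are assumed values.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: keda                 # assumed Application name
  namespace: argocd          # assumed ArgoCD namespace
spec:
  project: default           # assumed project
  destination:
    server: https://kubernetes.default.svc   # assumed in-cluster destination
    namespace: keda-system
  sources:
    # Source 1: the Helm chart, with values resolved from the "values" ref.
    - repoURL: https://kedacore.github.io/charts
      chart: keda
      targetRevision: 2.16.1
      helm:
        valueFiles:
          - $values/platform/keda/helm-values.yaml
    # Source 2: the Git repository that provides the values file.
    - repoURL: https://github.com/Lumie-Edu/lumie-infra.git
      targetRevision: main
      ref: values
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
      - ServerSideApply=true
    managedNamespaceMetadata:
      labels:
        goldilocks.fairwinds.com/enabled: "true"
```

The `ref: values` entry is what makes the `$values/` prefix in `valueFiles` resolvable. `managedNamespaceMetadata` only takes effect together with `CreateNamespace=true`.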

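Current ScaledObject configuration

lumie-backend (CPU-based, guards against over-scaling during warm-up)

JVM initialization (JIT compilation, class loading, Hibernate startup) causes a CPU spike. To keep that spike from adding unnecessary pods, the scaleUp behavior combines a stabilizationWindowSeconds of 120 with a policy that adds at most 1 pod per minute: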

# applications/lumie/backend/manifests/scaled-object.yaml
spec:
  minReplicaCount: 2
  maxReplicaCount: 5
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 120   # scale up only if the load persists for 120s
          policies:
            - type: Pods
              value: 1
              periodSeconds: 60             # at most 1 pod per minute
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"

lumie-frontend (CPU-based)

# applications/lumie/frontend/manifests/scaled-object.yaml
spec:
  minReplicaCount: 2
  maxReplicaCount: 5
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"
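The ScaledObject fragments in this section show only the `spec` fields that differ between services. For orientation, a complete manifest might look roughly like the sketch below. The `metadata` values and the `scaleTargetRef` Deployment name are assumptions for illustration; the actual names live in the referenced manifest files.

```yaml
# Sketch only: metadata and scaleTargetRef values are hypothetical.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: lumie-frontend        # assumed name
  namespace: lumie            # assumed namespace
spec:
  scaleTargetRef:
    name: lumie-frontend      # assumed Deployment name (kind defaults to Deployment)
  minReplicaCount: 2
  maxReplicaCount: 5
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "70"           # target average CPU utilization, in percent of requests
```

The cpu trigger measures utilization relative to the containers' CPU requests, so the target pods need CPU requests set for this trigger to work.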

grading-svc (RabbitMQ queue-based)

# applications/lumie/worker/grading-svc/manifests/scaled-object.yaml
spec:
  minReplicaCount: 4
  maxReplicaCount: 5
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: grading.omr-request
        mode: QueueLength
        value: "40"
      authenticationRef:
        name: rabbitmq-auth
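The trigger above gets its connection details through `authenticationRef`, which points at a TriggerAuthentication named `rabbitmq-auth`. That resource is not shown in this document. A minimal sketch, assuming the AMQP connection string is stored in a Secret, could look like this. The Secret name `rabbitmq-conn`, its `host` key, the namespace, and the connection string are all hypothetical placeholders.

```yaml
# Sketch only: Secret name, key, namespace, and URL are hypothetical.
apiVersion: v1
kind: Secret
metadata:
  name: rabbitmq-conn         # hypothetical Secret name
  namespace: lumie            # assumed; must match the ScaledObject's namespace
stringData:
  host: amqp://user:password@rabbitmq.example.svc:5672/vhost   # placeholder URL
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth         # the name referenced by authenticationRef
  namespace: lumie            # assumed; must match the ScaledObject's namespace
spec:
  secretTargetRef:
    - parameter: host         # fills the rabbitmq trigger's "host" parameter
      name: rabbitmq-conn
      key: host
```

A TriggerAuthentication is namespaced, so it has to live in the same namespace as the ScaledObject that references it.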

Operations

Goldilocks VPA integration

Goldilocks is enabled on the keda-system namespace, so VPA resource recommendations are available for it:

goldilocks.fairwinds.com/enabled: "true"
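With this label in place, Goldilocks is expected to create VerticalPodAutoscaler objects for the workloads in the namespace in recommendation-only mode. As an illustration of what one such object might look like for the operator, see the sketch below. The `goldilocks-` name prefix is an assumption about Goldilocks naming and should be checked against the cluster.

```yaml
# Sketch only: the object name is an assumption about Goldilocks' naming.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: goldilocks-keda-operator   # assumed name
  namespace: keda-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: keda-operator
  updatePolicy:
    updateMode: "Off"              # recommend only; never evict or resize pods
```

In `Off` mode the VPA only publishes recommendations in its status, including the lowerBound values that the requests in the resource allocation section are based on.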

Troubleshooting

If KEDA scaling is not working, check the following:

Check the ScaledObject status

kubectl describe scaledobject <name> -n <namespace>

Check the KEDA Operator logs

kubectl logs -n keda-system -l app=keda-operator

Check the Metrics API Server

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .
