Tempo

Tempo is the distributed tracing system for the Lumie infrastructure. It traces request flows across applications and services and is used to identify performance bottlenecks.

Architecture

Deployment

Namespace: tempo
Chart: tempo v1.17.0
Image: zot.lumie-infra.com/grafana/tempo:2.10.0
Mode: single binary (monolithic)
Replicas: 1

Tracing Pipeline

Receiver Configuration

OTLP Protocol

Tempo receives trace data over OTLP, the standard OpenTelemetry protocol:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
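
To make the HTTP receiver concrete, the following is a minimal sketch of an OTLP/HTTP JSON request body containing a single span. The function name, the scope name, and the timestamps are hypothetical; only the field layout follows the OTLP JSON encoding (hex-encoded IDs, nanosecond timestamps as strings).

```javascript
// Sketch: build a minimal OTLP/HTTP JSON body for one span.
// buildOtlpTracePayload and all sample values are illustrative assumptions.
const crypto = require('crypto');

function buildOtlpTracePayload(serviceName, spanName, startMs, endMs) {
  return {
    resourceSpans: [{
      resource: {
        attributes: [{ key: 'service.name', value: { stringValue: serviceName } }],
      },
      scopeSpans: [{
        scope: { name: 'manual-test' },
        spans: [{
          traceId: crypto.randomBytes(16).toString('hex'), // 32 hex characters
          spanId: crypto.randomBytes(8).toString('hex'),   // 16 hex characters
          name: spanName,
          kind: 1, // SPAN_KIND_INTERNAL
          // 64-bit integers are carried as strings in OTLP JSON
          startTimeUnixNano: String(BigInt(startMs) * 1000000n),
          endTimeUnixNano: String(BigInt(endMs) * 1000000n),
        }],
      }],
    }],
  };
}

const payload = buildOtlpTracePayload('my-service', 'smoke-test', 1700000000000, 1700000000250);
console.log(JSON.stringify(payload, null, 2));
```

A body like this would be sent with POST and Content-Type: application/json to the /v1/traces path on port 4318. The host name to use depends on the Service in the tempo namespace and is not assumed here.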

Service Configuration

service:
  type: ClusterIP
  # Ports 4317 (gRPC), 4318 (HTTP)

Storage Configuration

Local Filesystem

For simplicity, traces are stored on the local filesystem:

storage:
  trace:
    backend: local
    local:
      path: /var/tempo/traces
    wal:
      path: /var/tempo/wal

Data Retention

retention: 72h  # Keep traces for 3 days

Volume Configuration

persistence:
  enabled: false  # Use emptyDir

extraVolumes:
  - name: data
    emptyDir: {}
extraVolumeMounts:
  - name: data
    mountPath: /var/tempo

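Metrics Generator

Trace-Derived Metrics

The metrics generator is currently disabled. If it is enabled while the Prometheus remote-write receiver is not, the WAL can grow without bound and cause an OOM. It will be re-enabled once service instrumentation is complete and Prometheus runs with the --web.enable-remote-write-receiver option: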

metricsGenerator:
  enabled: false
  # remoteWriteUrl: http://prometheus-kube-prometheus-prometheus.prometheus.svc:9090/api/v1/write

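Resource Configuration

Compute Resources

Single-binary mode runs the distributor, ingester, querier, and compactor in one process, so baseline memory usage is high. GOMEMLIMIT is set so that the Go garbage collector operates within the container memory limit:

resources:
  requests:
    cpu: 15m
    memory: 256Mi
  limits:
    memory: 512Mi  # No CPU limit (stability first)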

extraEnv:
  - name: GOMEMLIMIT
    value: "460MiB"

priorityClassName: medium-priority
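
The GOMEMLIMIT value above is roughly 90% of the 512Mi container limit. The helper below is a small sketch of that arithmetic; the 0.9 fraction is an assumption inferred from those two numbers, not a documented rule, and the function name is hypothetical.

```javascript
// Sketch: derive a GOMEMLIMIT string from a container memory limit in MiB.
// The default fraction (0.9) is an assumption, chosen to leave headroom
// for memory that the Go runtime does not account for.
function goMemLimit(limitMi, fraction = 0.9) {
  return `${Math.floor(limitMi * fraction)}MiB`;
}

console.log(goMemLimit(512)); // prints "460MiB"
```

If the memory limit is raised, the same calculation gives a starting point for the new GOMEMLIMIT value.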

Monitoring

ServiceMonitor

Prometheus scrapes Tempo's own metrics:

serviceMonitor:
  enabled: true
  additionalLabels:
    release: prometheus

Key Metrics

tempo_ingester_traces_created_total: number of traces created
tempo_ingester_blocks_flushed_total: number of blocks flushed
tempo_request_duration_seconds: request processing time
tempo_ingester_live_traces: number of live traces
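
When checking these metrics by hand, it can help to total a metric across its label sets. The sketch below parses Prometheus text exposition output and sums one metric; the function name and the sample lines (including the label values) are made up for illustration.

```javascript
// Sketch: sum all samples of one metric in Prometheus text exposition format.
function sumMetric(text, metricName) {
  let total = 0;
  for (const line of text.split('\n')) {
    if (line.startsWith('#') || line.trim() === '') continue; // skip comments and blanks
    // A sample line looks like: name{label="v"} 12   or   name 12
    const match = line.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/);
    if (match && match[1] === metricName) total += Number(match[3]);
  }
  return total;
}

// Hypothetical scrape output; real label names and values may differ.
const sample = [
  '# TYPE tempo_ingester_live_traces gauge',
  'tempo_ingester_live_traces{tenant="a"} 40',
  'tempo_ingester_live_traces{tenant="b"} 2',
  'tempo_ingester_traces_created_total{tenant="a"} 1000',
].join('\n');

console.log(sumMetric(sample, 'tempo_ingester_live_traces')); // prints 42
```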

Trace Queries

Accessing via Grafana

Tempo is accessed mainly through Grafana's Explore view:

Select Explore in Grafana
Switch the data source to Tempo
Enter a trace ID, or search by service/operation

Trace ID Search

# Search by a specific trace ID
# Enter the trace ID in Grafana Explore

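Service Map

Grafana can visualize the dependencies between services. Grafana typically builds this view from service graph metrics, which the Tempo metrics generator produces, so the view may be empty while the metrics generator is disabled.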

Application Instrumentation

OpenTelemetry SDK Configuration

To enable tracing in an application, configure the OpenTelemetry SDK:

Java Example

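// build.gradle
implementation 'io.opentelemetry:opentelemetry-api:1.32.0'
implementation 'io.opentelemetry:opentelemetry-sdk:1.32.0'
implementation 'io.opentelemetry:opentelemetry-exporter-otlp:1.32.0'

// Configuration
// Note: ResourceAttributes comes from the separate OpenTelemetry
// semantic-conventions artifact, which is not listed above.
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.OpenTelemetrySdk;
import io.opentelemetry.sdk.resources.Resource;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;

OpenTelemetry openTelemetry = OpenTelemetrySdk.builder()
    .setTracerProvider(
        SdkTracerProvider.builder()
            // Export spans in batches over OTLP/gRPC to the collector
            .addSpanProcessor(BatchSpanProcessor.builder(
                OtlpGrpcSpanExporter.builder()
                    .setEndpoint("http://otel-collector.opentelemetry.svc.cluster.local:4317")
                    .build())
                .build())
            // Identify this service in every exported span
            .setResource(Resource.getDefault()
                .merge(Resource.builder()
                    .put(ResourceAttributes.SERVICE_NAME, "my-service")
                    .build()))
            .build())
    .build();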

Node.js Example

const { NodeSDK } = require('@opentelemetry/sdk-node');
// The OTLP/gRPC trace exporter is published as @opentelemetry/exporter-trace-otlp-grpc
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://otel-collector.opentelemetry.svc.cluster.local:4317',
  }),
  serviceName: 'my-service',
});

sdk.start();

Environment Variables

env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector.opentelemetry.svc.cluster.local:4317"
  - name: OTEL_SERVICE_NAME
    value: "my-service"
  - name: OTEL_RESOURCE_ATTRIBUTES
    value: "service.version=1.0.0,deployment.environment=production"
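
The OTEL_RESOURCE_ATTRIBUTES value is a comma-separated list of key=value pairs. The sketch below shows how such a string breaks down into individual attributes; the function name is hypothetical, and real SDKs perform this parsing internally.

```javascript
// Sketch: split an OTEL_RESOURCE_ATTRIBUTES string into key/value pairs.
function parseResourceAttributes(raw) {
  const attrs = {};
  for (const pair of raw.split(',')) {
    const idx = pair.indexOf('=');
    if (idx <= 0) continue; // skip entries without a key or without '='
    const key = pair.slice(0, idx).trim();
    // Values may be percent-encoded, so decode them
    attrs[key] = decodeURIComponent(pair.slice(idx + 1).trim());
  }
  return attrs;
}

const attrs = parseResourceAttributes('service.version=1.0.0,deployment.environment=production');
console.log(attrs);
```

Each resulting key becomes a resource attribute attached to every span the service emits, alongside service.name from OTEL_SERVICE_NAME.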

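Trace Analysis

Performance Analysis

Traces support the following analyses:

End-to-end request latency: from the start of a request to its end
Per-service latency: time spent in each service
Bottlenecks: the operations that take the longest
Error propagation: where an error occurred and the path it propagated along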
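
The four analyses above can be sketched over a flat list of spans. The span shape, the sample data, and the function name are hypothetical; self time is computed as a span's duration minus its direct children, which assumes children do not overlap in time.

```javascript
// Hypothetical spans of a single trace (times in milliseconds).
const spans = [
  { id: 'a', parentId: null, service: 'gateway', name: 'GET /orders', startMs: 0, endMs: 120, error: false },
  { id: 'b', parentId: 'a', service: 'orders', name: 'listOrders', startMs: 10, endMs: 110, error: false },
  { id: 'c', parentId: 'b', service: 'db', name: 'SELECT orders', startMs: 20, endMs: 100, error: true },
];

function analyzeTrace(spans) {
  const duration = (s) => s.endMs - s.startMs;
  // End-to-end latency: earliest start to latest end
  const totalMs = Math.max(...spans.map((s) => s.endMs)) - Math.min(...spans.map((s) => s.startMs));
  // Self time: own duration minus the durations of direct children
  const selfMs = new Map(spans.map((s) => [s.id, duration(s)]));
  for (const s of spans) {
    if (s.parentId !== null && selfMs.has(s.parentId)) {
      selfMs.set(s.parentId, selfMs.get(s.parentId) - duration(s));
    }
  }
  // Per-service latency: sum of self times
  const perService = {};
  for (const s of spans) {
    perService[s.service] = (perService[s.service] || 0) + selfMs.get(s.id);
  }
  // Bottleneck: the span with the largest self time
  const bottleneck = spans.reduce((a, b) => (selfMs.get(b.id) > selfMs.get(a.id) ? b : a));
  // Error propagation starts at the spans flagged as errors
  const errors = spans.filter((s) => s.error).map((s) => s.name);
  return { totalMs, perService, bottleneck: bottleneck.name, errors };
}

console.log(analyzeTrace(spans));
```

In this sample the database query accounts for 80 of the 120 ms, so it is reported as the bottleneck and as the origin of the error.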

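Correlation Analysis

In Grafana, traces can be linked with metrics and logs for analysis:

Trace → metrics: look up related metrics by trace ID
Trace → logs: look up related logs by trace ID
Metrics → trace: look up related traces from an anomalous metric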

Troubleshooting

Traces Are Not Being Collected

# Check Tempo status
kubectl logs -n tempo deployment/tempo

# Check OpenTelemetry Collector status
kubectl logs -n opentelemetry daemonset/otel-collector-collector

# Test connectivity to the OTLP endpoint
# (port 4317 speaks gRPC, so curl only confirms that the port is reachable)
kubectl exec -n tempo deployment/tempo -- \
  curl -v http://otel-collector.opentelemetry.svc.cluster.local:4317

Trace Search Does Not Work

# Check Tempo API status
# (port-forward blocks, so run the curl commands in a second terminal)
kubectl port-forward -n tempo svc/tempo 3100:3100
curl -s "http://localhost:3100/api/search/tags"

# Check the number of traces created
curl -s "http://localhost:3100/metrics" | grep tempo_ingester_traces_created_total

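High Memory Usage

# Check the number of live traces (run from inside the cluster)
curl -s "http://tempo.tempo.svc.cluster.local:3100/metrics" | \
  grep tempo_ingester_live_traces

# Restart the pod to free memory
# (storage is an emptyDir, so stored traces are discarded as well)
kubectl rollout restart -n tempo deployment/tempo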

Performance Optimization

Sampling Configuration

Collecting every trace can affect performance, so set an appropriate sampling ratio:

# Application settings
env:
  - name: OTEL_TRACES_SAMPLER
    value: "traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"  # Sample 10% of traces
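
Ratio-based sampling decides from the trace ID itself, so every service that sees the same trace ID reaches the same decision. The sketch below illustrates that idea by comparing the lower 64 bits of the ID against a threshold; it is not the exact algorithm of any particular SDK, and the function name is hypothetical.

```javascript
// Sketch: deterministic ratio-based sampling keyed on the trace ID.
function shouldSample(traceIdHex, ratio) {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Treat the lower 64 bits of the 128-bit trace ID as a uniform random value
  const lower64 = BigInt('0x' + traceIdHex.slice(16));
  // Threshold: the given fraction of the 64-bit value space
  const bound = BigInt(Math.floor(ratio * 2 ** 64));
  return lower64 < bound;
}

console.log(shouldSample('4bf92f3577b34da60000000000000001', 0.1)); // prints true
console.log(shouldSample('4bf92f3577b34da6ffffffffffffffff', 0.1)); // prints false
```

Because the decision depends only on the ID and the ratio, repeated calls with the same inputs always agree, which keeps sampled traces complete across services.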

Batch Processing

Send trace data in batches to reduce network overhead:

# OpenTelemetry Collector configuration
processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
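
The two settings above interact as follows: a batch is sent as soon as it reaches the size threshold, or when the timeout elapses, whichever comes first. The class below is a simplified sketch of that behavior, not the Collector's implementation; all names are hypothetical.

```javascript
// Sketch: buffer items and flush on size or timeout, whichever comes first.
class SpanBatcher {
  constructor(exportFn, { maxBatchSize = 1024, timeoutMs = 1000 } = {}) {
    this.exportFn = exportFn;
    this.maxBatchSize = maxBatchSize;
    this.timeoutMs = timeoutMs;
    this.buffer = [];
    this.timer = null;
  }

  add(span) {
    this.buffer.push(span);
    if (this.buffer.length >= this.maxBatchSize) {
      this.flush(); // size threshold reached
    } else if (this.timer === null) {
      // Start the timeout when the first span of a new batch arrives
      this.timer = setTimeout(() => this.flush(), this.timeoutMs);
    }
  }

  flush() {
    if (this.timer !== null) {
      clearTimeout(this.timer);
      this.timer = null;
    }
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.exportFn(batch);
  }
}

const sent = [];
const batcher = new SpanBatcher((batch) => sent.push(batch.length), { maxBatchSize: 3, timeoutMs: 1000 });
for (let i = 0; i < 7; i++) batcher.add({ name: `span-${i}` });
batcher.flush(); // send the remainder instead of waiting for the timer
console.log(sent); // prints [ 3, 3, 1 ]
```

Seven spans with a batch size of three result in three export calls instead of seven, which is the overhead reduction that batching aims for.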

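Security Considerations

Network Security

Allow in-cluster communication only
TLS encryption (internal communication)
External access only through Grafana

Data Protection

Filter span attributes that contain sensitive information
Trace ID-based access control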
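
Attribute filtering can be sketched as a function that masks values whose keys look sensitive before a span is exported. The key patterns, the placeholder text, and the function name are assumptions for illustration; a real deployment would choose its own list.

```javascript
// Sketch: mask span attribute values whose keys match sensitive patterns.
// The pattern list is an assumption, not an exhaustive policy.
const SENSITIVE_KEYS = [/password/i, /authorization/i, /token/i, /secret/i];

function redactAttributes(attributes) {
  const out = {};
  for (const [key, value] of Object.entries(attributes)) {
    out[key] = SENSITIVE_KEYS.some((re) => re.test(key)) ? '[REDACTED]' : value;
  }
  return out;
}

const redacted = redactAttributes({
  'http.method': 'POST',
  'http.request.header.authorization': 'Bearer abc123',
  'user.password': 'hunter2',
});
console.log(redacted);
```

The same filtering could instead be applied centrally in the collector pipeline, so that individual services do not each need their own rules.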

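Tracing Best Practices

Span Design

Meaningful span names: describe the operation clearly
Appropriate span hierarchy: neither too deep nor too shallow
Useful attributes: include information that helps with debugging
Error handling: record error details on the span when an exception occurs

Service Identification

# Service attribute settings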
resource:
  service.name: "user-service"
  service.version: "1.2.3"
  deployment.environment: "production"

Correlation IDs

Include the trace ID in logs and metrics (for metrics, as exemplars rather than labels) to make correlation analysis easier.
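
One way to get the trace ID into logs is to read it from the incoming W3C traceparent header and attach it to each log entry. The sketch below assumes that header format; the function names and the log field names (trace_id, span_id) are hypothetical.

```javascript
// Sketch: extract IDs from a W3C traceparent header and add them to a log line.
function parseTraceparent(header) {
  // Format: version-traceId-spanId-flags, all lowercase hex
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // All-zero IDs are invalid
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

function logWithTrace(message, ctx) {
  return JSON.stringify({ message, trace_id: ctx.traceId, span_id: ctx.spanId });
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(logWithTrace('order created', ctx));
```

With the trace ID present as a structured field, a log entry can be matched to its trace, and a trace to its log entries.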
