OMR Grading Scaling

Use this guide when OMR batch grading slows down, queues build up, grading pods OOM, callbacks arrive out of order, or KEDA does not produce the expected throughput.

The incident behind this runbook happened during the April 7-8, 2026 OMR scaling push. The original shape was:

RabbitMQ -> Spring listener -> HTTP POST -> grading-svc

The recovered shape is:

RabbitMQ -> grading-svc direct consumer -> MinIO image fetch -> OpenCV grading
         -> backend internal callback -> backend DB write and progress tracking

Symptom

Treat the old intermediary pattern as suspect when these signals appear together:

grading-svc OOMKills while RabbitMQ still has queued work.
Spring listener concurrency changes move the failure around instead of fixing it.
KEDA creates pods, but the batch finishes before cold pods contribute meaningful work.
CPU-bound OpenCV work shows long p95 latency even without Kubernetes CPU limits.
Backend callbacks need tenant, auth, and idempotency context that is hard to preserve through an HTTP relay.
Parallel grading creates duplicate ExamResult rows for the same student and exam.

Likely Cause

The most important failure was not the pod size. The queue consumer and the worker were different processes, so RabbitMQ backpressure stopped at Spring instead of at the actual image-processing work.

With RabbitMQ -> Spring -> HTTP -> grading-svc, concurrency was set independently at two layers:

Layer	Knob	Failure mode
Spring	listener concurrency	pushes more HTTP work than the worker can absorb
Worker	replicas, semaphore, memory, OpenCV threads	sees bursts that are no longer shaped by RabbitMQ prefetch

That split made prefetch ineffective as the main safety control. It also made tenant headers, internal API auth, duplicate-result protection, and callback semantics harder to reason about.
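A toy simulation can make the split concrete. This is a sketch, not the production code: an asyncio semaphore stands in for whichever layer decides how much work reaches the worker (Spring listener concurrency in the relay model, RabbitMQ prefetch in the direct model), and the values 16, 2, and 50 are illustrative assumptions.

```python
import asyncio


async def peak_in_flight(jobs: int, admit_limit: int, work_seconds: float = 0.001) -> int:
    """Return the largest number of jobs the worker held at the same time.

    admit_limit models the layer that admits work to the worker. It is a
    stand-in for illustration, not a real broker or HTTP client.
    """
    gate = asyncio.Semaphore(admit_limit)
    in_flight = 0
    peak = 0

    async def one_job() -> None:
        nonlocal in_flight, peak
        async with gate:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(work_seconds)  # placeholder for image grading
            in_flight -= 1

    await asyncio.gather(*(one_job() for _ in range(jobs)))
    return peak


# Relay model: the bound is Spring listener concurrency (assumed 16 here),
# chosen without knowledge of what the worker can absorb.
relay_peak = asyncio.run(peak_in_flight(jobs=50, admit_limit=16))

# Direct model: the bound is the worker's own prefetch (assumed 2 here).
direct_peak = asyncio.run(peak_in_flight(jobs=50, admit_limit=2))

print(relay_peak, direct_peak)
```

The worker's peak load follows whichever layer admits work. In the relay model that layer is not the process that pays the memory cost.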

Diagnostic Path

Check whether the process consuming the RabbitMQ message is also the process doing the expensive image work.
Inspect RabbitMQ prefetch. In the direct-consumer model, each grading pod should own a small bounded number of in-flight messages.
Compare queue drain time with pod startup time. If pod startup plus OpenCV import warmup is about 40 seconds and batches finish sooner, minReplicaCount is the real capacity setting.
Compare omr_grade timing with download, exam fetch, and callback timing. In the June 2026 follow-up, omr_grade dominated the request lifecycle: 3828ms of the 3970ms usecase average before optimization.
Check CPU limits before raising memory. CPU limits can create throttling for OpenCV-heavy work; Lumie keeps no CPU limit on this path and uses a realistic CPU request.
Check database uniqueness for parallel result creation. Code-level "find then create" logic is not enough when multiple pods grade the same exam concurrently.
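The drain-time versus startup-time comparison is simple arithmetic. The sketch below uses the figures from this runbook (about 40 seconds of pod startup plus OpenCV warmup, and the measured 24-second and 11-second batch drains); the function name and the 120-second example are hypothetical.

```python
def useful_cold_seconds(drain_seconds: float, startup_seconds: float) -> float:
    """Seconds of the batch that a pod started at batch begin can still work on."""
    return max(0.0, drain_seconds - startup_seconds)


STARTUP_SECONDS = 40.0  # pod startup plus OpenCV import warmup, per this runbook

# Measured drains from the April 2026 recovery: both finish before a cold pod is ready.
print(useful_cold_seconds(24.0, STARTUP_SECONDS))   # 170-image batch
print(useful_cold_seconds(11.0, STARTUP_SECONDS))   # 89-image batch

# Hypothetical longer batch: a cold pod would get 80 seconds of useful work.
print(useful_cold_seconds(120.0, STARTUP_SECONDS))
```

When the result is zero, scale-up cannot help that batch, and only the warm pool counts as capacity.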

Useful read-only probes:

kubectl -n applications get scaledobject grading-svc -o yaml
kubectl -n applications get deploy grading-svc -o yaml
kubectl -n applications top pod -l app.kubernetes.io/name=grading-svc
kubectl -n applications logs deploy/grading-svc --tail=200

Fix

Use the direct worker-consumer pattern:

Concern	Before	After
Message consumer	Spring listener	`grading-svc` with `aio-pika`
Image access	Spring HTTP multipart relay	worker downloads from MinIO
Result persistence	worker or relay path owns too much state	backend internal callback owns DB writes
Backpressure	split across Spring and worker	RabbitMQ prefetch plus pod count
Scaling unit	HTTP requests into one service	one pod equals one or a few bounded consumers
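A minimal sketch of the direct consumer with `aio-pika`, under stated assumptions: the queue name, the prefetch value, the JSON message shape with a `job_id` field, and the `grade` callable are all hypothetical, and the real `grading-svc` code may differ.

```python
import asyncio
import json
from typing import Any, Callable, Dict

QUEUE_NAME = "omr.grading"  # hypothetical queue name
PREFETCH = 2                # assumed per-pod bound on unacknowledged messages


def parse_job(body: bytes) -> Dict[str, Any]:
    """Decode one message into a job dict. The JSON shape is an assumption."""
    job = json.loads(body.decode("utf-8"))
    if "job_id" not in job:
        raise ValueError("message is missing job_id")
    return job


async def consume(amqp_url: str, grade: Callable[[Dict[str, Any]], None]) -> None:
    """Consume grading jobs in the same process that does the image work."""
    import aio_pika  # imported lazily so parse_job stays usable without it

    connection = await aio_pika.connect_robust(amqp_url)
    async with connection:
        channel = await connection.channel()
        # The broker stops delivering once PREFETCH messages are unacknowledged.
        await channel.set_qos(prefetch_count=PREFETCH)
        queue = await channel.declare_queue(QUEUE_NAME, durable=True)
        async with queue.iterator() as messages:
            # This loop grades one message at a time; prefetch only bounds
            # how many messages the pod holds while it works.
            async for message in messages:
                # Ack on success; reject without requeue if grading raises.
                async with message.process():
                    job = parse_job(message.body)
                    # Run the CPU-bound grading off the event loop thread.
                    await asyncio.to_thread(grade, job)
```

A worker entry point would call something like `asyncio.run(consume(url, grade_fn))`. Because the message is acknowledged only after grading finishes, RabbitMQ sees the real state of the image work.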

Keep the runtime assumptions explicit:

Set RabbitMQ prefetch to the intended per-pod in-flight work.
Keep CPU limits off for the CPU-bound grading worker unless production evidence says otherwise.
Keep CPU requests honest enough for scheduling. The June 2026 production path raised the request from 50m to 250m.
Cap OpenCV and native numeric library thread pools so queue concurrency does not multiply into excessive native threads.
Use a warm pool when batches are short enough that cold-start scale-up cannot help the first batch.
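One way to cap native thread pools is sketched below. The environment variables are the common ones read by OpenMP, OpenBLAS, MKL, and numexpr; which of them the production worker actually sets is not recorded here, so treat the list and the helper name as assumptions.

```python
import os
from typing import MutableMapping

# Common thread-pool variables; the production set may differ.
THREAD_ENV_VARS = (
    "OMP_NUM_THREADS",
    "OPENBLAS_NUM_THREADS",
    "MKL_NUM_THREADS",
    "NUMEXPR_NUM_THREADS",
)


def cap_native_threads(threads: int, env: MutableMapping[str, str] = os.environ) -> None:
    """Pin native thread pools. Call before numpy or cv2 is first imported."""
    for name in THREAD_ENV_VARS:
        env[name] = str(threads)
    try:
        import cv2
    except ImportError:
        return  # OpenCV is not installed in this environment
    # OpenCV keeps its own pool, so it is capped separately.
    cv2.setNumThreads(threads)
```

Calling `cap_native_threads(1)` at the top of the worker entry point keeps total native threads close to the number of in-flight messages instead of a multiple of it.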

Verification

The April 2026 architecture recovery produced these scaling signals:

Metric	Before	After
170-image batch	minutes with failures	24s with 10 warm pods
89-image batch	minutes	about 11s with 6 pods
Peak memory per pod	OOMKilled at 512Mi	about 155Mi
Failure rate	high	0% in the measured run

The June 21, 2026 performance follow-up then improved the 100-image production command path:

State	Wall time	Usecase avg	`omr_grade` avg	`omr_grade` p95
Original 50m request	49.3s	3970ms	3828ms	9375ms
Request raised to 250m	38.3s	3176ms	2971ms	5266ms
Optimized recognizer + 250m + thread env	21.5s	1743ms	1327ms	2011ms
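The ratios implied by the table can be checked directly. The sketch below only restates the table's numbers; the helper name is hypothetical.

```python
def speedup(before: float, after: float) -> float:
    """How many times faster the 'after' measurement is."""
    return before / after


# Original 50m request versus optimized recognizer + 250m + thread env.
wall_speedup = speedup(49.3, 21.5)       # wall time, seconds
grade_avg_speedup = speedup(3828, 1327)  # omr_grade average, ms
grade_p95_speedup = speedup(9375, 2011)  # omr_grade p95, ms

# Share of the usecase average spent inside omr_grade.
share_before = 3828 / 3970
share_after = 1327 / 1743

print(round(wall_speedup, 2), round(grade_avg_speedup, 2), round(grade_p95_speedup, 2))
print(round(share_before, 2), round(share_after, 2))
```

Wall time improved about 2.3x and `omr_grade` p95 about 4.7x, while the `omr_grade` share of the usecase average fell from roughly 96% to roughly 76%.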

Correctness still owns the release gate. Performance changes to OMR recognition must pass the golden real-scan corpus before rollout. The June 2026 optimization preserved output across 706 deduplicated scanned sheets.
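A golden-corpus gate can be as small as the sketch below. The corpus layout (sheet id mapped to the expected answer list) and both function names are hypothetical; the real corpus runner may store and compare results differently.

```python
from typing import Callable, Dict, List


def golden_mismatches(
    recognize: Callable[[str], List[str]],
    golden: Dict[str, List[str]],
) -> List[str]:
    """Return the sheet ids whose recognized answers differ from the golden record."""
    return sorted(
        sheet_id
        for sheet_id, expected in golden.items()
        if recognize(sheet_id) != expected
    )


# Tiny illustrative corpus; the real one holds 706 deduplicated scans.
golden = {"sheet-001": ["A", "C", "B"], "sheet-002": ["D", "D", "A"]}


def same_output(sheet_id: str) -> List[str]:
    return list(golden[sheet_id])


def drifted_output(sheet_id: str) -> List[str]:
    answers = list(golden[sheet_id])
    if sheet_id == "sheet-002":
        answers[0] = "B"  # simulated regression from an optimization
    return answers


print(golden_mismatches(same_output, golden))
print(golden_mismatches(drifted_output, golden))
```

A rollout would proceed only when the mismatch list is empty, regardless of how much faster the new recognizer is.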

Prevention

Keep consumer = worker for one-message-one-job queues unless a real fan-out router is needed.
Prefer broker-level backpressure over ad hoc semaphores for queue concurrency.
Keep a unique database constraint on result ownership, such as (exam_id, student_id).
Treat KEDA minReplicaCount as the real capacity setting for short one-shot user batches.
Measure the expensive stage before changing Kubernetes resources.
Keep the OMR golden corpus runner current whenever recognition logic changes.
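The unique-constraint item can be illustrated with an in-memory SQLite database standing in for the production database. The table and column names are hypothetical; the point is that the database, not a "find then create" check in code, rejects the second row.

```python
import sqlite3


def record_result(conn: sqlite3.Connection, exam_id: int, student_id: int, score: float) -> bool:
    """Insert one result row. Return False if the pair already has a row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO exam_result (exam_id, student_id, score) VALUES (?, ?, ?)",
                (exam_id, student_id, score),
            )
        return True
    except sqlite3.IntegrityError:
        # A concurrent grader already wrote this student's result.
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exam_result ("
    " exam_id INTEGER NOT NULL,"
    " student_id INTEGER NOT NULL,"
    " score REAL NOT NULL,"
    " UNIQUE (exam_id, student_id))"
)

first = record_result(conn, exam_id=1, student_id=42, score=87.5)
second = record_result(conn, exam_id=1, student_id=42, score=90.0)
row_count = conn.execute("SELECT COUNT(*) FROM exam_result").fetchone()[0]
print(first, second, row_count)
```

Whether the losing writer should skip, update, or report the conflict is a product decision; the constraint only guarantees that duplicates cannot be stored.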

Source Incident Detail

The April 8, 2026 source incident records the scaling failure before the later June accuracy and performance work. The first architecture tried to coordinate one user's OMR grading burst through too many application-level concurrency controls and callback assumptions.

Detail	Value
Affected workflow	OMR batch grading
Initial failure shape	11 cascading failures in the first architecture
Final model	one RabbitMQ message per grading job, broker backpressure, KEDA-backed workers
Important database invariant	result ownership needs uniqueness such as `(exam_id, student_id)`
Later validation	June work preserved output over 706 deduplicated scans

The source record grouped the original anti-patterns into these classes:

Class	Failure mode
fan-out in the wrong layer	the application tried to coordinate work that belonged in the broker
weak idempotency	repeated callbacks and retries could drift job/result state
ad hoc concurrency	semaphores and local coordination fought the queue model
missing capacity floor	short user-triggered batches could finish before autoscaling reacted
insufficient observability	expensive stages were not isolated before resource changes
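The observability gap is the cheapest of these to close in code. The sketch below times each stage of a grading job separately; the class is hypothetical, and the stage names follow the ones used in this runbook (download, exam fetch, omr_grade, callback).

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator, List


class StageTimer:
    """Collect per-stage durations in milliseconds."""

    def __init__(self) -> None:
        self.samples: DefaultDict[str, List[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.samples[name].append((time.perf_counter() - start) * 1000.0)

    def slowest_stage(self) -> str:
        """Name of the stage with the largest total recorded time."""
        return max(self.samples, key=lambda name: sum(self.samples[name]))


timer = StageTimer()
with timer.stage("download"):
    pass  # placeholder for the MinIO fetch
with timer.stage("omr_grade"):
    time.sleep(0.02)  # placeholder for OpenCV grading
with timer.stage("callback"):
    pass  # placeholder for the backend callback

print(timer.slowest_stage())
```

With per-stage numbers in hand, a resource change can be aimed at the stage that actually dominates.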

The second architecture kept the queue model simple: a message represents a unit of work, the broker controls delivery, workers process the job and call back to the backend, and KEDA keeps a warm minimum capacity for the expected short-batch shape.

When changing this path, keep this troubleshooting order:

Verify message shape and queue ownership before changing worker code.
Verify idempotency and unique constraints before increasing concurrency.
Verify KEDA minimum capacity before relying on scale-from-zero behavior.
Verify recognition correctness before accepting any performance improvement.
