feature: add rr_jobs_jobs_requeue metric#149
Conversation
Re-queued jobs returned nil from the protocol response handler and were counted as jobs_ok, so they were indistinguishable from finally-processed jobs. Add a dedicated rr_jobs_jobs_requeue counter, incremented in the response handler wherever a job is actually re-queued (Error+requeue, NACK with requeue, and explicit REQUEUE).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR implements job requeue tracking by introducing an Outcome enum to the response handler API, instrumenting a new Prometheus counter for requeued jobs, wiring metrics through the plugin, adding comprehensive tests, and enhancing CI with LocalStack health checks and configuration pinning. ChangesRequeue Outcome Tracking and Instrumentation
CI and Integration Test Infrastructure
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
Adds a dedicated rr_jobs_jobs_requeue metric to make job requeue outcomes observable in Prometheus, and wires the protocol response handler to emit this signal for requeue-related responses.
Changes:
- Introduces
jobs_requeuetracking in the JOBS Prometheus exporter and asserts it in the metrics integration test. - Extends the protocol response handler to accept a small metrics interface and increment requeue counts on
Error+requeue,NACK+requeue, and explicitREQUEUE. - Adds unit tests covering response→requeue-metric mapping and ensuring failed driver requeues are not counted.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
tests/jobs_general_test.go |
Asserts the new rr_jobs_jobs_requeue metric is present in exported metrics output. |
protocol/handler.go |
Adds a MetricsCounter interface and passes it into the response handler. |
protocol/handler_test.go |
Adds unit tests validating requeue metric increments for relevant response types. |
protocol/error.go |
Increments the requeue metric on successful requeue paths. |
plugin.go |
Wires plugin metrics exporter into the protocol response handler. |
metrics.go |
Adds jobs_requeue exporter state + descriptor and emits it during collection. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Handle now reports whether the job was re-queued; the listener increments jobs_ok only when it wasn't, so a re-queue is counted once (jobs_requeue) instead of also bumping jobs_ok.
TestJobsInit ran SQS against an untagged localstack/localstack image with only a blind sleep 30 for readiness, causing intermittent 'SQS CreateQueue: context deadline exceeded'. Pin LocalStack to 4.14.0 (SERVICES=sqs + eager loading) and replace the sleep with a bounded /_localstack/health SQS readiness gate that fails fast with the container logs.
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/linux.yml:
- Line 69: The jq readiness check uses test("available|running") which can match
substrings like "unavailable"; update the regex used in that check (the
test(...) invocation) to anchor the alternatives so only exact statuses match
(e.g. use ^(...)$ or word boundaries around available|running) so the CI gate
only passes when SQS is actually "available" or "running".
In `@metrics.go`:
- Around line 111-115: The metrics exported by statsExporter.Collect() are
currently created with prometheus.GaugeValue but represent
monotonically-incremented counters (se.jobsOk, se.jobsErr, se.jobsRequeue,
se.pushOk, se.pushErr), so change their metric type to prometheus.CounterValue
when calling prometheus.MustNewConstMetric for se.jobsOkDesc, se.jobsErrDesc,
se.jobsRequeueDesc, se.pushOkDesc and se.pushErrDesc to ensure Prometheus treats
them as counters (use the same value retrieval via .Load() but pass
prometheus.CounterValue). Ensure the change is applied where
statsExporter.Collect() emits these metrics.
In `@protocol/handler.go`:
- Around line 64-68: Replace the boolean "requeued" return from
RespHandler.Handle with a small outcome enum (e.g., Outcome{OK, Failed,
Requeued}) so callers can distinguish success vs terminal failure vs requeue;
change Handle's signature to return (Outcome, error), update the implementations
that call/return from handleErrResp and handleNackResponse to return the correct
Outcome value (handleErrResp -> Failed or Requeued as appropriate,
handleNackResponse -> Failed or Requeued), and update the caller(s) such as the
listener logic that currently checks the boolean to instead branch on Outcome to
increment the correct metric (jobs_ok vs jobs_requeue vs final failures).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 46b5bdfc-da9e-403a-9396-09d35fd86c83
📒 Files selected for processing (9)
.github/workflows/linux.ymllistener.gometrics.goplugin.goprotocol/error.goprotocol/handler.goprotocol/handler_test.gotests/env/docker-compose-jobs.yamltests/jobs_general_test.go
The jobs RPC migrated to Connect-RPC (jobs.v2.JobsService), but the PHP spiral/roadrunner-jobs client (v4.7.0, latest) still calls the legacy goridge jobs.List method, so $jobs->count() in server.on_init misreads the Connect response and OOMs the worker. Skip pending a Connect-RPC-capable PHP client; tracked in roadrunner#2085. Unrelated to the requeue metric and SQS CI changes in this PR.
The unanchored test("available|running") also matched substrings such as "unavailable", so the gate could pass before SQS was ready. Anchor with ^(...)$ to match only the exact ready states.
rr_jobs_jobs_{ok,err,requeue} and rr_jobs_push_{ok,err} increment monotonically, so export them as prometheus.CounterValue (were GaugeValue); gauges break PromQL rate()/increase().
Handle returns an Outcome enum (ok/failed/requeued) instead of a requeued bool; the listener records the matching metric. Terminal failures (Error or NACK without requeue) now count as rr_jobs_jobs_err instead of rr_jobs_jobs_ok. The protocol package no longer depends on a metrics interface.
Adds a dedicated
rr_jobs_jobs_requeuecounter so re-queued jobs can be told apart from finally-failed and successfully-processed ones. Until now a re-queue returnednilfrom the protocol handler and was tallied asrr_jobs_jobs_ok, so it was invisible.rr_jobs_jobs_requeuemetric on the JOBS stats exporter, mirroringjobs_ok/jobs_errError+requeue,NACKwithrequeue, and explicitREQUEUEMetricsCounterinterface, so theprotocolpackage stays free of ajobsimport cyclerr_jobs_jobs_requeuealso asserted in the metrics integration testcloses roadrunner-server/roadrunner#1566
Summary by CodeRabbit
New Features
Improvements