feature: add rr_jobs_jobs_requeue metric by rustatian · Pull Request #149 · roadrunner-server/jobs

rustatian · 2026-06-02T21:19:35Z

Adds a dedicated rr_jobs_jobs_requeue counter so re-queued jobs can be told apart from finally-failed and successfully-processed ones. Until now a re-queue returned nil from the protocol handler and was tallied as rr_jobs_jobs_ok, so it was invisible.

- new rr_jobs_jobs_requeue metric on the JOBS stats exporter, mirroring jobs_ok / jobs_err
- incremented in the protocol response handler wherever a job is actually re-queued: Error + requeue, NACK with requeue, and explicit REQUEUE
- the handler takes a small MetricsCounter interface, so the protocol package stays free of a jobs import cycle
- unit tests for the re-queue → metric mapping (incl. a failed re-queue not being counted); rr_jobs_jobs_requeue also asserted in the metrics integration test

closes roadrunner-server/roadrunner#1566

Summary by CodeRabbit

New Features
- Added Prometheus metric for tracking requeued jobs.
Improvements
- Enhanced job outcome tracking to clearly distinguish between successful, failed, and requeued states.
- Improved Docker Compose service health verification for test environments.

Re-queued jobs returned nil from the protocol response handler and were counted as jobs_ok, so they were indistinguishable from finally-processed jobs. Add a dedicated rr_jobs_jobs_requeue counter, incremented in the response handler wherever a job is actually re-queued (Error+requeue, NACK with requeue, and explicit REQUEUE).

coderabbitai · 2026-06-02T21:19:57Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eeb6bc00-e7d9-4430-9d68-a859ed3b42f6

📥 Commits

Reviewing files that changed from the base of the PR and between b4db317 and e7b0409.

📒 Files selected for processing (6)

.github/workflows/linux.yml
listener.go
metrics.go
protocol/error.go
protocol/handler.go
protocol/handler_test.go

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/linux.yml

📝 Walkthrough

This PR implements job requeue tracking by introducing an Outcome enum to the response handler API, instrumenting a new Prometheus counter for requeued jobs, wiring metrics through the plugin, adding comprehensive tests, and enhancing CI with LocalStack health checks and configuration pinning.

Changes

Requeue Outcome Tracking and Instrumentation

| Layer / File(s) | Summary |
| --- | --- |
| Outcome type definition and Handle signature `protocol/handler.go` | Introduce `Outcome` enum with `OutcomeOK`, `OutcomeFailed`, `OutcomeRequeued` constants and `String()` method; refactor `RespHandler.Handle()` to return `(Outcome, error)` and dispatch all response types through outcome-aware branches. |
| Error and requeue response handling `protocol/error.go` | Update `handleErrResp()`, `handleNackResponse()`, and `requeue()` to return `(Outcome, error)`; propagate `OutcomeRequeued` from successful requeue operations, `OutcomeFailed` from ACK/NACK paths, and `OutcomeOK` with unmarshal errors. |
| Requeue Prometheus metric instrumentation `metrics.go` | Add `jobsRequeue` atomic counter and `jobsRequeueDesc` descriptor; implement `CountJobRequeue()` method; update `newStatsExporter()`, `Describe()`, and `Collect()` to register and emit the new metric; switch existing metrics from `GaugeValue` to `CounterValue`. |
| Plugin and listener outcome dispatch to metrics `listener.go` | Add import alias for protocol package; capture `outcome` from `Handle()` in `Plugin.Execute`; dispatch on outcome to increment `CountJobOk()`, `CountJobErr()`, or `CountJobRequeue()` with debug logging; remove unconditional success counting. |
| Response handler outcome and requeue tests `protocol/handler_test.go` | Add `fakeJob` test mock and `discardLogger()` helper; implement `TestHandleOutcome()` table-driven test covering all response types and verifying returned `Outcome` and job state; add `TestHandleRequeueError()` to verify error propagation when `Requeue()` fails. |

CI and Integration Test Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| LocalStack configuration and GitHub Actions health polling `tests/env/docker-compose-jobs.yaml`, `.github/workflows/linux.yml` | Pin LocalStack Docker image to `4.14.0`; enable eager SQS loading via environment config; replace fixed sleep with 120-second health-check polling loop that verifies SQS service is available before running tests. |
| Integration test metric assertions and skip `tests/jobs_general_test.go` | Add assertions that `rr_jobs_jobs_requeue` metric equals 0 at three metric snapshot checkpoints in `TestJOBSMetrics`; skip `TestIssue2085` with inline comments explaining legacy client mismatch. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

roadrunner-server/jobs#127: Introduces the ACK/NACK/REQUEUE protocol feature that this PR builds upon with Outcome tracking and metric instrumentation.
roadrunner-server/jobs#128: Modifies protocol handler logic to support the new jobs API and requeue semantics, sharing core code paths with this PR.
roadrunner-server/jobs#139: Implements multi-pool support that touches Plugin.Execute and related configuration/routing paths in the same codebase.

Suggested labels

enhancement

Suggested reviewers

wolfy-j

Poem

🐰 Hops with glee through metric trails,
Requeued jobs now tell their tales,
Outcome states in protocol's hand,
Health checks make CI grand,
LocalStack's ready, tests align!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title 'feature: add rr_jobs_jobs_requeue metric' directly and clearly describes the main change: adding a new requeue metric to the JOBS plugin. |
| Description check | ✅ Passed | PR description clearly explains the motivation, implementation approach, and testing strategy, though the template's checklist items are not explicitly addressed. |
| Linked Issues check | ✅ Passed | PR fully implements issue `#1566` requirement to distinguish re-queued jobs from failed ones by introducing a dedicated rr_jobs_jobs_requeue metric. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the requeue metric feature: workflow improvements for testing reliability, protocol handler refactoring for outcome tracking, metrics instrumentation, and comprehensive tests. |


Copilot

Pull request overview

Adds a dedicated rr_jobs_jobs_requeue metric to make job requeue outcomes observable in Prometheus, and wires the protocol response handler to emit this signal for requeue-related responses.

Changes:

- Introduces jobs_requeue tracking in the JOBS Prometheus exporter and asserts it in the metrics integration test.
- Extends the protocol response handler to accept a small metrics interface and increment requeue counts on Error+requeue, NACK+requeue, and explicit REQUEUE.
- Adds unit tests covering response→requeue-metric mapping and ensuring failed driver requeues are not counted.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file

| File | Description |
| --- | --- |
| `tests/jobs_general_test.go` | Asserts the new `rr_jobs_jobs_requeue` metric is present in exported metrics output. |
| `protocol/handler.go` | Adds a `MetricsCounter` interface and passes it into the response handler. |
| `protocol/handler_test.go` | Adds unit tests validating requeue metric increments for relevant response types. |
| `protocol/error.go` | Increments the requeue metric on successful requeue paths. |
| `plugin.go` | Wires plugin metrics exporter into the protocol response handler. |
| `metrics.go` | Adds `jobs_requeue` exporter state + descriptor and emits it during collection. |


Handle now reports whether the job was re-queued; the listener increments jobs_ok only when it wasn't, so a re-queue is counted once (jobs_requeue) instead of also bumping jobs_ok.

TestJobsInit ran SQS against an untagged localstack/localstack image with only a blind sleep 30 for readiness, causing intermittent 'SQS CreateQueue: context deadline exceeded'. Pin LocalStack to 4.14.0 (SERVICES=sqs + eager loading) and replace the sleep with a bounded /_localstack/health SQS readiness gate that fails fast with the container logs.

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/linux.yml:
- Line 69: The jq readiness check uses test("available|running") which can match
substrings like "unavailable"; update the regex used in that check (the
test(...) invocation) to anchor the alternatives so only exact statuses match
(e.g. use ^(...)$ or word boundaries around available|running) so the CI gate
only passes when SQS is actually "available" or "running".

In `@metrics.go`:
- Around line 111-115: The metrics exported by statsExporter.Collect() are
currently created with prometheus.GaugeValue but represent
monotonically-incremented counters (se.jobsOk, se.jobsErr, se.jobsRequeue,
se.pushOk, se.pushErr), so change their metric type to prometheus.CounterValue
when calling prometheus.MustNewConstMetric for se.jobsOkDesc, se.jobsErrDesc,
se.jobsRequeueDesc, se.pushOkDesc and se.pushErrDesc to ensure Prometheus treats
them as counters (use the same value retrieval via .Load() but pass
prometheus.CounterValue). Ensure the change is applied where
statsExporter.Collect() emits these metrics.

In `@protocol/handler.go`:
- Around line 64-68: Replace the boolean "requeued" return from
RespHandler.Handle with a small outcome enum (e.g., Outcome{OK, Failed,
Requeued}) so callers can distinguish success vs terminal failure vs requeue;
change Handle's signature to return (Outcome, error), update the implementations
that call/return from handleErrResp and handleNackResponse to return the correct
Outcome value (handleErrResp -> Failed or Requeued as appropriate,
handleNackResponse -> Failed or Requeued), and update the caller(s) such as the
listener logic that currently checks the boolean to instead branch on Outcome to
increment the correct metric (jobs_ok vs jobs_requeue vs final failures).

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 46b5bdfc-da9e-403a-9396-09d35fd86c83

📥 Commits

Reviewing files that changed from the base of the PR and between dd3c924 and 424a4ab.

📒 Files selected for processing (9)

.github/workflows/linux.yml
listener.go
metrics.go
plugin.go
protocol/error.go
protocol/handler.go
protocol/handler_test.go
tests/env/docker-compose-jobs.yaml
tests/jobs_general_test.go

The jobs RPC migrated to Connect-RPC (jobs.v2.JobsService), but the PHP spiral/roadrunner-jobs client (v4.7.0, latest) still calls the legacy goridge jobs.List method, so $jobs->count() in server.on_init misreads the Connect response and OOMs the worker. Skip pending a Connect-RPC-capable PHP client; tracked in roadrunner#2085. Unrelated to the requeue metric and SQS CI changes in this PR.

The unanchored test("available|running") also matched substrings such as "unavailable", so the gate could pass before SQS was ready. Anchor with ^(...)$ to match only the exact ready states.
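
The substring problem is easy to reproduce outside jq. The snippet below is a small Go illustration using the standard `regexp` package, not the workflow's actual jq expression; the variable names `loose` and `strict` are made up for the example.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	// Unanchored: matches anywhere inside the input string.
	loose = regexp.MustCompile(`available|running`)
	// Anchored: the whole string must be one of the two statuses.
	strict = regexp.MustCompile(`^(available|running)$`)
)

func main() {
	for _, s := range []string{"available", "running", "unavailable"} {
		fmt.Printf("%-12s loose=%v strict=%v\n", s, loose.MatchString(s), strict.MatchString(s))
	}
}
```

The parentheses matter: without the group, `^available|running$` would anchor each alternative on one side only and still accept partial matches.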

rr_jobs_jobs_{ok,err,requeue} and rr_jobs_push_{ok,err} increment monotonically, so export them as prometheus.CounterValue (were GaugeValue); gauges break PromQL rate()/increase().

Handle returns an Outcome enum (ok/failed/requeued) instead of a requeued bool; the listener records the matching metric. Terminal failures (Error or NACK without requeue) now count as rr_jobs_jobs_err instead of rr_jobs_jobs_ok. The protocol package no longer depends on a metrics interface.
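
A minimal sketch of the listener side of this change, assuming a hypothetical `stats` struct with a `record` method: each outcome bumps exactly one counter, so a terminal failure can no longer land in the ok tally. The real exporter exposes `CountJobOk()`, `CountJobErr()` and `CountJobRequeue()` according to the review walkthrough; the struct and method below are stand-ins for illustration only.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Outcome mirrors the three states named in the commit message.
type Outcome int

const (
	OutcomeOK Outcome = iota
	OutcomeFailed
	OutcomeRequeued
)

// stats is a hypothetical stand-in for the plugin's stats exporter.
type stats struct {
	ok, failed, requeued atomic.Uint64
}

// record increments exactly one counter per handled job.
func (s *stats) record(o Outcome) {
	switch o {
	case OutcomeOK:
		s.ok.Add(1)
	case OutcomeFailed:
		s.failed.Add(1)
	case OutcomeRequeued:
		s.requeued.Add(1)
	}
}

func main() {
	var s stats
	for _, o := range []Outcome{OutcomeOK, OutcomeRequeued, OutcomeFailed, OutcomeRequeued} {
		s.record(o)
	}
	fmt.Println("ok:", s.ok.Load(), "err:", s.failed.Load(), "requeue:", s.requeued.Load())
}
```

Because the counters only ever go up, they fit the counter semantics that the metrics commit in this PR switches to.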

Copilot AI review requested due to automatic review settings June 2, 2026 21:19

Copilot started reviewing on behalf of rustatian June 2, 2026 21:19

Copilot AI reviewed Jun 2, 2026

Comment thread protocol/error.go Outdated

rustatian added 2 commits June 2, 2026 23:35

fix(metrics): don't double-count a re-queued job as jobs_ok

978eadc

coderabbitai Bot reviewed Jun 3, 2026

Comment thread .github/workflows/linux.yml Outdated

Comment thread metrics.go Outdated

Comment thread protocol/handler.go Outdated

rustatian self-assigned this Jun 3, 2026

rustatian added 4 commits June 3, 2026 17:38

ci(jobs): anchor the localstack sqs readiness regex

c6453e4

fix(metrics): export job and push tallies as counters

0857632

rustatian merged commit e57be7c into master Jun 3, 2026
7 checks passed

rustatian deleted the feature/jobs-requeue-metric branch June 3, 2026 17:50

rustatian merged 7 commits into master from feature/jobs-requeue-metric

2 participants
