feature: add rr_jobs_jobs_requeue metric#149

Merged
rustatian merged 7 commits into master from feature/jobs-requeue-metric
Jun 3, 2026

Conversation

@rustatian rustatian commented Jun 2, 2026

Member

Adds a dedicated rr_jobs_jobs_requeue counter so re-queued jobs can be told apart from finally-failed and successfully-processed ones. Until now a re-queue returned nil from the protocol handler and was tallied as rr_jobs_jobs_ok, so it was invisible.

  • new rr_jobs_jobs_requeue metric on the JOBS stats exporter, mirroring jobs_ok / jobs_err
  • incremented in the protocol response handler wherever a job is actually re-queued: Error + requeue, NACK with requeue, and explicit REQUEUE
  • the handler takes a small MetricsCounter interface, so the protocol package stays free of a jobs import cycle
  • unit tests for the re-queue → metric mapping (incl. a failed re-queue not being counted); rr_jobs_jobs_requeue also asserted in the metrics integration test

closes roadrunner-server/roadrunner#1566

Summary by CodeRabbit

  • New Features

    • Added Prometheus metric for tracking requeued jobs.
  • Improvements

    • Enhanced job outcome tracking to clearly distinguish between successful, failed, and requeued states.
    • Improved Docker Compose service health verification for test environments.

Re-queued jobs returned nil from the protocol response handler and were counted as jobs_ok, so they were indistinguishable from finally-processed jobs. Add a dedicated rr_jobs_jobs_requeue counter, incremented in the response handler wherever a job is actually re-queued (Error+requeue, NACK with requeue, and explicit REQUEUE).
Copilot AI review requested due to automatic review settings June 2, 2026 21:19
coderabbitai Bot commented Jun 2, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: eeb6bc00-e7d9-4430-9d68-a859ed3b42f6

📥 Commits

Reviewing files that changed from the base of the PR and between b4db317 and e7b0409.

📒 Files selected for processing (6)
  • .github/workflows/linux.yml
  • listener.go
  • metrics.go
  • protocol/error.go
  • protocol/handler.go
  • protocol/handler_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/linux.yml

📝 Walkthrough

This PR implements job requeue tracking by introducing an Outcome enum to the response handler API, instrumenting a new Prometheus counter for requeued jobs, wiring metrics through the plugin, adding comprehensive tests, and enhancing CI with LocalStack health checks and configuration pinning.

Changes

Requeue Outcome Tracking and Instrumentation

| Layer / File(s) | Summary |
| --- | --- |
| Outcome type definition and Handle signature (protocol/handler.go) | Introduce Outcome enum with OutcomeOK, OutcomeFailed, OutcomeRequeued constants and String() method; refactor RespHandler.Handle() to return (Outcome, error) and dispatch all response types through outcome-aware branches. |
| Error and requeue response handling (protocol/error.go) | Update handleErrResp(), handleNackResponse(), and requeue() to return (Outcome, error); propagate OutcomeRequeued from successful requeue operations, OutcomeFailed from ACK/NACK paths, and OutcomeOK with unmarshal errors. |
| Requeue Prometheus metric instrumentation (metrics.go) | Add jobsRequeue atomic counter and jobsRequeueDesc descriptor; implement CountJobRequeue() method; update newStatsExporter(), Describe(), and Collect() to register and emit the new metric; switch existing metrics from GaugeValue to CounterValue. |
| Plugin and listener outcome dispatch to metrics (listener.go) | Add import alias for protocol package; capture outcome from Handle() in Plugin.Execute; dispatch on outcome to increment CountJobOk(), CountJobErr(), or CountJobRequeue() with debug logging; remove unconditional success counting. |
| Response handler outcome and requeue tests (protocol/handler_test.go) | Add fakeJob test mock and discardLogger() helper; implement TestHandleOutcome() table-driven test covering all response types and verifying returned Outcome and job state; add TestHandleRequeueError() to verify error propagation when Requeue() fails. |
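
The Outcome type summarized above could look roughly like the sketch below. The constant names and the String() method come from the walkthrough; the underlying int type, the iota ordering, and the exact strings returned are assumptions.

```go
package main

import "fmt"

// Outcome reports what the response handler did with a job.
// The underlying type and the ordering of the values are assumptions.
type Outcome int

const (
	OutcomeOK Outcome = iota
	OutcomeFailed
	OutcomeRequeued
)

// String makes outcomes readable in logs; the exact strings are assumed.
func (o Outcome) String() string {
	switch o {
	case OutcomeOK:
		return "ok"
	case OutcomeFailed:
		return "failed"
	case OutcomeRequeued:
		return "requeued"
	default:
		return "unknown"
	}
}

func main() {
	// fmt uses the String method because Outcome implements fmt.Stringer.
	fmt.Println(OutcomeOK, OutcomeFailed, OutcomeRequeued)
}
```

A three-value enum lets a caller tell a terminal failure from a re-queue, which a single boolean return value cannot express.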

CI and Integration Test Infrastructure

| Layer / File(s) | Summary |
| --- | --- |
| LocalStack configuration and GitHub Actions health polling (tests/env/docker-compose-jobs.yaml, .github/workflows/linux.yml) | Pin LocalStack Docker image to 4.14.0; enable eager SQS loading via environment config; replace fixed sleep with 120-second health-check polling loop that verifies SQS service is available before running tests. |
| Integration test metric assertions and skip (tests/jobs_general_test.go) | Add assertions that rr_jobs_jobs_requeue metric equals 0 at three metric snapshot checkpoints in TestJOBSMetrics; skip TestIssue2085 with inline comments explaining legacy client mismatch. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • roadrunner-server/jobs#127: Introduces the ACK/NACK/REQUEUE protocol feature that this PR builds upon with Outcome tracking and metric instrumentation.
  • roadrunner-server/jobs#128: Modifies protocol handler logic to support the new jobs API and requeue semantics, sharing core code paths with this PR.
  • roadrunner-server/jobs#139: Implements multi-pool support that touches Plugin.Execute and related configuration/routing paths in the same codebase.

Suggested labels

enhancement

Suggested reviewers

  • wolfy-j

Poem

🐰 Hops with glee through metric trails,
Requeued jobs now tell their tales,
Outcome states in protocol's hand,
Health checks make CI grand,
LocalStack's ready, tests align!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.50% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | Title 'feature: add rr_jobs_jobs_requeue metric' directly and clearly describes the main change: adding a new requeue metric to the JOBS plugin. |
| Description check | ✅ Passed | PR description clearly explains the motivation, implementation approach, and testing strategy, though the template's checklist items are not explicitly addressed. |
| Linked Issues check | ✅ Passed | PR fully implements issue #1566 requirement to distinguish re-queued jobs from failed ones by introducing a dedicated rr_jobs_jobs_requeue metric. |
| Out of Scope Changes check | ✅ Passed | All changes directly support the requeue metric feature: workflow improvements for testing reliability, protocol handler refactoring for outcome tracking, metrics instrumentation, and comprehensive tests. |

Copilot AI left a comment

Pull request overview

Adds a dedicated rr_jobs_jobs_requeue metric to make job requeue outcomes observable in Prometheus, and wires the protocol response handler to emit this signal for requeue-related responses.

Changes:

  • Introduces jobs_requeue tracking in the JOBS Prometheus exporter and asserts it in the metrics integration test.
  • Extends the protocol response handler to accept a small metrics interface and increment requeue counts on Error+requeue, NACK+requeue, and explicit REQUEUE.
  • Adds unit tests covering response→requeue-metric mapping and ensuring failed driver requeues are not counted.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/jobs_general_test.go | Asserts the new rr_jobs_jobs_requeue metric is present in exported metrics output. |
| protocol/handler.go | Adds a MetricsCounter interface and passes it into the response handler. |
| protocol/handler_test.go | Adds unit tests validating requeue metric increments for relevant response types. |
| protocol/error.go | Increments the requeue metric on successful requeue paths. |
| plugin.go | Wires plugin metrics exporter into the protocol response handler. |
| metrics.go | Adds jobs_requeue exporter state + descriptor and emits it during collection. |

Comment thread protocol/error.go Outdated
rustatian added 2 commits June 2, 2026 23:35
Handle now reports whether the job was re-queued; the listener increments jobs_ok only when it wasn't, so a re-queue is counted once (jobs_requeue) instead of also bumping jobs_ok.
TestJobsInit ran SQS against an untagged localstack/localstack image with only a blind sleep 30 for readiness, causing intermittent 'SQS CreateQueue: context deadline exceeded'. Pin LocalStack to 4.14.0 (SERVICES=sqs + eager loading) and replace the sleep with a bounded /_localstack/health SQS readiness gate that fails fast with the container logs.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/linux.yml:
- Line 69: The jq readiness check uses test("available|running") which can match
substrings like "unavailable"; update the regex used in that check (the
test(...) invocation) to anchor the alternatives so only exact statuses match
(e.g. use ^(...)$ or word boundaries around available|running) so the CI gate
only passes when SQS is actually "available" or "running".

In `@metrics.go`:
- Around line 111-115: The metrics exported by statsExporter.Collect() are
currently created with prometheus.GaugeValue but represent
monotonically-incremented counters (se.jobsOk, se.jobsErr, se.jobsRequeue,
se.pushOk, se.pushErr), so change their metric type to prometheus.CounterValue
when calling prometheus.MustNewConstMetric for se.jobsOkDesc, se.jobsErrDesc,
se.jobsRequeueDesc, se.pushOkDesc and se.pushErrDesc to ensure Prometheus treats
them as counters (use the same value retrieval via .Load() but pass
prometheus.CounterValue). Ensure the change is applied where
statsExporter.Collect() emits these metrics.

In `@protocol/handler.go`:
- Around line 64-68: Replace the boolean "requeued" return from
RespHandler.Handle with a small outcome enum (e.g., Outcome{OK, Failed,
Requeued}) so callers can distinguish success vs terminal failure vs requeue;
change Handle's signature to return (Outcome, error), update the implementations
that call/return from handleErrResp and handleNackResponse to return the correct
Outcome value (handleErrResp -> Failed or Requeued as appropriate,
handleNackResponse -> Failed or Requeued), and update the caller(s) such as the
listener logic that currently checks the boolean to instead branch on Outcome to
increment the correct metric (jobs_ok vs jobs_requeue vs final failures).
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 46b5bdfc-da9e-403a-9396-09d35fd86c83

📥 Commits

Reviewing files that changed from the base of the PR and between dd3c924 and 424a4ab.

📒 Files selected for processing (9)
  • .github/workflows/linux.yml
  • listener.go
  • metrics.go
  • plugin.go
  • protocol/error.go
  • protocol/handler.go
  • protocol/handler_test.go
  • tests/env/docker-compose-jobs.yaml
  • tests/jobs_general_test.go

Comment thread .github/workflows/linux.yml Outdated
Comment thread metrics.go Outdated
Comment thread protocol/handler.go Outdated
@rustatian rustatian self-assigned this Jun 3, 2026
rustatian added 4 commits June 3, 2026 17:38
The jobs RPC migrated to Connect-RPC (jobs.v2.JobsService), but the PHP spiral/roadrunner-jobs client (v4.7.0, latest) still calls the legacy goridge jobs.List method, so $jobs->count() in server.on_init misreads the Connect response and OOMs the worker. Skip pending a Connect-RPC-capable PHP client; tracked in roadrunner#2085. Unrelated to the requeue metric and SQS CI changes in this PR.
The unanchored test("available|running") also matched substrings such as "unavailable", so the gate could pass before SQS was ready. Anchor with ^(...)$ to match only the exact ready states.
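
The anchoring problem can be reproduced outside jq. The sketch below uses Go's regexp package purely as an illustration (the workflow itself uses jq's test(), whose regex engine differs, but alternation and anchors behave the same way for these inputs); the variable names are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// looseStatus mirrors the original unanchored pattern.
var looseStatus = regexp.MustCompile(`available|running`)

// strictStatus mirrors the anchored fix: the whole string must match.
var strictStatus = regexp.MustCompile(`^(available|running)$`)

func main() {
	for _, status := range []string{"available", "running", "unavailable"} {
		fmt.Printf("%s: loose=%t strict=%t\n",
			status,
			looseStatus.MatchString(status),
			strictStatus.MatchString(status))
	}
}
```

The unanchored pattern accepts "unavailable" because it contains "available" as a substring; the anchored pattern rejects it.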
rr_jobs_jobs_{ok,err,requeue} and rr_jobs_push_{ok,err} increment monotonically, so export them as prometheus.CounterValue (were GaugeValue); gauges break PromQL rate()/increase().
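
A standard-library-only sketch of the counting side of such an exporter, to show why these values only ever go up. The field and method names follow the review walkthrough, but the use of atomic.Uint64 and the snapshot helper are assumptions; the real exporter emits the value through the Prometheus client, as noted in the comment.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// statsExporter holds only the requeue counter here; the real exporter in
// metrics.go tracks more values and implements the Prometheus collector.
type statsExporter struct {
	jobsRequeue atomic.Uint64 // assumption: the PR may use a different atomic form
}

// CountJobRequeue only ever adds, so the exported series is monotonic.
func (se *statsExporter) CountJobRequeue() { se.jobsRequeue.Add(1) }

// snapshot is a hypothetical helper returning the value Collect would emit.
// With the Prometheus client, that emission would use something like
// prometheus.MustNewConstMetric(desc, prometheus.CounterValue, value).
func (se *statsExporter) snapshot() float64 {
	return float64(se.jobsRequeue.Load())
}

func main() {
	se := &statsExporter{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			se.CountJobRequeue()
		}()
	}
	wg.Wait()
	fmt.Println(se.snapshot())
}
```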
Handle returns an Outcome enum (ok/failed/requeued) instead of a requeued bool; the listener records the matching metric. Terminal failures (Error or NACK without requeue) now count as rr_jobs_jobs_err instead of rr_jobs_jobs_ok. The protocol package no longer depends on a metrics interface.
@rustatian rustatian merged commit e57be7c into master Jun 3, 2026
7 checks passed
@rustatian rustatian deleted the feature/jobs-requeue-metric branch June 3, 2026 17:50
Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[💡 FEATURE REQUEST]: Add requeue tag to failed task metric of Jobs plugin

2 participants