fix: rank fallback endpoints by least-stale instead of fully-random by oten91 · Pull Request #512 · pokt-network/path

oten91 · 2026-05-01T11:02:26Z

Summary

When all session endpoints fail QoS validation, PATH fell back to picking a random endpoint from the unfiltered pool. Random selection treated a far-behind endpoint exactly the same as a slightly-behind one, which concentrated fallback traffic on whichever stable single-endpoint suppliers happened to survive in the pool.

We confirmed this on mainnet against poly: all 50 endpoints sat at score=100 / tier=1 with no cooldowns, yet per-endpoint RPS ranged from ~8 (20-endpoint domains) to ~384 (1-endpoint domains). Cross-referencing path_qos_filter_rejection_total against /ready/poly?detailed=true traced the lopsided distribution to the random fallback path, which was hit whenever transient block-height-lag observations rejected most candidates simultaneously.

What changes

Replaces STANDARD_FALLBACK_RANDOM (and the archival / cosmos / solana equivalents) with selectLeastStaleEndpoints:

Scores each candidate by perceivedBlockHeight - endpointBlockHeight
Random-shuffles, then stable-sorts so endpoints with known heights always rank above no-data candidates
Random tie-break across endpoints at the same staleness — distribution is preserved within ties
Cold start (no chain context, sync_allowance=0, perceivedBlock=0) degrades gracefully to the prior random behavior

EVM uses the existing URL-block-height map (and extracts it as a shared buildURLBlockHeightMap helper, deduplicating two prior call sites). Cosmos and solana use per-endpoint block heights from their respective state structs (endpoint.checkCometBFTStatus.latestBlockHeight and endpoint.SolanaGetEpochInfoResponse.BlockHeight).
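A URL-keyed height map of this kind might look like the sketch below. The PR does not show the body of `buildURLBlockHeightMap`, so everything here is hypothetical: the `observedEndpoint` type, the `buildURLHeights` name, and in particular the merge rule (keep the highest height seen for each URL) are assumptions chosen only to illustrate the idea.

```go
package main

import "fmt"

// observedEndpoint is a hypothetical record of one endpoint's last observation.
type observedEndpoint struct {
	Addr        string // hypothetical unique endpoint key
	URL         string // backing URL, possibly shared by several endpoints
	BlockHeight uint64 // 0 means no height has been observed
}

// buildURLHeights maps each URL to the highest block height observed on any
// endpoint that uses it. Taking the maximum is an assumption of this sketch.
// URLs with no observed height get no entry.
func buildURLHeights(eps []observedEndpoint) map[string]uint64 {
	heights := make(map[string]uint64, len(eps))
	for _, ep := range eps {
		if ep.BlockHeight > heights[ep.URL] {
			heights[ep.URL] = ep.BlockHeight
		}
	}
	return heights
}

func main() {
	heights := buildURLHeights([]observedEndpoint{
		{Addr: "ep-1", URL: "https://node-a.example", BlockHeight: 100},
		{Addr: "ep-2", URL: "https://node-a.example"}, // no observation of its own yet
		{Addr: "ep-3", URL: "https://node-b.example", BlockHeight: 90},
	})
	// ep-2 has no height of its own but can be scored through its URL.
	fmt.Println(heights["https://node-a.example"], heights["https://node-b.example"])
}
```

Keying by URL lets an endpoint with no observations of its own inherit what is already known about the infrastructure behind it.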

What this does NOT change

The fresh-endpoint-stale-infra bug (cosmos/solana don't yet have a URL-block-height map; EVM does). That's a separate fix worth its own PR — see comment thread for the rationale.
noop QoS still uses RandomSelectMultiple because noop has no validation / perceived state / block-height concept. Random there is correct, not a bug.

Test plan

go test ./qos/... ./protocol/... -short -count=1 — all green
go vet ./qos/... — clean
New tests: 9 total (3 per QoS) covering least-stale ranking, no-data deprioritization, and tie-break distribution
Canary soak: confirm per-endpoint RPS distribution flattens for services that hit fallback frequently (e.g., poly, where stable single-endpoint suppliers had been winning disproportionately)
Confirm dashboard path_qos_filter_rejection_total{reason="block_height_lag"} rate is unchanged (filter rejection rate shouldn't move; only the post-rejection routing distribution changes)

🤖 Generated with Claude Code

When all session endpoints fail QoS validation, the prior fallback picked randomly from the unfiltered pool: a far-behind endpoint and a slightly-behind endpoint had equal probability. This concentrated traffic on whichever stable single-endpoint suppliers happened to survive in the pool, giving them a hedge-magnet boost they hadn't earned and starving suppliers with more endpoints that had transient block-height-lag observations.

Replaces both the EVM `STANDARD_FALLBACK_RANDOM` / `ARCHIVAL_FALLBACK_RANDOM` paths and their cosmos/solana equivalents with `selectLeastStaleEndpoints`, which scores each candidate by `perceivedBlockHeight - endpointBlockHeight`, random-shuffles, then stable-sorts so:

- endpoints with known block heights always rank above no-data candidates
- within ties, traffic spreads randomly across equally-stale endpoints
- cold start (no chain context) degrades gracefully to prior random behavior

EVM also extracts the URL→block-height map build into a shared `buildURLBlockHeightMap` helper, deduplicating logic that previously lived in `filterValidEndpointsWithDetails` and `filterStaleURLEndpoints`.

Adds test coverage for ranking, no-data deprioritization, and tie-break distribution across all three QoS implementations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…pliers

The hedge race fan-out was double-counting in path_relays_total: both the primary AND the hedge branch's relay completed and called RecordRelay with request_type="normal", inflating the visible RPS for whichever endpoint happens to be picked as hedge most often (typically the lowest-latency one).

Symptom observed on canary at 22m post-deploy of #512: rpcgate.xyz with 1 endpoint at 431 RPS while stakenodes.org with 21 endpoints averaged 7.7 RPS per endpoint. Per-endpoint share was 17x off baseline despite uniform-random tier-1 selection. The skew traced to (a) hedge metric double-count and (b) hedge losers getting latency-penalty signals for slowness that only mattered inside the race they already lost.

This change:

- Adds RelayTypeHedge metric constant
- Adds MarkAsHedge() to ProtocolRequestContext interface
- Hedge branch is tagged before HandleServiceRequest; protocol layer records the relay under request_type="hedge" instead of "normal"
- Skips recordLatencyPenaltySignalsIfNeeded for hedge branches: far-from-gateway endpoints no longer accumulate slow-response reputation penalties for losing hedge races they only entered because of distance
- Reputation success/error signals still fire so endpoints get fair feedback about whether they actually work

Dashboards filtering request_type="normal" by default now show fair primary-traffic distribution. Operators can flip the variable to "hedge" to see the backup-attempt side independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

oten91 · 2026-05-01T11:57:35Z

Update: bundled hedge-tagging fix into this PR

22-min canary observation surfaced a separate skew the original PR didn't address:

Domain	Eps	RPS	RPS/ep	p50
rpcgate.xyz	1	431	431	26.4ms
stakenodes.org	21	161	7.7	178ms

Selection within tier-1 IS uniform random, but path_relays_total was double-counting hedge fan-out (both primary AND hedge branch hit RecordRelay with request_type="normal"). Plus hedge losers were getting latency-penalty reputation signals for slowness that only mattered inside the race they already lost — gradually drifting far-from-gateway endpoints downward in tier rank.
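The RPS/ep column in the table above is just each domain's RPS divided by its endpoint count. A small sketch of that normalization, using the two rows from the table (the `domainLoad` type and `rpsPerEndpoint` helper are names invented for this example):

```go
package main

import "fmt"

// domainLoad is a hypothetical row of the canary observation table.
type domainLoad struct {
	Domain    string
	Endpoints int
	RPS       float64
}

// rpsPerEndpoint normalizes a domain's RPS by its endpoint count so that
// domains of different sizes can be compared directly.
func rpsPerEndpoint(d domainLoad) float64 {
	if d.Endpoints == 0 {
		return 0
	}
	return d.RPS / float64(d.Endpoints)
}

func main() {
	loads := []domainLoad{
		{Domain: "rpcgate.xyz", Endpoints: 1, RPS: 431},
		{Domain: "stakenodes.org", Endpoints: 21, RPS: 161},
	}
	for _, d := range loads {
		fmt.Printf("%s %.1f\n", d.Domain, rpsPerEndpoint(d))
	}
}
```

This prints 431.0 for rpcgate.xyz and 7.7 for stakenodes.org, matching the table.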

Fix (commit `94283d9`)

New RelayTypeHedge = "hedge" constant
New MarkAsHedge() method on ProtocolRequestContext; hedge branch is tagged before HandleServiceRequest
Protocol layer records hedge relays under request_type="hedge" instead of "normal"
recordLatencyPenaltySignalsIfNeeded skipped on hedge branches — geography doesn't get penalized
Reputation success/error signals still fire (fair feedback about whether the endpoint works)
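The behavior listed above can be illustrated with a self-contained sketch. Only `MarkAsHedge` and the `"normal"` / `"hedge"` label values come from this PR; the `requestContext` and `relayRecorder` types and the `record` method are simplified stand-ins invented for the example, not PATH's actual types.

```go
package main

import "fmt"

const (
	relayTypeNormal = "normal"
	relayTypeHedge  = "hedge" // mirrors the RelayTypeHedge value named above
)

// requestContext is a hypothetical, stripped-down request context.
type requestContext struct{ hedge bool }

// MarkAsHedge tags this context as the hedge branch. Calling it twice is harmless.
func (rc *requestContext) MarkAsHedge() { rc.hedge = true }

// relayRecorder is a hypothetical stand-in for the metrics and reputation sinks.
type relayRecorder struct {
	relaysByType      map[string]int
	reputationSignals int
	latencyPenalties  int
}

// record counts the relay under its request_type, always emits a reputation
// success/error signal, and applies a latency penalty only on non-hedge branches.
func (r *relayRecorder) record(rc *requestContext, slow bool) {
	relayType := relayTypeNormal
	if rc.hedge {
		relayType = relayTypeHedge
	}
	r.relaysByType[relayType]++
	r.reputationSignals++
	if slow && !rc.hedge {
		r.latencyPenalties++
	}
}

func main() {
	rec := &relayRecorder{relaysByType: map[string]int{}}
	primary := &requestContext{}
	hedge := &requestContext{}
	hedge.MarkAsHedge() // tag before the request is handled

	rec.record(primary, true)
	rec.record(hedge, true)
	fmt.Println(rec.relaysByType[relayTypeNormal], rec.relaysByType[relayTypeHedge],
		rec.reputationSignals, rec.latencyPenalties)
}
```

Running it prints `1 1 2 1`: one relay per label instead of two under "normal", both branches produce a reputation signal, and only the primary branch is penalized for being slow.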

Dashboard impact

path-quality-dashboard-public-pinned.json filters request_type="normal" by default → operators see fair primary-traffic distribution
Internal variable-rich dashboard auto-discovers "hedge" in the request_type template variable; can flip to inspect hedge layer independently
"Relays by Type" pie chart will show a 4th slice (hedge) once deployed → operator visibility into how much hedging is happening per service

Test coverage

protocol/shannon/hedge_tagging_test.go: compile-time interface assertion + MarkAsHedge() flag behavior + idempotency
go test ./... -short green; go vet ./... clean

Will be picked up by the scheduled 1h soak check at ~14:29 Europe/Madrid.

oten91 · 2026-05-01T12:31:50Z

🟢 Canary soak verdict — GO for mainnet (with one caveat)

Soak ran from 2026-05-01T09:10Z to 2026-05-01T12:29Z. Replicaset gateway-84f9cb89cc (least-stale fix only) ran for ~3h, then gateway-6bb94c84b9 (added hedge-tagging fix) rolled at ~12:15 and is currently 14 minutes into its own bake.

Go/no-go signals

Signal	Threshold	Measured	Verdict
Canonical leak: spacebelt/solana req/s	MUST = 0	1 total relay in 14m on observed pod (~0.001/s) vs 0.00476 pre-fix mainnet	✅
Pod restarts on new replicaset	0	0 across all 15 pods	✅
poly success rate (request_type=normal)	≥ baseline−2pp	96.12% (n=45,996)	✅
solana success rate (request_type=normal)	≥ baseline−2pp	96.89% (n=6,308)	✅
eth success rate (request_type=normal)	≥ baseline−2pp	91.63% (n=9,258)	⚠️ low absolute, but no regression evidence
`request_type="hedge"` series present	yes	yes — ~100-600 samples per top domain	✅
Single-endpoint domain >30% on poly normal	flag if any	rpcgate (1ep) at 25.1%, spacebelt (2ep) at 36.1%	⚠️ cold-start, not regression — see below

Anomalies

poly normal-traffic distribution is still skewed: spacebelt 36.1% / rpcgate 25.1% on a 14-min-old pod. Reputation hasn't equilibrated: so far only domains that earned early successes are in tier 1. Easy2stake (19 eps) at 36.1% is fair-share; the others are bootstrap artifacts and will flatten as the pod warms up. Not a regression caused by either fix.
Hedge fix has only 14 min of soak vs 3h on the least-stale fix. The least-stale fix is the larger code surface, so the well-soaked one is the riskier of the two — this works in our favor.

Recommendation

GO for mainnet deploy. The canonical bug fix is verified live (spacebelt-on-solana leak closed) and no health regressions are visible. Suggest deploying with the same observation pattern: watch path_relays_total{service_id="solana", domain="spacebelt.xyz", request_type="normal"} post-deploy — it should stay near 0.

Deploy command (whatever is your standard mainnet promotion):

# Whichever script promotes the canary image SHA to mainnet

Soak caveat to note

The hedge-tagging commit landed mid-soak. If you want a strict 1h-on-current-binary check before mainnet, wait until ~13:15Z and re-run. If you trust the squash-merge approach (both commits ship together either way), deploying now is fine — the canonical signal is unambiguous.

oten91 · 2026-05-01T12:39:49Z

✅ Mainnet deploy verified healthy

Rolled to mainnet at 2026-05-01T12:36Z, replicaset gateway-5b7df945c. All 11 pods up, 0 restarts.

Check	Status
Rollout	✅ 11/11 ready, 0 restarts
`request_type="hedge"` tagging	✅ confirmed live in metrics
Cooldown enforcement	✅ `nf.relayminer.` + 1 `non-custodial.` correctly at score=0 / in_cooldown
Solana spacebelt	recovered to score=100 across 3 endpoints — legitimately tier-1, getting fair-share traffic (4% vs 6% endpoint share)
Hedge magnet visibility	spacebelt: 395 hedge / 115 normal on observed pod — dashboard's `request_type="normal"` filter now shows fair distribution
Service success rates	poly 93.7%, solana 95.1%, eth 88.1% — within tolerance

The "spacebelt = 0 req/s" canonical check was canary-specific (their score was still 0 on canary at observation time). On mainnet they've already recovered, which is the correct outcome — the fix doesn't suppress legitimate traffic to recovered endpoints, it just stops the fallback bypass when they're below threshold.

Cold-start skew on poly normal-traffic distribution (rpcgate at 61%) is the expected bootstrap artifact and will flatten as reputation equilibrates over the next 30-60 min. Not a regression.

PR can be marked merged. Closing the loop on the canary soak prediction: GO was correct.

… Prometheus metrics (#513)

## Summary

Two metrics that were previously only observable via `/ready` introspection or direct Redis access are now first-class Prometheus gauges:

- **`path_circuit_breaker_state{service_id, domain}`**: 1 if domain is currently locked out, 0 otherwise
- **`path_endpoints_in_cooldown{domain, rpc_type, service_id}`**: count of endpoints currently in strike cooldown

Both metrics fill real operational visibility gaps that came up during PR #512 work: operators could see broken-domain effects in error-rate graphs but couldn't directly answer "which domains are circuit-broken right now?" without Redis access.

## What this enables on dashboards

Once Grafana picks these up:

- **"Currently Broken Domains" table**: list-of-domains view with service_id + domain + state, sortable by service.
- **"Broken Domains" stat tile**: single number for at-a-glance health (0 = healthy, >5 = widespread infra issues).
- **"Cooldown" column** on the Supplier Quality table: operators can see "how many of my endpoints are in cooldown right now?" alongside RPS, success%, etc.

These dashboard panels are not in this PR (kept to the metric-only diff for review clarity); they'll go in a small follow-up commit that updates `local/observability/dashboards/*.json` once this metric is deployed.

## Implementation notes

### `path_circuit_breaker_state`

State transitions in `gateway/domain_circuit_breaker.go`:

| Trigger | Gauge becomes |
|---|---|
| `MarkBroken` | 1 |
| `ClearService` | 0 (for every cleared domain) |
| `refreshLocal` finds expired entries | 0 (for each expired domain) |
| `refreshFromRedis` finds expired entries | 0 (for each), and re-asserts 1 for currently-broken domains so fresh pods that lazily pick up Redis state stay consistent without going through MarkBroken locally |

All gauge sets happen *outside* `cb.mu` to avoid taking the metrics lock under the cache mutex.

There is a small inherent staleness window: a circuit-breaker entry whose TTL just expired remains at gauge=1 until the cache TTL elapses (5s default) and `refreshLocal` runs. That's bounded and fine for a dashboard.

### `path_endpoints_in_cooldown`

New `LeaderboardDataProvider.GetCooldownCountData(ctx)` method, implemented on Shannon's `Protocol`. Walks active sessions, fetches each endpoint's reputation score, increments per-(domain, service_id, rpc_type) when `score.IsInCooldown()` returns true.

Published every 10s alongside the existing leaderboard / mean score / supplier score metrics. Resets between snapshots so a domain dropping to zero cooldown'd endpoints shows zero (rather than sticking at its last value via Prometheus' 5-min staleness window).

## Test plan

- [x] `go test ./...`: all green
- [x] `go vet ./...`: clean
- [x] New test: `TestDomainCircuitBreaker_MetricGaugeTransitions` covers MarkBroken → 1, ClearService → 0, TTL expiry + refresh → 0
- [ ] Canary deploy: verify `path_circuit_breaker_state` and `path_endpoints_in_cooldown` series appear in Prometheus
- [ ] Trigger a circuit break (or use `/admin/circuit-breaker/clear/{serviceId}` to test the clear path) and confirm the gauge transitions
- [ ] Verify staleness behavior: entry expires → gauge drops to 0 within ~5s

## Cardinality

Both new metrics are bounded by `services × domains`, which is already low cardinality (~50-200 unique domains × ~80 services on mainnet). No risk of cardinality explosion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

oten91 and others added 2 commits May 1, 2026 13:01

oten91 merged commit cda5557 into main May 1, 2026
11 of 13 checks passed

oten91 deleted the fix/standard-fallback-least-stale branch May 1, 2026 13:38

oten91 mentioned this pull request May 1, 2026

feat: surface circuit-breaker state and per-domain cooldown counts as Prometheus metrics #513

Merged

6 tasks
