Skip to content

fix: rank fallback endpoints by least-stale instead of fully-random#512

Merged
oten91 merged 2 commits into
mainfrom
fix/standard-fallback-least-stale
May 1, 2026
Merged

fix: rank fallback endpoints by least-stale instead of fully-random#512
oten91 merged 2 commits into
mainfrom
fix/standard-fallback-least-stale

Conversation

@oten91

@oten91 oten91 commented May 1, 2026

Copy link
Copy Markdown
Contributor

Summary

When all session endpoints fail QoS validation, PATH fell back to picking a random endpoint from the unfiltered pool. Random selection treated a far-behind endpoint exactly the same as a slightly-behind one, which concentrated fallback traffic on whichever stable single-endpoint suppliers happened to survive in the pool.

We confirmed this in mainnet against poly: 50 endpoints, all at score=100 / tier=1 / no cooldowns, but per-endpoint RPS spread from ~8 (20-endpoint domains) to ~384 (1-endpoint domains). Cross-referencing path_qos_filter_rejection_total against /ready/poly?detailed=true showed the lopsided distribution traced back to the random fallback path being hit when transient block-height-lag observations rejected most candidates simultaneously.

What changes

Replaces STANDARD_FALLBACK_RANDOM (and the archival / cosmos / solana equivalents) with selectLeastStaleEndpoints:

  • Scores each candidate by perceivedBlockHeight - endpointBlockHeight
  • Random-shuffles, then stable-sorts so endpoints with known heights always rank above no-data candidates
  • Random tie-break across endpoints at the same staleness — distribution is preserved within ties
  • Cold start (no chain context, sync_allowance=0, perceivedBlock=0) degrades gracefully to the prior random behavior

EVM uses the existing URL-block-height map (and extracts it as a shared buildURLBlockHeightMap helper, deduplicating two prior call sites). Cosmos and solana use per-endpoint block heights from their respective state structs (endpoint.checkCometBFTStatus.latestBlockHeight and endpoint.SolanaGetEpochInfoResponse.BlockHeight).

What this does NOT change

  • The fresh-endpoint-stale-infra bug (cosmos/solana don't yet have a URL-block-height map; EVM does). That's a separate fix worth its own PR — see comment thread for the rationale.
  • noop QoS still uses RandomSelectMultiple because noop has no validation / perceived state / block-height concept. Random there is correct, not a bug.

Test plan

  • go test ./qos/... ./protocol/... -short -count=1 — all green
  • go vet ./qos/... — clean
  • New tests: 9 total (3 per QoS) covering least-stale ranking, no-data deprioritization, and tie-break distribution
  • Canary soak: confirm per-endpoint RPS distribution flattens for services that hit fallback frequently (e.g., poly, where stable single-endpoint suppliers had been winning disproportionately)
  • Confirm dashboard path_qos_filter_rejection_total{reason="block_height_lag"} rate is unchanged (filter rejection rate shouldn't move; only the post-rejection routing distribution changes)

🤖 Generated with Claude Code

oten91 and others added 2 commits May 1, 2026 13:01
When all session endpoints fail QoS validation, the prior fallback picked
randomly from the unfiltered pool — a far-behind endpoint and a slightly-behind
endpoint had equal probability. This concentrated traffic on whichever stable
single-endpoint suppliers happened to survive in the pool, giving them a
hedge-magnet boost they hadn't earned and starving suppliers with more endpoints
that had transient block-height-lag observations.

Replaces both the EVM `STANDARD_FALLBACK_RANDOM` / `ARCHIVAL_FALLBACK_RANDOM`
paths and their cosmos/solana equivalents with `selectLeastStaleEndpoints`,
which scores each candidate by `perceivedBlockHeight - endpointBlockHeight`,
random-shuffles, then stable-sorts so:
  - endpoints with known block heights always rank above no-data candidates
  - within ties, traffic spreads randomly across equally-stale endpoints
  - cold start (no chain context) degrades gracefully to prior random behavior

EVM also extracts the URL→block-height map build into a shared
`buildURLBlockHeightMap` helper, deduplicating logic that previously lived in
`filterValidEndpointsWithDetails` and `filterStaleURLEndpoints`.

Adds test coverage for ranking, no-data deprioritization, and tie-break
distribution across all three QoS implementations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…pliers

The hedge race fan-out was double-counting in path_relays_total: both the
primary AND the hedge branch's relay completed and called RecordRelay with
request_type="normal", inflating the visible RPS for whichever endpoint
happens to be picked as hedge most often (typically the lowest-latency one).

Symptom observed on canary at 22m post-deploy of #512: rpcgate.xyz with 1
endpoint at 431 RPS while stakenodes.org with 21 endpoints averaged 7.7 RPS
per endpoint. Per-endpoint share was 17x off baseline despite uniform-random
tier-1 selection. The skew traced to (a) hedge metric double-count and (b)
hedge losers getting latency-penalty signals for slowness that only mattered
inside the race they already lost.

This change:
- Adds RelayTypeHedge metric constant
- Adds MarkAsHedge() to ProtocolRequestContext interface
- Hedge branch is tagged before HandleServiceRequest; protocol layer records
  the relay under request_type="hedge" instead of "normal"
- Skips recordLatencyPenaltySignalsIfNeeded for hedge branches — far-from-
  gateway endpoints no longer accumulate slow-response reputation penalties
  for losing hedge races they only entered because of distance
- Reputation success/error signals still fire so endpoints get fair feedback
  about whether they actually work

Dashboards filtering request_type="normal" by default now show fair primary-
traffic distribution. Operators can flip the variable to "hedge" to see the
backup-attempt side independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@oten91

oten91 commented May 1, 2026

Copy link
Copy Markdown
Contributor Author

Update: bundled hedge-tagging fix into this PR

22-min canary observation surfaced a separate skew the original PR didn't address:

Domain Eps RPS RPS/ep p50
rpcgate.xyz 1 431 431 26.4ms
stakenodes.org 21 161 7.7 178ms

Selection within tier-1 IS uniform random, but path_relays_total was double-counting hedge fan-out (both primary AND hedge branch hit RecordRelay with request_type="normal"). Plus hedge losers were getting latency-penalty reputation signals for slowness that only mattered inside the race they already lost — gradually drifting far-from-gateway endpoints downward in tier rank.

Fix (commit 94283d9)

  • New RelayTypeHedge = "hedge" constant
  • New MarkAsHedge() method on ProtocolRequestContext; hedge branch is tagged before HandleServiceRequest
  • Protocol layer records hedge relays under request_type="hedge" instead of "normal"
  • recordLatencyPenaltySignalsIfNeeded skipped on hedge branches — geography doesn't get penalized
  • Reputation success/error signals still fire (fair feedback about whether the endpoint works)

Dashboard impact

  • path-quality-dashboard-public-pinned.json filters request_type="normal" by default → operators see fair primary-traffic distribution
  • Internal variable-rich dashboard auto-discovers "hedge" in the request_type template variable; can flip to inspect hedge layer independently
  • "Relays by Type" pie chart will show a 4th slice (hedge) once deployed → operator visibility into how much hedging is happening per service

Test coverage

  • protocol/shannon/hedge_tagging_test.go: compile-time interface assertion + MarkAsHedge() flag behavior + idempotency
  • go test ./... -short green; go vet ./... clean

Will be picked up by the scheduled 1h soak check at ~14:29 Europe/Madrid.

@oten91

oten91 commented May 1, 2026

Copy link
Copy Markdown
Contributor Author

🟢 Canary soak verdict — GO for mainnet (with one caveat)

Soak ran from 2026-05-01T09:10Z to 2026-05-01T12:29Z. Replicaset gateway-84f9cb89cc (least-stale fix only) ran for ~3h, then gateway-6bb94c84b9 (added hedge-tagging fix) rolled at ~12:15 and is currently 14 minutes into its own bake.

Go/no-go signals

Signal Threshold Measured Verdict
Canonical leak: spacebelt/solana req/s MUST = 0 1 total relay in 14m on observed pod (~0.001/s) vs 0.00476 pre-fix mainnet
Pod restarts on new replicaset 0 0 across all 15 pods
poly success rate (request_type=normal) ≥ baseline−2pp 96.12% (n=45,996)
solana success rate (request_type=normal) ≥ baseline−2pp 96.89% (n=6,308)
eth success rate (request_type=normal) ≥ baseline−2pp 91.63% (n=9,258) ⚠️ low absolute, but no regression evidence
request_type="hedge" series present yes yes — ~100-600 samples per top domain
Single-endpoint domain >30% on poly normal flag if any rpcgate (1ep) at 25.1%, spacebelt (2ep) at 36.1% ⚠️ cold-start, not regression — see below

Anomalies

  • poly normal-traffic distribution still skewed: spacebelt 36.1% / rpcgate 25.1% on a 14-min-old pod. Reputation hasn't equilibrated — only domains that earned early successes are in tier 1 yet. Easy2stake (19 eps) at 36.1% is fair-share; the others are bootstrap artifact. Will flatten as the pod warms up. Not a regression caused by either fix.
  • Hedge fix has only 14 min of soak vs 3h on the least-stale fix. The least-stale fix is the larger code surface, so the well-soaked one is the riskier of the two — this works in our favor.

Recommendation

GO for mainnet deploy. The canonical bug fix is verified live (spacebelt-on-solana leak closed) and no health regressions are visible. Suggest deploying with the same observation pattern: watch path_relays_total{service_id="solana", domain="spacebelt.xyz", request_type="normal"} post-deploy — it should stay near 0.

Deploy command (whatever is your standard mainnet promotion):

# Whichever script promotes the canary image SHA to mainnet

Soak caveat to note

The hedge-tagging commit landed mid-soak. If you want a strict 1h-on-current-binary check before mainnet, wait until ~13:15Z and re-run. If you trust the squash-merge approach (both commits ship together either way), deploying now is fine — the canonical signal is unambiguous.

@oten91

oten91 commented May 1, 2026

Copy link
Copy Markdown
Contributor Author

✅ Mainnet deploy verified healthy

Rolled to mainnet at 2026-05-01T12:36Z, replicaset gateway-5b7df945c. All 11 pods up, 0 restarts.

Check Status
Rollout ✅ 11/11 ready, 0 restarts
request_type="hedge" tagging ✅ confirmed live in metrics
Cooldown enforcement nf.relayminer.* + 1 non-custodial.* correctly at score=0 / in_cooldown
Solana spacebelt recovered to score=100 across 3 endpoints — legitimately tier-1, getting fair-share traffic (4% vs 6% endpoint share)
Hedge magnet visibility spacebelt: 395 hedge / 115 normal on observed pod — dashboard's request_type="normal" filter now shows fair distribution
Service success rates poly 93.7%, solana 95.1%, eth 88.1% — within tolerance

The "spacebelt = 0 req/s" canonical check was canary-specific (their score was still 0 on canary at observation time). On mainnet they've already recovered, which is the correct outcome — the fix doesn't suppress legitimate traffic to recovered endpoints, it just stops the fallback bypass when they're below threshold.

Cold-start skew on poly normal-traffic distribution (rpcgate at 61%) is the expected bootstrap artifact and will flatten as reputation equilibrates over the next 30-60 min. Not a regression.

PR can be marked merged. Closing the loop on the canary soak prediction: GO was correct.

@oten91 oten91 merged commit cda5557 into main May 1, 2026
11 of 13 checks passed
@oten91 oten91 deleted the fix/standard-fallback-least-stale branch May 1, 2026 13:38
oten91 added a commit that referenced this pull request May 6, 2026
… Prometheus metrics (#513)

## Summary

Two metrics that were previously only observable via `/ready`
introspection or direct Redis access are now first-class Prometheus
gauges:

- **`path_circuit_breaker_state{service_id, domain}`** — 1 if domain is
currently locked out, 0 otherwise
- **`path_endpoints_in_cooldown{domain, rpc_type, service_id}`** — count
of endpoints currently in strike cooldown

Both metrics fill real operational visibility gaps that came up during
PR #512 work — operators could see broken-domain effects in error-rate
graphs but couldn't directly answer "which domains are circuit-broken
right now?" without Redis access.

## What this enables on dashboards

Once Grafana picks these up:

- **"Currently Broken Domains" table** — list-of-domains view with
service_id + domain + state, sortable by service.
- **"Broken Domains" stat tile** — single number for at-a-glance health
(0 = healthy, >5 = widespread infra issues).
- **"Cooldown" column** on the Supplier Quality table — operators can
see "how many of my endpoints are in cooldown right now?" alongside RPS,
success%, etc.

These dashboard panels are not in this PR (kept to the metric-only diff
for review clarity); they'll go in a small follow-up commit that updates
`local/observability/dashboards/*.json` once this metric is deployed.

## Implementation notes

### `path_circuit_breaker_state`

State transitions in `gateway/domain_circuit_breaker.go`:

| Trigger | Gauge becomes |
|---|---|
| `MarkBroken` | 1 |
| `ClearService` | 0 (for every cleared domain) |
| `refreshLocal` finds expired entries | 0 (for each expired domain) |
| `refreshFromRedis` finds expired entries | 0 (for each), and
re-asserts 1 for currently-broken domains so fresh pods that lazily pick
up Redis state stay consistent without going through MarkBroken locally
|

All gauge sets happen *outside* `cb.mu` to avoid taking the metrics lock
under the cache mutex.

There is a small inherent staleness window: a circuit-breaker entry
whose TTL just expired remains at gauge=1 until the cache TTL elapses
(5s default) and `refreshLocal` runs. That's bounded and fine for a
dashboard.

### `path_endpoints_in_cooldown`

New `LeaderboardDataProvider.GetCooldownCountData(ctx)` method,
implemented on Shannon's `Protocol`. Walks active sessions, fetches each
endpoint's reputation score, increments per-(domain, service_id,
rpc_type) when `score.IsInCooldown()` returns true.

Published every 10s alongside the existing leaderboard / mean score /
supplier score metrics. Resets between snapshots so a domain dropping to
zero cooldown'd endpoints shows zero (rather than sticking at its last
value via Prometheus' 5-min staleness window).

## Test plan

- [x] `go test ./...` — all green
- [x] `go vet ./...` — clean  
- [x] New test: `TestDomainCircuitBreaker_MetricGaugeTransitions` covers
MarkBroken → 1, ClearService → 0, TTL expiry + refresh → 0
- [ ] Canary deploy: verify `path_circuit_breaker_state` and
`path_endpoints_in_cooldown` series appear in Prometheus
- [ ] Trigger a circuit break (or use
`/admin/circuit-breaker/clear/{serviceId}` to test the clear path) and
confirm the gauge transitions
- [ ] Verify staleness behavior: entry expires → gauge drops to 0 within
~5s

## Cardinality

Both new metrics are bounded by `services × domains` — already low
cardinality (~50-200 unique domains × ~80 services on mainnet). No risk
of cardinality explosion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant