fix: rank fallback endpoints by least-stale instead of fully-random #512
When all session endpoints fail QoS validation, the prior fallback picked randomly from the unfiltered pool: a far-behind endpoint and a slightly-behind endpoint had equal probability. This concentrated traffic on whichever stable single-endpoint suppliers happened to survive in the pool, giving them a hedge-magnet boost they hadn't earned and starving suppliers with more endpoints that had transient block-height-lag observations.

Replaces both the EVM `STANDARD_FALLBACK_RANDOM` / `ARCHIVAL_FALLBACK_RANDOM` paths and their cosmos/solana equivalents with `selectLeastStaleEndpoints`, which scores each candidate by `perceivedBlockHeight - endpointBlockHeight`, random-shuffles, then stable-sorts so:

- endpoints with known block heights always rank above no-data candidates
- within ties, traffic spreads randomly across equally-stale endpoints
- cold start (no chain context) degrades gracefully to prior random behavior

EVM also extracts the URL→block-height map build into a shared `buildURLBlockHeightMap` helper, deduplicating logic that previously lived in `filterValidEndpointsWithDetails` and `filterStaleURLEndpoints`.

Adds test coverage for ranking, no-data deprioritization, and tie-break distribution across all three QoS implementations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
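The score, shuffle, stable-sort sequence described in this commit can be sketched as follows. This is a minimal illustration, not the code in the PR: the `candidate` type, the `rankLeastStale` and `staleness` names, the use of a zero block height to mean "no data", and the choice to treat an endpoint ahead of the perceived height as zero lag are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// candidate is a hypothetical stand-in for a session endpoint.
// BlockHeight == 0 is assumed to mean "no block-height observation yet".
type candidate struct {
	URL         string
	BlockHeight uint64
}

// noData is the score given to candidates without a block height,
// so they always sort after every candidate with a known height.
const noData = ^uint64(0)

// staleness returns perceived - endpoint height, clamped at zero for
// endpoints that are ahead of the perceived height (an assumption).
func staleness(c candidate, perceived uint64) uint64 {
	if c.BlockHeight == 0 {
		return noData
	}
	if c.BlockHeight >= perceived {
		return 0
	}
	return perceived - c.BlockHeight
}

// rankLeastStale returns a copy of cands ordered from least to most stale.
func rankLeastStale(cands []candidate, perceived uint64) []candidate {
	ranked := append([]candidate(nil), cands...)
	// Shuffle first: the stable sort below keeps this random order
	// among candidates with equal staleness.
	rand.Shuffle(len(ranked), func(i, j int) {
		ranked[i], ranked[j] = ranked[j], ranked[i]
	})
	if perceived == 0 {
		// Cold start: no chain context, so fall back to plain random order.
		return ranked
	}
	sort.SliceStable(ranked, func(i, j int) bool {
		return staleness(ranked[i], perceived) < staleness(ranked[j], perceived)
	})
	return ranked
}

func main() {
	pool := []candidate{
		{URL: "far-behind", BlockHeight: 900},
		{URL: "no-data"},
		{URL: "slightly-behind", BlockHeight: 998},
	}
	for _, c := range rankLeastStale(pool, 1000) {
		fmt.Println(c.URL, staleness(c, 1000))
	}
}
```

With a perceived height of 1000, the three candidates have distinct scores, so the order is always slightly-behind, far-behind, no-data regardless of the shuffle. The shuffle only matters when two candidates share a score.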
…pliers

The hedge race fan-out was double-counting in path_relays_total: both the primary AND the hedge branch's relay completed and called RecordRelay with request_type="normal", inflating the visible RPS for whichever endpoint happens to be picked as hedge most often (typically the lowest-latency one).

Symptom observed on canary at 22m post-deploy of #512: rpcgate.xyz with 1 endpoint at 431 RPS while stakenodes.org with 21 endpoints averaged 7.7 RPS per endpoint. Per-endpoint share was 17x off baseline despite uniform-random tier-1 selection. The skew traced to (a) hedge metric double-count and (b) hedge losers getting latency-penalty signals for slowness that only mattered inside the race they already lost.

This change:

- Adds RelayTypeHedge metric constant
- Adds MarkAsHedge() to ProtocolRequestContext interface
- Hedge branch is tagged before HandleServiceRequest; protocol layer records the relay under request_type="hedge" instead of "normal"
- Skips recordLatencyPenaltySignalsIfNeeded for hedge branches, so far-from-gateway endpoints no longer accumulate slow-response reputation penalties for losing hedge races they only entered because of distance
- Reputation success/error signals still fire so endpoints get fair feedback about whether they actually work

Dashboards filtering request_type="normal" by default now show fair primary-traffic distribution. Operators can flip the variable to "hedge" to see the backup-attempt side independently.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
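The tagging behaviour listed in this commit can be illustrated with a small self-contained sketch. Only `MarkAsHedge` and the `"normal"` / `"hedge"` label values come from the commit message; the `requestContext` and `recorder` types and the `finishRelay` function are hypothetical stand-ins, not the PATH implementation.

```go
package main

import "fmt"

const (
	relayTypeNormal = "normal"
	relayTypeHedge  = "hedge"
)

// requestContext is a hypothetical stand-in for the protocol request context.
type requestContext struct {
	isHedge bool
}

// MarkAsHedge tags the context before the request is handled.
func (rc *requestContext) MarkAsHedge() { rc.isHedge = true }

// relayType returns the request_type label value for this context.
func (rc *requestContext) relayType() string {
	if rc.isHedge {
		return relayTypeHedge
	}
	return relayTypeNormal
}

// recorder collects what would be emitted as metrics and reputation signals.
type recorder struct {
	relays           map[string]int // relay count per request_type
	latencyPenalties int
	outcomeSignals   int
}

func newRecorder() *recorder {
	return &recorder{relays: make(map[string]int)}
}

// finishRelay records one completed relay.
func (r *recorder) finishRelay(rc *requestContext, slow bool) {
	r.relays[rc.relayType()]++
	// Success/error feedback fires for primary and hedge branches alike.
	r.outcomeSignals++
	// Latency penalties are skipped for hedge branches.
	if slow && !rc.isHedge {
		r.latencyPenalties++
	}
}

func main() {
	rec := newRecorder()

	primary := &requestContext{}
	hedge := &requestContext{}
	hedge.MarkAsHedge()

	rec.finishRelay(primary, true)
	rec.finishRelay(hedge, true)

	fmt.Println("normal relays:", rec.relays[relayTypeNormal])
	fmt.Println("hedge relays:", rec.relays[relayTypeHedge])
	fmt.Println("latency penalties:", rec.latencyPenalties)
	fmt.Println("outcome signals:", rec.outcomeSignals)
}
```

In this sketch a slow primary and a slow hedge produce one relay under each label, one latency penalty (primary only), and two outcome signals.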
Update: bundled hedge-tagging fix into this PR

22-min canary observation surfaced a separate skew the original PR didn't address:
Selection within tier-1 IS uniform random, but

Fix (commit 94283d9)
Dashboard impact
Test coverage
Will be picked up by the scheduled 1h soak check at ~14:29 Europe/Madrid.
🟢 Canary soak verdict: GO for mainnet (with one caveat)

Soak ran from

Go/no-go signals
Anomalies
Recommendation

GO for mainnet deploy. The canonical bug fix is verified live (spacebelt-on-solana leak closed) and no health regressions are visible. Suggest deploying with the same observation pattern: watch

Deploy command (whatever is your standard mainnet promotion):

    # Whichever script promotes the canary image SHA to mainnet

Soak caveat to note

The hedge-tagging commit landed mid-soak. If you want a strict 1h-on-current-binary check before mainnet, wait until ~13:15Z and re-run. If you trust the squash-merge approach (both commits ship together either way), deploying now is fine: the canonical signal is unambiguous.
✅ Mainnet deploy verified healthy

Rolled to mainnet at
The "spacebelt = 0 req/s" canonical check was canary-specific (their score was still 0 on canary at observation time). On mainnet they've already recovered, which is the correct outcome: the fix doesn't suppress legitimate traffic to recovered endpoints, it just stops the fallback bypass when they're below threshold.

Cold-start skew on poly normal-traffic distribution (rpcgate at 61%) is the expected bootstrap artifact and will flatten as reputation equilibrates over the next 30-60 min. Not a regression.

PR can be marked merged. Closing the loop on the canary soak prediction: GO was correct.
… Prometheus metrics (#513)

## Summary

Two metrics that were previously only observable via `/ready` introspection or direct Redis access are now first-class Prometheus gauges:

- **`path_circuit_breaker_state{service_id, domain}`**: 1 if domain is currently locked out, 0 otherwise
- **`path_endpoints_in_cooldown{domain, rpc_type, service_id}`**: count of endpoints currently in strike cooldown

Both metrics fill real operational visibility gaps that came up during PR #512 work: operators could see broken-domain effects in error-rate graphs but couldn't directly answer "which domains are circuit-broken right now?" without Redis access.

## What this enables on dashboards

Once Grafana picks these up:

- **"Currently Broken Domains" table**: list-of-domains view with service_id + domain + state, sortable by service.
- **"Broken Domains" stat tile**: single number for at-a-glance health (0 = healthy, >5 = widespread infra issues).
- **"Cooldown" column** on the Supplier Quality table: operators can see "how many of my endpoints are in cooldown right now?" alongside RPS, success%, etc.

These dashboard panels are not in this PR (kept to the metric-only diff for review clarity); they'll go in a small follow-up commit that updates `local/observability/dashboards/*.json` once this metric is deployed.

## Implementation notes

### `path_circuit_breaker_state`

State transitions in `gateway/domain_circuit_breaker.go`:

| Trigger | Gauge becomes |
|---|---|
| `MarkBroken` | 1 |
| `ClearService` | 0 (for every cleared domain) |
| `refreshLocal` finds expired entries | 0 (for each expired domain) |
| `refreshFromRedis` finds expired entries | 0 (for each), and re-asserts 1 for currently-broken domains so fresh pods that lazily pick up Redis state stay consistent without going through MarkBroken locally |

All gauge sets happen *outside* `cb.mu` to avoid taking the metrics lock under the cache mutex.
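The pattern of updating the gauge only after the cache mutex is released can be sketched like this. It is a simplified illustration under stated assumptions: the `gauge` type is a stand-in for a labelled Prometheus gauge, time is modelled as plain integer ticks, and the `breaker`, `markBroken`, and `refresh` names are hypothetical rather than the identifiers in `gateway/domain_circuit_breaker.go`.

```go
package main

import (
	"fmt"
	"sync"
)

// gauge is a stand-in for a labelled Prometheus gauge with its own lock.
type gauge struct {
	mu     sync.Mutex
	values map[string]float64
}

func newGauge() *gauge { return &gauge{values: make(map[string]float64)} }

func (g *gauge) set(key string, v float64) {
	g.mu.Lock()
	g.values[key] = v
	g.mu.Unlock()
}

func (g *gauge) get(key string) float64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.values[key]
}

// breaker tracks broken domains with an expiry tick per "service/domain" key.
type breaker struct {
	mu      sync.Mutex
	expires map[string]int64
	state   *gauge
}

func newBreaker() *breaker {
	return &breaker{expires: make(map[string]int64), state: newGauge()}
}

// markBroken records the lockout, then sets the gauge after unlocking b.mu.
func (b *breaker) markBroken(key string, ttl, now int64) {
	b.mu.Lock()
	b.expires[key] = now + ttl
	b.mu.Unlock()

	b.state.set(key, 1)
}

// refresh collects expired keys under b.mu and zeroes their gauges afterwards.
func (b *breaker) refresh(now int64) {
	var expired []string

	b.mu.Lock()
	for key, at := range b.expires {
		if now >= at {
			expired = append(expired, key)
			delete(b.expires, key)
		}
	}
	b.mu.Unlock()

	// The metrics lock is never taken while b.mu is held.
	for _, key := range expired {
		b.state.set(key, 0)
	}
}

func main() {
	b := newBreaker()
	key := "poly/example.com"

	b.markBroken(key, 5, 100)
	fmt.Println("after markBroken:", b.state.get(key))

	b.refresh(104)
	fmt.Println("before expiry:", b.state.get(key))

	b.refresh(105)
	fmt.Println("after expiry:", b.state.get(key))
}
```

Collecting the expired keys into a slice while holding `b.mu` and applying the gauge writes afterwards keeps the two locks from ever nesting.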
There is a small inherent staleness window: a circuit-breaker entry whose TTL just expired remains at gauge=1 until the cache TTL elapses (5s default) and `refreshLocal` runs. That's bounded and fine for a dashboard.

### `path_endpoints_in_cooldown`

New `LeaderboardDataProvider.GetCooldownCountData(ctx)` method, implemented on Shannon's `Protocol`. Walks active sessions, fetches each endpoint's reputation score, increments per-(domain, service_id, rpc_type) when `score.IsInCooldown()` returns true.

Published every 10s alongside the existing leaderboard / mean score / supplier score metrics. Resets between snapshots so a domain dropping to zero cooldown'd endpoints shows zero (rather than sticking at its last value via Prometheus' 5-min staleness window).

## Test plan

- [x] `go test ./...`: all green
- [x] `go vet ./...`: clean
- [x] New test: `TestDomainCircuitBreaker_MetricGaugeTransitions` covers MarkBroken → 1, ClearService → 0, TTL expiry + refresh → 0
- [ ] Canary deploy: verify `path_circuit_breaker_state` and `path_endpoints_in_cooldown` series appear in Prometheus
- [ ] Trigger a circuit break (or use `/admin/circuit-breaker/clear/{serviceId}` to test the clear path) and confirm the gauge transitions
- [ ] Verify staleness behavior: entry expires → gauge drops to 0 within ~5s

## Cardinality

Both new metrics are bounded by `services × domains`, which is already low cardinality (~50-200 unique domains × ~80 services on mainnet). No risk of cardinality explosion.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
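The reset-between-snapshots behaviour for `path_endpoints_in_cooldown` can be sketched as below. This is an illustration only: the `cooldownKey` and `endpoint` types, the `countCooldowns` and `publish` functions, and the choice to write an explicit zero for keys missing from the new snapshot are assumptions, and a plain map stands in for the Prometheus gauge vector.

```go
package main

import "fmt"

// cooldownKey mirrors the metric's label set (domain, service_id, rpc_type).
type cooldownKey struct {
	Domain    string
	ServiceID string
	RPCType   string
}

// endpoint is a hypothetical view of one session endpoint.
type endpoint struct {
	Key        cooldownKey
	InCooldown bool
}

// countCooldowns tallies endpoints currently in cooldown per label set.
func countCooldowns(eps []endpoint) map[cooldownKey]int {
	counts := make(map[cooldownKey]int)
	for _, ep := range eps {
		if ep.InCooldown {
			counts[ep.Key]++
		}
	}
	return counts
}

// publish applies a new snapshot to the gauge map. Keys seen in earlier
// snapshots but absent from this one are set to zero instead of keeping
// their last value.
func publish(gauges map[cooldownKey]float64, counts map[cooldownKey]int) {
	for k := range gauges {
		gauges[k] = 0
	}
	for k, n := range counts {
		gauges[k] = float64(n)
	}
}

func main() {
	key := cooldownKey{Domain: "example.com", ServiceID: "poly", RPCType: "json_rpc"}
	gauges := make(map[cooldownKey]float64)

	// First snapshot: two endpoints of the domain are in cooldown.
	publish(gauges, countCooldowns([]endpoint{
		{Key: key, InCooldown: true},
		{Key: key, InCooldown: true},
		{Key: key, InCooldown: false},
	}))
	fmt.Println("snapshot 1:", gauges[key])

	// Second snapshot: all endpoints recovered, so the gauge reads zero.
	publish(gauges, countCooldowns([]endpoint{
		{Key: key, InCooldown: false},
	}))
	fmt.Println("snapshot 2:", gauges[key])
}
```

Without the zeroing pass in `publish`, the second snapshot would leave the gauge at 2 because the recovered domain no longer appears in the counts.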
Summary
When all session endpoints fail QoS validation, PATH fell back to picking a random endpoint from the unfiltered pool. Random selection treated a far-behind endpoint exactly the same as a slightly-behind one, which concentrated fallback traffic on whichever stable single-endpoint suppliers happened to survive in the pool.
We confirmed this in mainnet against `poly`: 50 endpoints, all at score=100 / tier=1 / no cooldowns, but per-endpoint RPS spread from ~8 (20-endpoint domains) to ~384 (1-endpoint domains). Cross-referencing `path_qos_filter_rejection_total` against `/ready/poly?detailed=true` showed the lopsided distribution traced back to the random fallback path being hit when transient block-height-lag observations rejected most candidates simultaneously.

What changes
Replaces `STANDARD_FALLBACK_RANDOM` (and the archival / cosmos / solana equivalents) with `selectLeastStaleEndpoints`:

`perceivedBlockHeight - endpointBlockHeight`

EVM uses the existing URL-block-height map (and extracts it as a shared `buildURLBlockHeightMap` helper, deduplicating two prior call sites). Cosmos and solana use per-endpoint block heights from their respective state structs (`endpoint.checkCometBFTStatus.latestBlockHeight` and `endpoint.SolanaGetEpochInfoResponse.BlockHeight`).

What this does NOT change
`RandomSelectMultiple` because noop has no validation / perceived state / block-height concept. Random there is correct, not a bug.

Test plan
- `go test ./qos/... ./protocol/... -short -count=1`: all green
- `go vet ./qos/...`: clean
- `path_qos_filter_rejection_total{reason="block_height_lag"}` rate is unchanged (filter rejection rate shouldn't move; only the post-rejection routing distribution changes)

🤖 Generated with Claude Code