fix(hermes): bound WS/SSE connection memory via slow-consumer disconnect + 24h deadline by keyvankhademi · Pull Request #3782 · pyth-network/pyth-crosschain

keyvankhademi · 2026-05-29T17:52:45Z

Problem

The Hermes server suffers unbounded memory growth. The cause is WS and SSE streaming connections that stall (or stop reading) and are never reclaimed — their task, broadcast receiver, and buffers stay pinned until the OS TCP timeout (potentially hours), so connections accumulate and memory grows without bound.

The broadcast channel itself is not the leak (it's capacity-1000 with tiny {slot} events). The real issue is that the timeout/deadline logic lived inside a poll path that a stalled client freezes:

WS: when a price update is selected, the loop awaits handle_price_feeds_update().await to completion. A slow consumer parks it in sender.flush().await, so the sibling tokio::select! branches — the 30s ping timeout and the 24h deadline — can never fire. The connection stays pinned indefinitely.
SSE: the 24h deadline and slow-consumer (Lagged) detection lived inside the response-body stream. In axum 0.6 / hyper 0.14 a response body is only polled while the socket can accept bytes, so a client that stops reading freezes the stream entirely — neither the deadline nor the lag check ever runs.

Fix

Implements slow-consumer disconnect and a 24h hard deadline for both transports.

WebSocket (api/ws.rs)

All socket writes go through send/feed/flush/close_to_client helpers wrapped in WS_SEND_TIMEOUT (10s). On timeout → record SlowConsumer and drop the connection. This bounds how long a connection can be parked on a send, which in turn keeps the ping and 24h-deadline select! branches reliably reachable.
24h-deadline closes are now metered (Interaction::ConnectionTimeout) and sent best-effort, so an idle/slow client at the deadline isn't also miscounted as a slow consumer. close_client timeouts (e.g. mass closes on deploy) are no longer mislabeled either.

SSE (api/rest/v2/sse.rs)

Reworked so an independent spawned task owns the broadcast receiver, races a pinned 24h sleep_until against the broadcast, and forwards events into a bounded 16-slot channel (SSE_CHANNEL_CAPACITY) with a 10s send timeout (SSE_SEND_TIMEOUT); the response body is just a ReceiverStream.
A client that stops reading fills the channel → the send times out → the task breaks → the broadcast receiver is freed within ~10s, independent of whether the client ever reads. The 24h deadline is now runtime-driven, so it fires for idle/non-reading clients too.
Slow consumers (broadcast lag or send timeout) are now disconnected with a best-effort final error event, instead of lingering.

Metrics (api/metrics_middleware.rs)

New counters: sse_slow_consumer_disconnects, sse_connection_timeouts (plus the WS SlowConsumer / ConnectionTimeout interaction labels) so the fix is observable in Prometheus.

Behavior change

SSE now disconnects a client that falls >1000 updates behind (previously it emitted an error event and kept streaming). This matches the WS behavior and is documented in the handler doc-comment — clients should reconnect on a Slow consumer / timeout error event.

Validation

cargo check, cargo clippy clean (only pre-existing protobuf warnings), cargo test 33/33 pass.
Two adversarial multi-agent review rounds against the actual axum 0.6.20 / hyper 0.14.28 source confirmed the redesign genuinely bounds memory for the non-reading-client case, with no critical/high defects and no orphan-task or missed-wakeup races.

Notes / possible follow-ups (not in this PR)

For a client that holds the socket open but never reads or sends a RST, the lightweight hyper connection object lingers until the OS TCP timeout; the heavy broadcast receiver is still freed at ~10s. Fully reclaiming it would need socket-level timeouts (TCP keepalive / SO_SNDTIMEO).
No max-connections cap — this PR bounds per-connection lifetime, not connection count (admission control is a separate lever).
The runtime-only teardown paths have no automated test yet (none exist for these handlers); a tokio::time-paused test would be high-value but needs new ApiState test scaffolding.

🤖 Generated with Claude Code

…ect + 24h deadline Hermes WS and SSE connections that stalled or stopped reading were never reclaimed — their task, broadcast receiver, and buffers stayed pinned until the OS TCP timeout (hours), growing server memory without bound. The root cause was timeout/deadline logic living inside a poll path that a stalled client freezes: - WS: when a price update is selected, the loop awaits the handler to completion; a slow consumer parks it in `flush().await`, so the sibling `select!` branches (30s ping, 24h deadline) can never fire. - SSE: the deadline and slow-consumer (Lagged) logic lived in the response body stream, which axum/hyper only polls while the socket can accept bytes, so a non-reading client froze it entirely. Fix: - WS: route every socket write through helpers wrapped in a 10s WS_SEND_TIMEOUT; on timeout, drop the connection (SlowConsumer metric). This bounds parked time so the ping and 24h-deadline branches stay reachable. Meter deadline closes (ConnectionTimeout) and keep best-effort teardown closes from being mislabeled as slow consumers. - SSE: drive the broadcast drain + 24h deadline from an independent spawned task that forwards into a bounded 16-slot channel with a 10s send timeout; the response body is just a ReceiverStream. A client that stops reading fills the channel, the send times out, and the broadcast receiver is freed within ~10s regardless of whether the client reads. Slow consumers (lag or send timeout) are now disconnected instead of lingering. - Add sse_slow_consumer_disconnects and sse_connection_timeouts metrics. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel · 2026-05-29T17:52:51Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
api-reference	Ready	Preview, Comment	May 29, 2026 6:10pm
component-library	Ready	Preview, Comment	May 29, 2026 6:10pm
developer-hub	Ready	Preview, Comment	May 29, 2026 6:10pm
entropy-explorer	Ready	Preview, Comment	May 29, 2026 6:10pm
insights	Error		May 29, 2026 6:10pm
proposals	Ready	Preview, Comment	May 29, 2026 6:10pm
staking	Ready	Preview, Comment	May 29, 2026 6:10pm

chatgpt-codex-connector · 2026-05-29T17:52:52Z

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

devin-ai-integration

Devin Review found 2 potential issues.

View 3 additional findings in Devin Review.

devin-ai-integration · 2026-05-29T17:59:41Z

🚩 No tests for new slow-consumer and timeout logic

Neither the SSE nor the WebSocket modules have test modules (#[cfg(test)]). This PR introduces significant new behavior — spawned producer task, bounded mpsc channel, send timeouts, slow-consumer detection, broadcast lag handling, new metrics — none of which have corresponding tests. Per REVIEW.md: "New functionality should have tests." This is worth raising as a follow-up: integration tests verifying that a slow consumer is disconnected after the timeout, and that metrics are correctly incremented, would substantially increase confidence in these changes.

Was this helpful? React with 👍 or 👎 to provide feedback.

devin-ai-integration · 2026-05-29T17:59:43Z

+                        Some(Err(BroadcastStreamRecvError::Lagged(skipped))) => {
+                            tracing::info!(
+                                skipped,
+                                "SSE client is a slow consumer (broadcast lag); closing connection.",
+                            );
+                            state.metrics.sse_slow_consumer_disconnects.inc();
+                            let _ = tx.try_send(Ok(Event::default().event("error").data(format!(
+                                "Slow consumer: dropped {skipped} updates. Closing connection.",
+                            ))));
+                            break;
+                        }


🚩 Behavioral change: broadcast lag now terminates SSE connections instead of continuing

In the old code, when the broadcast channel lagged (BroadcastStreamRecvError::Lagged), an error event was sent to the client and the stream continued (with gaps in price data). In the new code (sse.rs:166-175), broadcast lag is treated as a slow consumer — the connection is closed after a best-effort error event. This is an intentional behavioral change that forces clients to reconnect for fresh data, but it could be surprising for existing clients that previously handled lag gracefully by just continuing to consume the stream.

Was this helpful? React with 👍 or 👎 to provide feedback.

main already had #3769 ("disconnect slow WS/SSE consumers to prevent OOM"), which solves the same problem this branch does but with a different mechanism (tungstenite write-buffer cap + an RPC_DISCONNECT_SLOW_CONSUMERS config flag + protocol-labelled metrics). Per request, this branch's solution is kept for the overlap: - ws.rs, sse.rs, metrics_middleware.rs: kept this branch's versions (per-write WS_SEND_TIMEOUT; SSE producer task + bounded channel; the sse_slow_consumer_disconnects / sse_connection_timeouts counters). - api.rs, config/rpc.rs, rest.rs: reverted #3769's StreamingConfig scaffolding, since this solution is always-on and does not read it (a config flag that silently did nothing would be a footgun). - Cargo.toml: kept the version bump to 0.11.0; dropped #3769's now-unused direct `tungstenite` dependency (it remains transitively via axum). - network/wormhole.rs: kept #3769's `dead_code` allow (orthogonal CI fix). All other incoming main changes (CI workflow bumps, fortuna/quorum/etc.) are taken as-is. Verified: cargo check + clippy clean, 33/33 tests pass.

vercel Bot deployed to Preview – component-library May 29, 2026 17:53 View deployment

vercel Bot deployed to Preview – entropy-explorer May 29, 2026 17:53 View deployment

vercel Bot deployed to Preview – proposals May 29, 2026 17:53 View deployment

vercel Bot deployed to Preview – staking May 29, 2026 17:53 View deployment

vercel Bot deployed to Preview – api-reference May 29, 2026 17:54 View deployment

vercel Bot had a problem deploying to Preview – insights May 29, 2026 17:54 Failure

vercel Bot deployed to Preview – developer-hub May 29, 2026 17:54 View deployment

devin-ai-integration Bot reviewed May 29, 2026

View reviewed changes

vercel Bot deployed to Preview – component-library May 29, 2026 18:07 View deployment

vercel Bot deployed to Preview – entropy-explorer May 29, 2026 18:07 View deployment

vercel Bot deployed to Preview – staking May 29, 2026 18:08 View deployment

vercel Bot deployed to Preview – proposals May 29, 2026 18:08 View deployment

vercel Bot deployed to Preview – api-reference May 29, 2026 18:08 View deployment

fix(hermes): bump version to 0.12.0 in Cargo.toml and Cargo.lock

7e9dce8

vercel Bot had a problem deploying to Preview – insights May 29, 2026 18:08 Failure

vercel Bot deployed to Preview – component-library May 29, 2026 18:09 View deployment

vercel Bot deployed to Preview – entropy-explorer May 29, 2026 18:10 View deployment

vercel Bot deployed to Preview – proposals May 29, 2026 18:10 View deployment

vercel Bot deployed to Preview – staking May 29, 2026 18:10 View deployment

vercel Bot deployed to Preview – api-reference May 29, 2026 18:10 View deployment

vercel Bot had a problem deploying to Preview – insights May 29, 2026 18:10 Failure

vercel Bot deployed to Preview – developer-hub May 29, 2026 18:10 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(hermes): bound WS/SSE connection memory via slow-consumer disconnect + 24h deadline#3782

fix(hermes): bound WS/SSE connection memory via slow-consumer disconnect + 24h deadline#3782
keyvankhademi wants to merge 3 commits into
mainfrom
hermes-ws

keyvankhademi commented May 29, 2026 •

edited by devin-ai-integration Bot

Loading

Uh oh!

vercel Bot commented May 29, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot commented May 29, 2026

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

devin-ai-integration Bot May 29, 2026

Uh oh!

devin-ai-integration Bot May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

keyvankhademi commented May 29, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Fix

Behavior change

Validation

Notes / possible follow-ups (not in this PR)

Uh oh!

vercel Bot commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

chatgpt-codex-connector Bot commented May 29, 2026

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

devin-ai-integration Bot May 29, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

keyvankhademi commented May 29, 2026 •

edited by devin-ai-integration Bot

Loading

vercel Bot commented May 29, 2026 •

edited

Loading