WebSocket worker freeze #862

Description

@dearkafka

First, thanks for granian – it's been excellent. Big fan of the framework and its design. I'm running a WebSocket server with a few hundred thousand daily users and sessions lasting several minutes.

Under sustained WebSocket load, a granian worker intermittently freezes completely (zero CPU, no Python runs, no access logs, health endpoint stops answering) for tens of seconds to minutes, then either self-recovers or is liveness-killed. Root cause: WebSocket transport teardown takes the transport async mutexes via blocking_lock() on the event-loop thread while holding the GIL, and a send parked against a dead-but-not-RST client holds one of those mutexes for the whole TCP retransmit window.

Environment

granian 2.7.6 (also 2.7.5, 2.7.0 — version-independent), Python 3.10, Linux x86_64, ASGI + uvloop, --task-impl rust, single worker, reproduces at --runtime-threads 1 and 2. ~700 concurrent WebSocket connections, mostly mobile clients.

Symptom

A worker stops doing anything: CPU ≈ 0, no logs, no Python executes, the loop's own HTTP health endpoint stops answering. It stays frozen for tens of seconds to minutes, then recovers on its own (when the kernel finally errors the dead peer's socket) or is killed by the orchestrator's liveness probe.

Captured evidence

py-spy dump during a freeze, every time, shows the event-loop thread active+gil with this as the deepest Python frame:

Thread (MainThread) (active+gil)
    future_watcher (granian/_futures.py:19)   # ← watcher.done()

granian/_futures.py:19 is watcher.done() in the ASGI app wrapper (runs after the endpoint coroutine returns). A native gdb backtrace at the same moment shows every other thread queued on the GIL (PyEval_RestoreThread) and the event-loop thread parked in a futex -- i.e. blocked in native code while holding the GIL, not spinning.

Root cause

done()/err() on the WebSocket watcher tear the transport down via ASGIWebsocketProtocol::tx(), which takes both transport async mutexes synchronously, on the event-loop thread:

// src/asgi/io.rs — ASGIWebsocketProtocol::tx()
let mut ws_rx = self.ws_rx.blocking_lock();   // tokio AsyncMutex — parks this OS thread
let mut ws_tx = self.ws_tx.blocking_lock();   // ← and this one

That thread holds the GIL (it's reached from future_watcher -> watcher.done()). blocking_lock() parks the thread until the mutex is free.

The mutex it waits on is held by an in-flight send:

// src/asgi/io.rs — send_message (sketch)
if let Some(ws) = &mut *(transport.lock().await) {   // holds ws_tx ...
    match ws.send(data).await { ... }                // ... across the real TCP write
}

If the peer stopped reading and the socket send buffer is full, ws.send().await sits in TCP retransmission limbo (~13–30 min on default Linux settings), holding ws_tx the whole time.
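
For a rough sense of where that window comes from, here is a back-of-the-envelope sketch. It assumes the Linux default net.ipv4.tcp_retries2 = 15 and a retransmission timeout (RTO) that starts at the 200 ms floor and doubles up to the 120 s cap; the function name is made up for illustration. Under those assumptions the kernel gives up after roughly 924.6 s (about 15 minutes). A larger measured RTT starts the RTO higher and stretches the window, which fits the range quoted above.

```python
def tcp_retransmit_window(retries: int = 15,
                          rto_min: float = 0.2,
                          rto_max: float = 120.0) -> float:
    """Approximate seconds until the kernel errors an unacknowledged send.

    Assumes the RTO starts at rto_min and doubles after every timeout,
    capped at rto_max. The connection is dropped when the timer expires
    after the last of `retries` retransmissions, so retries + 1 timeouts
    are summed.
    """
    total = 0.0
    rto = rto_min
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max)
    return total
```

Calling tcp_retransmit_window() with the defaults gives the lower bound for a low-latency peer; lowering `retries` shows how much a smaller tcp_retries2 would shorten the freeze.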

So the freeze sequence, as far as I can tell, is:

  1. A connection has a pending, un-cancelled send to a dead/slow client (a per-connection ping or an out-of-band push) holding ws_tx.
  2. The disconnect is observed and the endpoint coroutine returns.
  3. future_watcher -> watcher.done() -> tx() -> ws_tx.blocking_lock() -> the event-loop thread parks, holding the GIL.
  4. Every other Python thread blocks on the GIL; uvloop never runs again; even access logs stop; CPU ≈ 0. The worker is frozen until TCP gives up (or an RST frees the mutex earlier — which is why some freezes self-recover in 24–32 s).
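
The steps above can be mimicked in plain Python. This is an analogy under stated assumptions, not granian's code: a threading.Lock stands in for ws_tx, a sleeping thread stands in for the send parked against the dead peer, and a blocking acquire() on the asyncio loop thread stands in for blocking_lock(). It cannot model the GIL being held, but it shows the loop going silent for as long as the lock holder is stuck. All names are hypothetical.

```python
import asyncio
import threading
import time


def run_freeze_demo(hold_seconds: float = 0.3) -> dict:
    lock = threading.Lock()      # stand-in for the ws_tx mutex
    held = threading.Event()
    ticks = []                   # timestamps of loop "heartbeats"

    def stuck_sender():
        # Stand-in for a send parked against a dead peer:
        # the lock is held across the whole slow operation.
        with lock:
            held.set()
            time.sleep(hold_seconds)

    async def heartbeat():
        # Any other work scheduled on the loop (health endpoint, pings, ...).
        while True:
            ticks.append(time.monotonic())
            await asyncio.sleep(0.01)

    async def main():
        sender = threading.Thread(target=stuck_sender)
        sender.start()
        held.wait()                       # sender now owns the lock
        hb = asyncio.create_task(heartbeat())
        await asyncio.sleep(0.05)         # loop is healthy, heartbeats tick

        start = time.monotonic()
        lock.acquire()                    # teardown: blocking take on the loop thread
        end = time.monotonic()
        lock.release()

        await asyncio.sleep(0.05)         # loop resumes once the lock is free
        hb.cancel()
        try:
            await hb
        except asyncio.CancelledError:
            pass
        sender.join()
        return {
            "blocked_for": end - start,
            "ticks_before_block": sum(1 for t in ticks if t <= start),
            "ticks_during_block": sum(1 for t in ticks if start < t < end),
            "ticks_after_block": sum(1 for t in ticks if t >= end),
        }

    return asyncio.run(main())
```

In this sketch the heartbeat count during the blocked interval is zero and the blocked time tracks `hold_seconds`, the same shape as the freeze: the loop is only as responsive as whoever holds the lock.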

So it seems it's a GIL-holding park on a mutex held across dead-client network I/O. Duration = however long the kernel keeps retrying the peer.

Trigger condition

The trigger is a WebSocket-layer close/disconnect with TCP kept open and non-draining -- the app observes a 1005/no-status close (empty/no-status close frame, or a disconnect with no status), not a TCP RST. (1005 is reserved and never sent on the wire; it's what the library surfaces for these closes.) An RST instead errors the pending send().await first, freeing ws_tx, so it does not freeze. The 1005/no-status case is exactly what mobile clients produce when the OS backgrounds the app.

The HTTP path is unaffected: ASGIHTTPProtocol::tx() is a scoped std::sync::Mutex take, never held across I/O. This is WebSocket-only.

Deterministic confirmation

Reproducible with a single client (python:3.10-slim): an out-of-band task pushes large frames to a connection; the client completes the upgrade and never reads (filling the send buffer so the pusher blocks inside ws.send().await holding ws_tx); the client then closes at the WebSocket layer while keeping TCP open and the handler returns. An HTTP probe confirms the freeze. Frozen on 2.7.0 / 2.7.5 / 2.7.6 and at runtime-threads 1 and 2 (version- and thread-independent); responsive when the client sends an RST instead.
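
The client half of that reproduction can be sketched with the standard library alone. This is an assumption-laden sketch, not the exact script used: the host, port, and path are placeholders, the server-side pusher is not shown, and the small SO_RCVBUF is only there to make the buffers fill sooner. The client completes the upgrade, never reads a frame, then sends a masked close frame with an empty payload (no status code) and keeps the TCP connection open.

```python
import base64
import os
import socket
import time


def upgrade_request(host: str, port: int, path: str) -> bytes:
    """Build a minimal WebSocket upgrade request (RFC 6455 handshake)."""
    key = base64.b64encode(os.urandom(16)).decode()
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}:{port}",
        "Upgrade: websocket",
        "Connection: Upgrade",
        f"Sec-WebSocket-Key: {key}",
        "Sec-WebSocket-Version: 13",
        "",
        "",
    ]
    return "\r\n".join(lines).encode()


def empty_close_frame() -> bytes:
    """Masked close frame with no payload, hence no status code."""
    # 0x88 = FIN + opcode 0x8 (close); 0x80 = mask bit set, payload length 0.
    return bytes([0x88, 0x80]) + os.urandom(4)


def run_client(host: str = "127.0.0.1", port: int = 8000, path: str = "/ws",
               stall: float = 5.0, linger: float = 120.0) -> None:
    # host, port, path and both delays are placeholders for illustration.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A small receive buffer lets the server's send buffer fill up sooner.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
    sock.connect((host, port))
    sock.sendall(upgrade_request(host, port, path))

    # Read only the handshake response, then stop reading for good.
    response = b""
    while b"\r\n\r\n" not in response:
        chunk = sock.recv(1024)
        if not chunk:
            raise ConnectionError("server closed during handshake")
        response += chunk
    if not response.startswith(b"HTTP/1.1 101"):
        raise ConnectionError("upgrade was not accepted")

    time.sleep(stall)                  # pusher fills the buffers meanwhile
    sock.sendall(empty_close_frame())  # WebSocket-layer close, no status
    time.sleep(linger)                 # keep TCP open: no FIN, no RST
    sock.close()
```

Calling run_client() against a server whose handler pushes large frames out-of-band, then probing an HTTP endpoint during the `linger` window, follows the procedure described above.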

I wonder if I'm doing something wrong and this is not how it is supposed to work in the first place. Still, it looks like a bug.
