WebSocket worker freeze #862

First, thanks for granian; it's been excellent, and I'm a big fan of the framework and its design. I'm running a WebSocket server with a few hundred thousand daily users and sessions lasting several minutes.

Under sustained WebSocket load, a granian worker intermittently freezes completely -- zero CPU, no Python runs, no access logs, health endpoint stops answering -- for tens of seconds to minutes, then either self-recovers or is liveness-killed. Root cause: WebSocket transport teardown takes the transport async mutexes via `blocking_lock()` on the event-loop thread while holding the GIL, and a send parked against a dead-but-not-RST client holds one of those mutexes for the whole TCP retransmit window.

## Environment

granian 2.7.6 (also 2.7.5, 2.7.0 — version-independent), Python 3.10, Linux x86_64, ASGI + uvloop, `--task-impl rust`, single worker, reproduces at `--runtime-threads` 1 and 2. ~700 concurrent WebSocket connections, mostly mobile clients.

## Symptom

A worker stops doing anything: CPU ≈ 0, no logs, no Python executes, the loop's own HTTP health endpoint stops answering. It stays frozen for tens of seconds to minutes, then recovers on its own (when the kernel finally errors the dead peer's socket) or is killed by the orchestrator's liveness probe.

## Captured evidence

`py-spy dump` during a freeze, every time, shows the event-loop thread `active+gil` with this as the deepest Python frame:

```
Thread (MainThread) (active+gil)
    future_watcher (granian/_futures.py:19)   # ← watcher.done()
```

`granian/_futures.py:19` is `watcher.done()` in the ASGI app wrapper (runs after the endpoint coroutine returns). A native `gdb` backtrace at the same moment shows every other thread queued on the GIL (`PyEval_RestoreThread`) and the event-loop thread parked in a futex -- i.e. blocked in native code while holding the GIL, not spinning.

## Root cause

`done()`/`err()` on the WebSocket watcher tear the transport down via `ASGIWebsocketProtocol::tx()`, which takes both transport async mutexes synchronously, on the event-loop thread:

```rust
// src/asgi/io.rs — ASGIWebsocketProtocol::tx()
let mut ws_rx = self.ws_rx.blocking_lock();   // tokio AsyncMutex — parks this OS thread
let mut ws_tx = self.ws_tx.blocking_lock();   // ← and this one
```

That thread holds the GIL (it's reached from `future_watcher` -> `watcher.done()`). `blocking_lock()` parks the thread until the mutex is free.

The mutex it waits on is held by an in-flight send:

```rust
// src/asgi/io.rs — send_message (sketch)
if let Some(ws) = &mut *(transport.lock().await) {   // holds ws_tx ...
    match ws.send(data).await { ... }                // ... across the real TCP write
}
```

If the peer stopped reading and the socket send buffer is full, `ws.send().await` sits in TCP retransmission limbo (~13–30 min on default Linux settings), holding `ws_tx` the whole time.
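For a rough sense of that window, here is a minimal arithmetic sketch. It assumes the default `net.ipv4.tcp_retries2 = 15`, an RTO that starts at the 200 ms minimum, and the 120 s RTO cap; these values are assumptions about a stock Linux configuration, not something measured on the affected hosts. The result is only a lower bound, since the real RTO is derived from the measured round-trip time.

```python
def retransmit_timeout(retries=15, rto_min=0.2, rto_max=120.0):
    """Lower bound on how long TCP keeps retrying an unacknowledged segment.

    Sums the exponentially backed-off retransmission timeouts: the initial
    wait plus `retries` retransmissions, each capped at `rto_max` seconds.
    """
    return sum(min(rto_min * 2 ** i, rto_max) for i in range(retries + 1))


total = retransmit_timeout()
print(f"{total:.1f} s (about {total / 60:.1f} min)")
```

With these assumed defaults the sum comes to roughly 15 minutes, which sits inside the 13–30 min range quoted above.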

So the freeze sequence, as far as I can tell, is:

1. A connection has a pending, un-cancelled send to a dead/slow client (a per-connection ping or an out-of-band push) holding `ws_tx`.
2. The disconnect is observed and the endpoint coroutine returns.
3. `future_watcher` -> `watcher.done()` -> `tx()` -> `ws_tx.blocking_lock()` -> the event-loop thread parks, holding the GIL.
4. Every other Python thread blocks on the GIL; uvloop never runs again; even access logs stop; CPU ≈ 0. The worker is frozen until TCP gives up (or an RST frees the mutex earlier, which is why some freezes self-recover in 24–32 s).

In short, this looks like a GIL-holding park on a mutex that is held across network I/O to a dead client. The freeze lasts as long as the kernel keeps retrying the peer.
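The sequence above can be modelled in a few lines of plain Python. This is only a toy sketch, not granian code: two `threading.Lock` objects stand in for the GIL and for `ws_tx`, a `time.sleep` stands in for the send stuck in TCP retransmission, and all function names are hypothetical.

```python
import threading
import time


def model_freeze(stuck_send_seconds):
    """Return how long a bystander thread waits for the stand-in GIL."""
    gil = threading.Lock()     # stand-in for the interpreter lock
    ws_tx = threading.Lock()   # stand-in for the transport mutex
    send_started = threading.Event()
    loop_parked = threading.Event()
    waited = {}

    def sender():
        # Step 1: an in-flight send holds ws_tx across I/O to a dead peer.
        with ws_tx:
            send_started.set()
            time.sleep(stuck_send_seconds)

    def event_loop():
        # Steps 2-3: teardown runs while holding the "GIL" and blocks on ws_tx.
        send_started.wait()
        with gil:
            loop_parked.set()
            with ws_tx:        # plays the role of blocking_lock()
                pass

    def bystander():
        # Step 4: any other thread that needs the "GIL" is stalled too.
        loop_parked.wait()
        start = time.monotonic()
        with gil:
            waited["seconds"] = time.monotonic() - start

    threads = [threading.Thread(target=f) for f in (sender, event_loop, bystander)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return waited["seconds"]


print(f"bystander stalled for {model_freeze(0.5):.2f} s")
```

The bystander's stall tracks the sender's sleep, not anything the bystander itself does, which mirrors how the worker's freeze duration is set by the kernel's retry window.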

## Trigger condition

The trigger is a WebSocket-layer close/disconnect with TCP kept open and non-draining -- the app observes a 1005/no-status close (empty/no-status close frame, or a disconnect with no status), not a TCP RST. (1005 is reserved and never sent on the wire; it's what the library surfaces for these closes.) An RST instead errors the pending `send().await` first, freeing `ws_tx`, so it does *not* freeze. The 1005/no-status case is exactly what mobile clients produce when the OS backgrounds the app.
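For reference, a no-status close on the wire is just a close frame with an empty payload. Below is a minimal sketch of building such a client frame per RFC 6455 (client frames are masked; control-frame payloads are limited to 125 bytes). The helper name `client_close_frame` is hypothetical and is not part of any library mentioned here.

```python
import os
import struct


def client_close_frame(code=None, reason=b"", mask_key=None):
    """Build a masked client-to-server close frame.

    With code=None the payload is empty, which the receiving side
    surfaces as close code 1005 (no status received).
    """
    if code in (1005, 1006, 1015):
        raise ValueError("reserved close code; must not be sent on the wire")
    payload = b"" if code is None else struct.pack("!H", code) + reason
    if len(payload) > 125:
        raise ValueError("control frame payload is limited to 125 bytes")
    key = mask_key if mask_key is not None else os.urandom(4)
    masked = bytes(b ^ key[i % 4] for i, b in enumerate(payload))
    # 0x88 = FIN + opcode 0x8 (close); 0x80 = MASK bit, low bits = payload length
    return bytes([0x88, 0x80 | len(payload)]) + key + masked


# Empty close frame: 2 header bytes plus the 4-byte masking key.
print(client_close_frame().hex())
```

Sending this frame and then leaving the socket open without reading is the shape of the trigger described above; closing the socket with an RST instead takes the non-freezing path.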

The HTTP path is unaffected: `ASGIHTTPProtocol::tx()` is a scoped `std::sync::Mutex` take, never held across I/O. This is WebSocket-only.

## Deterministic confirmation

Reproducible with a single client (`python:3.10-slim`): an out-of-band task pushes large frames to a connection; the client completes the upgrade and never reads (filling the send buffer so the pusher blocks inside `ws.send().await` holding `ws_tx`); the client then closes at the WebSocket layer while keeping TCP open, and the handler returns. An HTTP probe confirms the freeze. Frozen on 2.7.0 / 2.7.5 / 2.7.6 and at `--runtime-threads` 1 and 2 (version- and thread-independent); responsive when the client sends an RST instead.
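The HTTP probe can be as simple as the sketch below. It is an illustration, not the exact script used: the `/health` path and the function name `probe` are assumptions. A frozen worker still has its listening socket, so the kernel accepts the connection, but no response bytes ever arrive and the read times out.

```python
import socket


def probe(host, port, path="/health", timeout=2.0):
    """Classify the server as 'ok', 'frozen' (accepts but never answers), or 'down'."""
    request = (
        f"GET {path} HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
    ).encode()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            data = sock.recv(1024)
    except socket.timeout:
        # Connected (or connecting) but nothing came back within the timeout.
        return "frozen"
    except OSError:
        # Connection refused, reset, or otherwise unreachable.
        return "down"
    return "ok" if data.startswith(b"HTTP/") else "down"
```

Running it in a loop before and after the client's WebSocket-layer close shows the transition from `ok` to `frozen`, and back to `ok` once the pending send finally errors.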

I wonder whether I'm doing something wrong and this is not how it is supposed to work in the first place, but it still looks like a bug.

