
feat(api): XIP-83 d14n binding — bidirectional Subscribe on QueryApi#2020

Merged
tylerhawkes merged 3 commits into main from tyler/xip-83-d14n-subscribe on Jun 26, 2026

Conversation

@tylerhawkes tylerhawkes commented Jun 18, 2026

Contributor

XIP-83 d14n binding — bidirectional QueryApi.Subscribe

Implements the decentralized (d14n) binding of XIP-83 mutable subscription streams on xmtpd. A single long-lived bidirectional stream lets a client mutate its topic set in place (add/remove subscriptions) and keep the connection alive with ping/pong — instead of tearing down and reopening a stream every time group membership changes.

This is the xmtpd counterpart to the node-go v3 MlsApi.Subscribe binding. Same control protocol (Mutate / Started / CatchupComplete / TopicsLive / Ping / Pong); the d14n binding delivers OriginatorEnvelopes and uses per-originator vector cursors.

Design: gated async catch-up

  • Single-writer + single-sender actor per subscription. All topic-set mutation happens on the writer goroutine; all stream sends happen on the sender goroutine (drained via a bounded queue).
  • Per-topic gate + pending buffer. A newly-added topic is caught up by an async fetcher goroutine while already-live topics keep flowing — a slow catch-up never head-of-line-blocks live delivery.
  • No missing messages across the switch. A topic is registered live before its catch-up snapshot is taken, and delivery is deduped by a monotonic per-originator cursor (advanceTopicCursors). Anything that lands during catch-up is either in the snapshot or arrives live; the dedup drops the overlap.
  • Sparse vs. filled cursors. The writer keeps only the originator cursors the client provided (growing them as new originators are seen); the fetcher holds a filled copy purely for the query. Keeps memory bounded at the active-topic ceiling.
  • 1M active-topic ceiling (maxActiveSubscribeTopics) — ~16–32MB for a consumer that large, deliberately favored over forcing clients into multiple connections (which would invite rate-limiting and wreck some topologies).
  • Liveness. Either peer may Ping; the receiver must Pong the nonce. No pong within the deadline → the node reaps the vanished peer (e.g. a mobile client the OS suspended behind a proxy that still ACKs the transport).
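
The dedup step in the "No missing messages" bullet can be pictured roughly as follows. This is a minimal sketch, not the handler's code: the `envelope` type, the `cursor` map, and the `shouldDeliver` helper are hypothetical names, and only the idea (a monotonic per-originator cursor that drops the overlap between the catch-up snapshot and live delivery) comes from the description above.

```go
package main

import "fmt"

// envelope is a hypothetical stand-in for an OriginatorEnvelope: it carries
// only the originator node ID and that originator's sequence ID.
type envelope struct {
	originatorID uint32
	sequenceID   uint64
}

// cursor maps originator ID -> highest sequence ID already delivered.
type cursor map[uint32]uint64

// shouldDeliver reports whether env is new for this topic and, if so,
// advances the cursor. Anything at or below the cursor is a duplicate.
func (c cursor) shouldDeliver(env envelope) bool {
	if last, ok := c[env.originatorID]; ok && env.sequenceID <= last {
		return false
	}
	c[env.originatorID] = env.sequenceID
	return true
}

func main() {
	c := cursor{}
	// The catch-up snapshot delivers 1..3; the live path then repeats 3 and adds 4.
	stream := []envelope{{100, 1}, {100, 2}, {100, 3}, {100, 3}, {100, 4}}
	for _, e := range stream {
		fmt.Println(e.sequenceID, c.shouldDeliver(e))
	}
}
```

Because the topic is registered live before the snapshot is taken, the same envelope can arrive on both paths; the cursor check is what makes that overlap harmless.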

Commits

  1. feat(api): add mutableSubscription handle to subscribeWorker — in-place topic add/remove on the existing subscribe worker, guarded by topicsMu.
  2. feat(api): XIP-83 bidirectional Subscribe handler on QueryApi + xmtpv4 protos — the handler (subscribe.go) + regenerated xmtpv4 protos.
  3. test(api): Subscribe handler tests + configurable keepalive interval for tests — 5 race-tested scenarios + a test-only knob to shorten the keepalive cadence.

Tests

go test -race ./pkg/api/message — all green:

  • TestSubscribe_CatchUpThenLive
  • TestSubscribe_MutateRemoveStopsDelivery
  • TestSubscribe_HalfCloseHistoryOnlyDrains
  • TestSubscribe_HistoryOnlyOnLiveRejected
  • TestSubscribe_NoPongIsReaped

Dependency & merge ordering

Built on xmtp/proto#338 (QueryApi.Subscribe). The committed .pb.go here is generated from that branch via dev/gen/protos; once #338 merges, regenerating from proto main produces byte-identical output.

Note: proto#338's Test (Golang) check cross-compiles xmtpd main against the new proto and will stay red until this handler lands on main (xmtpd's *Service must implement Subscribe). Recommended order: merge proto#338 (its red Golang check is expected), then this.

🤖 Generated with Claude Code

Note

Add bidirectional Subscribe RPC to QueryApi with mutable topic membership and catch-up history

  • Implements the XIP-83 Subscribe bidi-streaming RPC in pkg/api/message/subscribe.go: clients send Mutate frames to add/remove topics at runtime; the server delivers history catch-up, TopicsLive, and CatchupComplete markers per mutate, then routes live envelopes.
  • Introduces mutableSubscription in pkg/api/message/mutable_subscription.go so topic registration is updated in-place without reconnecting; add/remove are idempotent and safe under concurrent worker reaping.
  • Adds subscribeSession state machine managing ping/pong keepalive, backpressure-bounded sending, pending-byte caps, in-flight wave limits, and graceful half-close drain.
  • Fixes closeListener in pkg/api/message/subscribe_worker.go to set the closed flag under topicsMu before channel close, preventing re-registration races and send-on-closed-channel bugs.
  • Adds generated protobuf, gRPC, and Connect-Go stubs for SubscribeRequest/SubscribeResponse and related message types across pkg/proto/.

Macroscope summarized c44972f.

@tylerhawkes tylerhawkes requested a review from a team as a code owner June 18, 2026 22:26
@octane-security-app


Summary by Octane

New Contracts

  • mutable_subscription.go: The smart contract manages dynamic topic subscriptions for message streams, allowing topics to be added or removed in real-time.
  • subscribe.go: The smart contract facilitates bidirectional mutable subscriptions, allowing topic-based stream management with live updates and history retrieval.

Updated Contracts

  • listener.go: Added a mutex for thread-safe topic management in the listener to handle concurrency issues.
  • subscribe_worker.go: Concurrent access protection added for topic mutations using locking, ensuring safe modifications.
  • message_api.pb.go: The smart contract adds new subscription functionalities with "V1 Mutate" and "Ping/Pong" features.
  • query_api.connect.go: Introduced XIP-83 bidirectional mutable subscriptions for dynamic topic management, requiring HTTP/2.
  • query_api.pb.go: Added new "Subscribe" method to the smart contract API.
  • query_api_grpc.pb.go: Added a bidirectional Subscribe method, allowing clients to maintain mutable subscriptions with real-time topic adjustments.
  • api.go: Added configurable keep-alive interval for server ping/pong handling in test API server.

🔗 Commit Hash: c44b65c

Comment thread pkg/api/message/subscribe.go
Comment on lines +47 to +56
	m.l.topicsMu.Lock()
	defer m.l.topicsMu.Unlock()
	if m.l.closed {
		return
	}
	m.worker.topicListeners.addListener(keys, m.l)
	for k := range keys {
		m.l.topics[k] = struct{}{}
	}
}

Contributor

🔴 Critical message/mutable_subscription.go:47

mutableSubscription.addTopics checks m.l.closed while holding topicsMu, but subscribeWorker.closeListener sets l.closed = true without taking that lock. A concurrent mutate can pass the stale closed check and re-add the listener to topicListeners after its channel is closed, causing dispatchToListeners to panic on send.

 func (m *mutableSubscription) addTopics(keys map[string]struct{}) {
 	if len(keys) == 0 {
 		return
 	}
-	m.l.topicsMu.Lock()
-	defer m.l.topicsMu.Unlock()
-	if m.l.closed {
+	m.worker.topicListeners.mu.RLock()
+	defer m.worker.topicListeners.mu.RUnlock()
+	m.l.topicsMu.Lock()
+	defer m.l.topicsMu.Unlock()
+	if m.l.closed {
 		return
 	}
 	m.worker.topicListeners.addListener(keys, m.l)

Contributor Author

Fixed. closeListener now sets l.closed under l.topicsMu, so a concurrent mutableSubscription.addTopics observes it atomically with its addListener call and can no longer re-register a listener after its channel is closed. Added a -race regression test (TestMutableSubscriptionAddRaceWithClose) that hammers addTopics against the reap.
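
The locking rule described in this reply can be sketched as below. It is an illustration under assumptions, not the code in subscribe_worker.go: the trimmed-down `listener` type, `newListener`, and `shutdown` are hypothetical, and a single mutex `mu` plays the role of topicsMu.

```go
package main

import (
	"fmt"
	"sync"
)

// listener is a hypothetical, trimmed-down stand-in for the worker's listener.
type listener struct {
	mu     sync.Mutex // plays the role of topicsMu
	closed bool
	topics map[string]struct{}
	ch     chan string
}

func newListener() *listener {
	return &listener{topics: map[string]struct{}{}, ch: make(chan string, 1)}
}

// addTopics registers topics unless the listener is already closed. The
// closed check and the registration happen under the same lock.
func (l *listener) addTopics(keys ...string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.closed {
		return false
	}
	for _, k := range keys {
		l.topics[k] = struct{}{}
	}
	return true
}

// shutdown sets closed under the lock before closing the channel, so no
// later addTopics can re-register a listener whose channel is gone.
func (l *listener) shutdown() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.closed {
		return
	}
	l.closed = true
	close(l.ch)
}

func main() {
	l := newListener()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			l.addTopics(fmt.Sprintf("topic-%d", i))
		}(i)
	}
	l.shutdown() // races with the adds; safe because both sides take l.mu
	wg.Wait()
	fmt.Println("add after shutdown accepted:", l.addTopics("late"))
}
```

Running this under `go run -race` is a quick way to see that the flag is never read or written outside the lock.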

Comment thread pkg/api/message/subscribe.go
Comment on lines +368 to +373
	wire := make([][]byte, 0, len(w.topics))
	for _, t := range w.topics {
		if w.historyOnly {
			wire = append(wire, t.wire)
			continue
		}

Contributor

🟡 Medium message/subscribe.go:368

handleCatchUp appends history_only topics to wire at line 371 and emits TopicsLive for them at line 392, but these topics are never added to liveTopics so no live envelopes will actually arrive. A client receiving the TopicsLive frame incorrectly believes these topics are now live. Consider skipping the TopicsLive announcement entirely for history_only waves.

 	wire := make([][]byte, 0, len(w.topics))
 	for _, t := range w.topics {
 		if w.historyOnly {
-			wire = append(wire, t.wire)
 			continue
 		}

Contributor Author

Working as specified, leaving as-is. The proto contract for the history_only field explicitly mandates these markers — it says the adds are caught up via "history, TopicsLive markers, and the wave's CatchupComplete" — and TopicsLive is documented informational-only ("delivery correctness (no duplicates, no gaps) never depends on it"). A client that opted into history_only interprets TopicsLive as the "everything as of now" boundary, not "a live tail follows"; it learns there is no live tail because it sent history_only (and typically half-closes). Suppressing the marker here would make this binding diverge from the field contract and from the v3 MlsApi binding.

@macroscopeapp

macroscopeapp Bot commented Jun 18, 2026

Contributor

Approvability

Verdict: Needs human review

3 blocking correctness issues found. This PR introduces a significant new bidirectional Subscribe RPC feature (~1500 lines) with complex concurrency patterns. Additionally, there are unresolved review comments including a critical race condition that could cause panics, and the author does not own any of the modified files.


@octane-security-app


Overview

Vulnerabilities found: 18
Severity breakdown: 6 High, 1 Medium, 6 Low, 5 Informational
Warnings found: 1

Detailed findings

pkg/api/message/listener.go

  • Data race on listener.closed and non-idempotent removal in XIP-83 mutable subscription causes durable listener leaks and node-level DoS risk.
  • Write-locked O(N) listener removal in mutable Subscribe (teardown/removeTopics) causes server-wide stalls of subscription mutations.

pkg/api/message/subscribe.go

  • Per-send uncancelable timer allocation (time.After) in QueryApi.Subscribe causes resource exhaustion/DoS risk.
  • Excessive live-topic budget and missing rate limits in QueryApi.Subscribe cause low-effort node DoS via memory exhaustion.
  • Missing rate limiting and cursor monotonicity in QueryApi.Subscribe mutable subscriptions causes unmetered in-place history replays and node resource exhaustion.
  • Unbatched, size-unchecked flush in mutable Subscribe catch-up in QueryApi.Subscribe causes subscriber stream abort (DoS).
  • Unmetered bidi QueryApi.Subscribe without rate limiting and silent ignore of unknown request types causes single-node volumetric DoS.
  • Overlapping catch-up waves and missing rate limiting in QueryApi.Subscribe cause resource-exhaustion risk.
  • TTL-cached originator enumeration in XIP-83 Subscribe catch-up causes missed historical topic envelopes.
  • Count-based buffering of large stream elements in QueryApi.Subscribe causes single-node memory exhaustion/DoS.
  • Keepalive Ping/Pong not suspended after client half-close in bidi Subscribe causes premature stream aborts during bounded sync.
  • Non-atomic mutate handling in XIP-83 Subscribe causes unexpected partial application and stream teardown.
  • Immediate per-topic state removal in mutable Subscribe (pkg/api/message/subscribe.go) causes loss of in-flight envelopes at unsubscribe boundary.
  • Lack of per-topic wave serialization and mixed dedup state in mutable Subscribe causes duplicate and out-of-order delivery.
  • Half-close + flush path not draining live producer channel in bidi Subscribe causes tail live envelopes to be omitted at stream close.
  • Missing sender error check in mutable Subscribe flush() causes false-OK termination and missed final frames.

pkg/api/server.go

  • HTTP/2 per-stream read timeout from http.Server.ReadTimeout in API server causes long‑lived QueryApi.Subscribe streams to be force‑closed and enables DoS via reconnection churn.

pkg/interceptors/server/rate_limit.go

  • Unbounded concurrent catch-up waves and missing opens limit in QueryApi.Subscribe cause node-level DoS risk.

Warnings

pkg/api/message/subscribe.go

  • removes-first mutate ordering in Subscribe allows same-frame history_only reset causing silent loss of live delivery.

🔗 Commit Hash: c44b65c
🛡️ Octane Dashboard: All vulnerabilities

@tylerhawkes tylerhawkes force-pushed the tyler/xip-83-d14n-subscribe branch from c44b65c to c8131f9 Compare June 19, 2026 06:40
@tylerhawkes

Contributor Author

Pushed an update addressing the Macroscope findings (3 fixed, 1 working-as-specified per the history_only proto contract — see inline replies).

Beyond those, I did a deeper self-review of the concurrent paths and hardened several edge cases, each with a regression test:

  • Liveness: decoupled the pong-reap deadline from the delivery/ping cadence (a single shared ticker meant steady outbound traffic could postpone the reap indefinitely; long half-close drains could be falsely reaped). Ping cadence now tracks real sends; the reap deadline is independent and suppressed during half-close.
  • Re-add: a plain re-add of an already-live topic (no remove) is now a no-op instead of resetting the cursor and replaying — replay still requires remove+re-add.
  • Teardown: flush() now propagates a sender send-error instead of returning a false OK with the wave's terminal frames undelivered; the stream context is cancelable so catch-up fetchers tear down promptly.
  • Writer latency: the per-Mutate originator-list lookup (a DB round-trip on a cache miss) moved off the single writer goroutine into the catch-up fetcher, so a slow DB can no longer stall liveness or false-reap a healthy stream.
  • Overlap: a second overlapping catch-up for the same topic is now rejected (including via remove+re-add over an in-flight history_only wave).
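
The Liveness bullet above can be illustrated with a small sketch. It assumes a single outstanding ping and uses hypothetical names throughout (`liveness`, `onPingSent`, `shouldReap`, the `at` helper, and the 5-second `pongDeadline` value); it is not the handler's implementation, only one way to keep the reap deadline independent of outbound traffic and suppressed after half-close.

```go
package main

import (
	"fmt"
	"time"
)

// pongDeadline is an illustrative value, not the handler's real setting.
const pongDeadline = 5 * time.Second

// liveness tracks at most one unanswered ping. Hypothetical names throughout.
type liveness struct {
	pingSentAt time.Time // zero value means no ping is outstanding
	halfClosed bool      // set once the client half-closes the stream
}

// onPingSent starts the reap clock only if no ping is already outstanding,
// so later pings or other outbound traffic cannot push the deadline back.
func (l *liveness) onPingSent(now time.Time) {
	if l.pingSentAt.IsZero() {
		l.pingSentAt = now
	}
}

// onPong clears the outstanding ping.
func (l *liveness) onPong() { l.pingSentAt = time.Time{} }

// shouldReap reports whether the peer missed the pong deadline. It is
// suppressed after half-close, when the client can no longer send a pong.
func (l *liveness) shouldReap(now time.Time) bool {
	if l.halfClosed || l.pingSentAt.IsZero() {
		return false
	}
	return now.Sub(l.pingSentAt) >= pongDeadline
}

// at builds a deterministic timestamp n seconds after an arbitrary start.
func at(n int64) time.Time { return time.Unix(1_000_000+n, 0) }

func main() {
	var l liveness
	l.onPingSent(at(0))
	l.onPingSent(at(4)) // a second ping does not reset the deadline
	fmt.Println(l.shouldReap(at(4)), l.shouldReap(at(5)))
	l.halfClosed = true
	fmt.Println(l.shouldReap(at(60)))
}
```

Passing the clock in as an argument keeps the check deterministic, which is convenient for the kind of regression test mentioned above.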

go test -race ./pkg/api/message is green; golangci-lint clean.

Comment thread pkg/api/message/subscribe.go
Comment thread pkg/api/message/subscribe.go
Comment thread pkg/api/message/subscribe.go Outdated
@tylerhawkes tylerhawkes force-pushed the tyler/xip-83-d14n-subscribe branch from c8131f9 to a58bba6 Compare June 22, 2026 20:31
@tylerhawkes

Contributor Author

Response to the Octane security review

Thanks — these were valuable. Note Octane scanned commit c44b65c (the first push); several findings were already fixed by a deep-review pass in the current commit, and I've now fixed the rest that are real. Triage of all 18:

✅ Fixed in this push (in direct response to Octane + Macroscope)

  • Per-send time.After allocation → send() now reuses one per-session *time.Timer instead of allocating a timer (live for the whole keepalive) on every send.
  • Unbatched / size-unchecked flush → sendEnvelopes now splits into ≤2 MiB frames, so a large catch-up page or a flushed pending buffer can never go out as one oversized, stream-aborting frame.
  • Cursor int32/int64 overflow (Macroscope too) → out-of-range LastSeen entries are now rejected with InvalidArgument instead of being silently dropped / poisoning the live cursor.
  • Catch-up spin on an unparseable row (Macroscope too) → pagination cursors now advance from raw rows, guaranteeing forward progress.
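
The frame-splitting fix listed above can be sketched as follows. This is an assumption-laden illustration: `splitFrames` and `maxFrameBytes` are hypothetical names, the size accounting counts raw payload bytes rather than encoded protobuf size, and putting an oversized single envelope in its own frame is this sketch's choice, not something the PR states.

```go
package main

import "fmt"

// maxFrameBytes mirrors the 2 MiB limit from the description; the name is hypothetical.
const maxFrameBytes = 2 << 20

// splitFrames groups envelopes, in order, into frames whose total payload
// stays at or below limit. An envelope that alone exceeds limit is placed
// in a frame of its own.
func splitFrames(envs [][]byte, limit int) [][][]byte {
	var frames [][][]byte
	var cur [][]byte
	size := 0
	for _, e := range envs {
		// Close the current frame if adding e would push it over the limit.
		if len(cur) > 0 && size+len(e) > limit {
			frames = append(frames, cur)
			cur, size = nil, 0
		}
		cur = append(cur, e)
		size += len(e)
	}
	if len(cur) > 0 {
		frames = append(frames, cur)
	}
	return frames
}

func main() {
	mib := make([]byte, 1<<20)
	frames := splitFrames([][]byte{mib, mib, mib}, maxFrameBytes)
	for i, f := range frames {
		fmt.Printf("frame %d: %d envelopes\n", i, len(f))
	}
}
```

Order is preserved across frames, so splitting does not interact with the per-originator cursor dedup.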

✅ Already fixed in the current commit (Octane scanned the pre-fix c44b65c)

  • Data race on listener.closed + non-idempotent removal → closeListener now sets closed under topicsMu; race test added.
  • Keepalive ping/pong not suspended after half-close → ping send + reap are gated on !halfClosed.
  • Non-atomic mutate handling → handleMutate fully validates before applying.
  • Lack of per-topic wave serialization / mixed dedup (dupes + out-of-order) → each gated topic records its owning wave (catchingUp map[key]int); a stale wave can't open a reset topic's gate, and its in-flight pages are dropped.
  • Missing sender error check in flush() → flush() now drains sendErrCh and propagates instead of a false OK.
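
"Fully validates before applying" in the non-atomic mutate item can be pictured as a two-phase update. The sketch below is hypothetical (the `session` and `mutate` types, `applyMutate`, and both error values are invented for illustration) and copies the topic set for simplicity, which a handler sized for a million topics would likely avoid.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errEmptyTopic    = errors.New("empty topic")
	errTooManyTopics = errors.New("too many active topics")
)

// mutate is a hypothetical stand-in for a Mutate frame.
type mutate struct {
	adds    []string
	removes []string
}

// session holds the active topic set and its ceiling.
type session struct {
	topics    map[string]struct{}
	maxTopics int
}

// applyMutate validates the whole frame against a projected copy and only
// then commits, so a rejected frame leaves the session untouched.
func (s *session) applyMutate(m mutate) error {
	// Phase 1: build the projected set; s.topics is not modified here.
	projected := make(map[string]struct{}, len(s.topics))
	for t := range s.topics {
		projected[t] = struct{}{}
	}
	for _, t := range m.removes { // removes are processed first
		if t == "" {
			return errEmptyTopic
		}
		delete(projected, t)
	}
	for _, t := range m.adds {
		if t == "" {
			return errEmptyTopic
		}
		projected[t] = struct{}{}
	}
	if len(projected) > s.maxTopics {
		return errTooManyTopics
	}
	// Phase 2: commit in one step.
	s.topics = projected
	return nil
}

func main() {
	s := &session{topics: map[string]struct{}{"a": {}}, maxTopics: 2}
	err := s.applyMutate(mutate{adds: []string{"b", "c"}})
	fmt.Println(err, len(s.topics)) // rejected; state unchanged
	err = s.applyMutate(mutate{removes: []string{"a"}, adds: []string{"b", "c"}})
	fmt.Println(err, len(s.topics))
}
```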

⚖️ By design (with rationale)

  • 1M live-topic budget + "missing rate limits" → the 1M cap is intentional (~16–32 MB; a herald wants one fat connection, not fan-out that just hits per-connection limits). Request-rate is bounded by the existing API rate-limit interceptor; the per-topic memory ceiling and the 64 MiB pending cap are the resource bounds.
  • In-place "history replays" / cursor monotonicity → a plain re-add of an already-live topic is now a no-op (no replay); replay requires explicit remove+re-add. So unmetered replay-by-re-add is gone.
  • Silent ignore of unknown request types → forward-compat by design: unknown response/request features are gated behind the Started.capabilities advertisement (a client must not send an unadvertised type); an unknown version arm IS rejected with InvalidArgument.
  • Immediate per-topic removal drops in-flight envelopes → intended: removing a topic means stop delivering it, including buffered/in-flight envelopes.
  • Half-close doesn't drain the live producer tail → the half-close drain finishes in-flight catch-up waves then closes; a client that half-closed for a bounded sync isn't expecting a live tail.
  • removes-first ordering allows same-frame history_only reset → the client explicitly asked for history_only on that topic; and remove+re-add over an in-flight history_only wave is now rejected (added this pass + test).

📋 Acknowledged — shared infra / deployment (out of scope for this PR, flagged for follow-up)

  • Write-locked O(N) listener removal and count-based l.ch buffering live in the shared subscribeWorker/listenersMap dispatch layer (predates this PR); the mutable subscription exercises removal more, so a finer-grained or sharded listener map is a worthwhile follow-up.
  • http.Server.ReadTimeout force-closing long-lived streams is a server-config concern that affects every streaming RPC (including the existing SubscribeTopics), not this handler — worth confirming the deploy config exempts bidi/streaming.
  • TTL-cached originator enumeration — same eventual-consistency note as the Macroscope history_only thread; live waves are unaffected (topic-based gate).
  • Unbounded concurrent catch-up waves — bounded today by the API request-rate limiter (each wave = one Mutate); a per-stream in-flight-wave cap would be reasonable defense-in-depth if we want it.

go test -race ./pkg/api/message is green; golangci-lint clean; new regression tests cover the cursor-overflow, catch-up-spin, frame-splitting, and the already-fixed concurrency/teardown items.

Comment thread pkg/api/message/subscribe.go
Comment thread pkg/api/message/mutable_subscription.go
Comment thread pkg/api/message/subscribe.go Outdated
…lace topic add/remove

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@tylerhawkes tylerhawkes force-pushed the tyler/xip-83-d14n-subscribe branch from a58bba6 to eccc068 Compare June 24, 2026 06:08
@tylerhawkes

Contributor Author

Handler state-management redesigned to make the recurring bug-classes structurally impossible

Across the prior review rounds (Macroscope ×3 + a deep multi-agent pass), the findings kept clustering into the same two structural causes rather than independent bugs. Instead of patching symptoms again, I restructured the handler so those classes can't be expressed:

1. Per-topic state was four parallel maps (catchingUp, cursors, liveTopics, pending), kept consistent by hand at every call site — so each new code path was a fresh chance to update one and forget another (most of the recurring findings). Collapsed into one map[string]*topicState, a small gated → live state machine; every transition goes through a method (gateTopic / bufferLive / flushAndGoLive / removeTopicState) that moves the whole struct together. The parallel waveCursors map folded onto subscribeWave.cursors (history_only waves only).

2. The sender result was a (sendErrCh, cap 1) + senderDone pair with a non-blocking re-read in flush — the source of the false-OK class, plus a latent single-consumer race on the buffered channel. Collapsed into a single broadcast-on-close contract (senderDone + a sendErr field): the sender records its terminal status, then closes; every reader learns the outcome by observing the close and reading the field.
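
The single-map state machine from item 1 can be sketched like this. The method names (gateTopic, bufferLive, flushAndGoLive, removeTopicState) are taken from the comment above, but their signatures, the `phase` type, and the string-typed envelopes are assumptions made for illustration.

```go
package main

import "fmt"

type phase int

const (
	gated phase = iota // catch-up in progress; live envelopes are buffered
	live               // gate open; live envelopes flow straight through
)

// topicState keeps everything about one topic in one struct.
type topicState struct {
	phase   phase
	wave    int      // catch-up wave that owns the gate
	pending []string // live envelopes held back while gated
}

type topics map[string]*topicState

// gateTopic (re)creates the topic in the gated phase, owned by wave.
func (ts topics) gateTopic(key string, wave int) {
	ts[key] = &topicState{phase: gated, wave: wave}
}

// bufferLive holds env back if the topic is gated and reports whether it did.
func (ts topics) bufferLive(key, env string) bool {
	st, ok := ts[key]
	if !ok || st.phase != gated {
		return false
	}
	st.pending = append(st.pending, env)
	return true
}

// flushAndGoLive opens the gate only for the owning wave and returns the
// buffered envelopes. A stale wave gets nothing and changes nothing.
func (ts topics) flushAndGoLive(key string, wave int) ([]string, bool) {
	st, ok := ts[key]
	if !ok || st.phase != gated || st.wave != wave {
		return nil, false
	}
	out := st.pending
	st.pending, st.phase = nil, live
	return out, true
}

// removeTopicState drops the whole struct, buffered envelopes included.
func (ts topics) removeTopicState(key string) { delete(ts, key) }

func main() {
	ts := topics{}
	ts.gateTopic("t", 1)
	ts.bufferLive("t", "e1")
	ts.removeTopicState("t")
	ts.gateTopic("t", 2) // remove+re-add: wave 2 now owns the gate
	_, ok := ts.flushAndGoLive("t", 1)
	fmt.Println("stale wave opened gate:", ok)
	ts.bufferLive("t", "e2")
	out, ok := ts.flushAndGoLive("t", 2)
	fmt.Println(out, ok)
}
```

Since every transition replaces or mutates one struct, there is no second map that a new code path could forget to update.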

All previously-flagged fixes are folded in: removeListener idempotency, flush returning CodeCanceled on ctx-cancel, cursor-overflow rejection, frame-splitting, and the in-flight history_only overlap guard.
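
The broadcast-on-close sender contract from item 2 above can be sketched as below. The `sender` type, `finish`, `wait`, and `errSendFailed` are hypothetical names; only the pattern (record the terminal status, then close a channel that every reader waits on) is taken from the description.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errSendFailed = errors.New("send failed")

// sender publishes its terminal status by closing done.
type sender struct {
	done chan struct{} // closed exactly once, when the sender exits
	err  error         // written before close(done); read only after <-done
}

func newSender() *sender { return &sender{done: make(chan struct{})} }

// finish records the terminal status, then closes done. The write to err
// happens before the close, so any reader that observes the close sees it.
func (s *sender) finish(err error) {
	s.err = err
	close(s.done)
}

// wait blocks until the sender has finished and returns its status. Any
// number of goroutines may call it; none of them consumes the result.
func (s *sender) wait() error {
	<-s.done
	return s.err
}

func main() {
	s := newSender()
	var wg sync.WaitGroup
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			fmt.Printf("reader %d saw: %v\n", id, s.wait())
		}(i)
	}
	s.finish(errSendFailed)
	wg.Wait()
}
```

Unlike a buffered error channel of capacity 1, a closed channel can be observed by every reader, which is why a second reader cannot come away with a false OK.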

Review

Two independent multi-agent review rounds (5 dimensions each; every finding adversarially verified by 3 skeptics). The topicState invariants, the new sender contract, the cap/pendingBytes arithmetic, and wave-ownership all got clean sign-offs. Confirmed issues across both rounds, all addressed:

  • A pre-existing teardown Send-race — the bounded wait on senderDone abandons a sender wedged inside connect's non-cancelable stream.Send on a non-reading client, so Close can race the response writer. It's a connect-go limitation (an unbounded wait would hang the handler instead); documented honestly in the teardown comment and tracked separately.
  • An unbounded in-flight-waves DoS gap (new, found by the structural review): the active-topic cap didn't bound concurrent catch-up fetcher goroutines, and remove + re-add churn nets zero. Fixed with a maxInflightSubscribeWaves cap + test.
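
A cap on in-flight catch-up waves, as in the second finding, might look like the sketch below. The PR names the real constant maxInflightSubscribeWaves but does not give its value here, so `maxInflightWaves = 2`, the `waveLimiter` type, and its methods are all illustrative. No mutex is used, on the assumption that only the single writer goroutine touches the counter.

```go
package main

import "fmt"

// maxInflightWaves is an illustrative value, not the handler's real cap.
const maxInflightWaves = 2

// waveLimiter counts catch-up waves whose fetchers are still running.
type waveLimiter struct{ inflight int }

// tryStart admits a new wave unless the cap has been reached.
func (w *waveLimiter) tryStart() bool {
	if w.inflight >= maxInflightWaves {
		return false
	}
	w.inflight++
	return true
}

// finish releases a slot when a wave's fetcher completes.
func (w *waveLimiter) finish() {
	if w.inflight > 0 {
		w.inflight--
	}
}

func main() {
	var w waveLimiter
	// Remove + re-add churn keeps the topic count flat but starts a new
	// wave each time, so the wave count is what has to be capped.
	for i := 0; i < 3; i++ {
		fmt.Printf("wave %d admitted: %v\n", i, w.tryStart())
	}
	w.finish()
	fmt.Println("after one wave finishes:", w.tryStart())
}
```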

Build clean, tests race-clean, golangci-lint v2.10.1: 0 issues. Note: the line anchors on the earlier inline comments have moved with the rewrite — replies inline.

Comment thread pkg/api/message/subscribe.go
Comment thread pkg/api/message/subscribe.go
tylerhawkes and others added 2 commits June 24, 2026 16:59
…4 protos

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…for tests

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@tylerhawkes tylerhawkes merged commit ac17e82 into main Jun 26, 2026
19 checks passed
@tylerhawkes tylerhawkes deleted the tyler/xip-83-d14n-subscribe branch June 26, 2026 19:52