feat(xmtp_mls): KeyPackageCleaner sleeps to the exact rotation deadline (no poll)#3791
feat(xmtp_mls): KeyPackageCleaner sleeps to the exact rotation deadline (no poll)#3791insipx wants to merge 4 commits into
Conversation
…t overflow gloo-timers)
|
Claude finished @insipx's task —— View job Code Review CompleteReview date: 2026-06-30 15:50 UTC SummaryThis is a solid architectural improvement that eliminates ~82M unnecessary wake cycles per day. The implementation is well-structured with good abstractions ( Issues Found1. Race condition in receiver lock acquisition (key_package_cleaner.rs:140-141) The worker acquires the receiver lock once at startup and holds it for the entire lifetime of the loop. If another task (e.g., a test or monitoring code) tries to lock the receiver, it will block indefinitely even though the worker only needs the lock during Fix: Acquire the lock per-iteration instead of holding it across the loop: loop {
let mut receiver = self.context.key_package_channels().receiver.lock().await;
while receiver.try_recv().is_ok() {}
drop(receiver); // Release lock before doing work
let db = self.context.db();
// ... rest of logic
}2. Potential integer underflow in sleep duration (key_package_cleaner.rs:161-163) If the system clock moves backward or there's a delay between computing the deadline and calling Fix: Do the comparison before the cast: let dur = std::time::Duration::from_nanos(
std::cmp::max(0, deadline - xmtp_common::time::now_ns()) as u64
);3. Missing error context in maintain() (key_package_cleaner.rs:184-217) The Suggestion: Add a trace-level log when returning early: if expired.is_empty() && !rotate_due {
tracing::trace!("No maintenance needed (no expired packages, rotation not due)");
return Ok(());
}Minor ObservationsRearmChannel design (rearm_channel.rs)
WASM sleep chunking (time.rs:88-110)
Test coverage
Performance & Security✅ Performance: Eliminates 2,000 DB queries/sec and 82M spans/day - excellent optimization RecommendationApprove after addressing issue #1 (race condition). Issues #2 and #3 are optional improvements that would increase robustness but aren't blockers. |
…ge_rotation_ns, min_key_package_delete_at_ns)
…_package_worker; nudge after queue_key_rotation
3eac7ed to
282a725
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3791 +/- ##
==========================================
+ Coverage 84.40% 84.55% +0.15%
==========================================
Files 409 411 +2
Lines 60138 61146 +1008
==========================================
+ Hits 50759 51702 +943
- Misses 9379 9444 +65 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
ApprovabilityVerdict: Needs human review This PR fundamentally changes the KeyPackageCleaner worker from polling every 5 seconds to deadline-based sleeping (potentially for days). While the author owns all modified files and the code is well-tested, the significant change to this worker's execution model and the importance of key package rotation for messaging security warrants human review. You can customize Macroscope's approvability policy. Learn more. |
…eadline (kill 5s poll); span-gated maintain
282a725 to
8bbf20a
Compare
| /// JS `setTimeout` (via gloo-timers) casts the millisecond value to `i32`, | ||
| /// so any duration > ~24.8 days overflows and fires immediately. Chunking into | ||
| /// at-most-1-day pieces lets callers sleep the full requested duration. | ||
| #[allow(dead_code)] // only called from wasm arm; also exercised by unit tests |
There was a problem hiding this comment.
This should be gated for wasm if that's all it's used for
| impl<Context> KeyPackagesCleanerWorker<Context> | ||
| where | ||
| Context: XmtpSharedContext + 'static, | ||
| { | ||
| async fn run(&mut self) -> Result<(), KeyPackagesCleanerError> { | ||
| let (base, jitter) = self | ||
| .context | ||
| .worker_interval(WorkerKind::KeyPackageCleaner, INTERVAL_DURATION); | ||
| let mut intervals = xmtp_common::time::jittered_interval_stream(base, jitter); | ||
| while (intervals.next().await).is_some() { | ||
| self.tick().await?; | ||
| let receiver = self.context.key_package_channels().receiver.clone(); | ||
| let mut receiver = receiver.lock().await; | ||
| loop { | ||
| // Drain any pending re-arm signals before computing the plan so we | ||
| // don't lose a wakeup that arrived while we were working. | ||
| while receiver.try_recv().is_ok() {} | ||
|
|
||
| let db = self.context.db(); | ||
| let next_rotation = db | ||
| .next_key_package_rotation_ns() | ||
| .map_err(KeyPackagesCleanerError::Metadata)?; | ||
| match plan( | ||
| next_rotation, | ||
| db.min_key_package_delete_at_ns() | ||
| .map_err(KeyPackagesCleanerError::Metadata)?, | ||
| xmtp_common::time::now_ns(), | ||
| ) { | ||
| WakePlan::RunNow => { | ||
| self.maintain(next_rotation).await?; | ||
| } | ||
| WakePlan::SleepUntil(deadline) => { | ||
| let dur = std::time::Duration::from_nanos( | ||
| (deadline - xmtp_common::time::now_ns()).max(0) as u64, | ||
| ); | ||
| tokio::select! { | ||
| // Re-arm signal: recompute the deadline, do NOT run work. | ||
| // `None` means every sender was dropped (context torn down) — | ||
| // stop rather than busy-spin on a closed channel. | ||
| msg = receiver.recv() => { if msg.is_none() { return Ok(()); } } | ||
| // Deadline elapsed: time to do maintenance. | ||
| () = xmtp_common::time::sleep(dur) => { | ||
| self.maintain(next_rotation).await?; | ||
| } | ||
| } | ||
| } | ||
| } |
There was a problem hiding this comment.
What's preventing this whole worker from being dropped and getting implemented as a generic task? Anytime the key package gets rotated it could get rescheduled and then when it actually kicks off it would end up rescheduling itself.
There was a problem hiding this comment.
I have an implementation here that does this: #3795 but it touches more code, and there are some tricky edge cases with getting the worker to wake within X sec. time of getting a welcome, while also juggling deadlines for key package rotation and deletion
There was a problem hiding this comment.
it's also entirely possible I need to rethink this KP stuff entirely. it seems we actually want two isolated worker tasks, key package cleaning, and key package rotation they don't necessarily have to be coupled as they are now. maybe that gives way to some other race conditions though but i'll investigate.
Draft — alternative to #3788 implementing @tylerhawkes's review feedback. Opening for review, not yet ready to merge.
What
Replaces KeyPackageCleaner's 5-second poll with a worker that sleeps until the exact next rotation deadline (
next_key_package_rotation_ns) — no fixed-interval poll, so sleeping apps aren't woken for work whose time we already know.This is the approach from the #3788 review:
Why
The 5s poll cost ~82M empty
worker_turnspans/day (99.9994% ofworker_turnvolume) + ~2,000 no-op DB queries/sec at 5,000 clients/process. Real work is ~monthly (30-day rotation; deletion ~1 day after a KP is superseded).How
xmtp_common::time::sleepis now wasm-safe — it chunks long durations internally so a multi-day sleep doesn't overflowgloo-timers(which casts ms toi32for JSsetTimeout; a 30-day sleep would otherwise wrap and fire immediately). Benefits every long sleep in the codebase.next_key_package_rotation_ns. A pureWakePlan { RunNow, SleepUntil(deadline) }decides; onRunNow(NULL/past deadline) it runs maintenance, onSleepUntilitselect!s a re-arm-channel recompute vssleep(deadline - now) → maintain(). Existing code already rolls the rotation deadline 30 days forward after each rotation, so the worker self-perpetuates. Deletion of expired KPs is opportunistic in the work pass (latency-tolerant; local-only with a grace).Client::queue_key_rotation(welcome + public) lowers the deadline tonow+5sand nudges the worker via awake_key_package_worker()facade → the parked worker recomputes and rotates ~5s later.maintain()— oneworker_turnspan only when there's real work, wrapping it so failures are recorded.WorkerKind::KeyPackageCleanerand registration are kept (no bindings break); supervisor restart is the failure backstop.Relationship to #3788
#3788 is the 1-hour-fallback version of this worker; this is the exact-deadline version (no fallback poll). If #3788 lands first, this becomes a small follow-up swapping the interval for the deadline sleep.
MLS safety
Last-resort reusable KPs; on-network expiry ~90d vs 30d rotation = 60-day window; deletion local-only where late deletion is strictly safe. A worker waking at the rotation deadline (or any time in the window) is safe.
🤖 Generated with Claude Code
Note
Replace fixed-interval polling in
KeyPackageCleanerwith deadline-driven sleep and rearm channelKeyPackageCleanerworker now queries the soonest rotation and deletion deadlines from the DB and sleeps precisely until that deadline, replacing a fixed-interval poll loop.RearmChannel(capacity-1 mpsc) allows callers to wake the worker early;queue_key_rotationandprocess_new_welcomenow signal it immediately after queuing work.WakePlanenum and pureplan()function encapsulate the deadline-selection logic, making it independently testable.i32::MAXms) are chunked into 1-day pieces via a newsleep_chunkshelper to avoid overflow-induced immediate timeouts.next_key_package_rotation_ns()onidentityandmin_key_package_delete_at_ns()onkey_package_history.Macroscope summarized 8bbf20a.