feat(xmtp_mls): KeyPackageCleaner sleeps to the exact rotation deadline (no poll)#3791

Draft
insipx wants to merge 4 commits into main from insipx/kp-exact-deadline-worker

@insipx insipx commented Jun 27, 2026

Contributor

Draft — alternative to #3788 implementing @tylerhawkes's review feedback. Opening for review, not yet ready to merge.

What

Replaces KeyPackageCleaner's 5-second poll with a worker that sleeps until the exact next rotation deadline (next_key_package_rotation_ns) — no fixed-interval poll, so sleeping apps aren't woken for work whose time we already know.

This is the approach from the #3788 review:

"This can all be done on the existing worker with a specific timeout in the future instead of regular polling. If they expire after 90 days, we update the task to roll it to 30 days out every time we create one — as long as the app wakes up sometime in the 60-day window it gets rolled."

Why

The 5s poll cost ~82M empty worker_turn spans/day (99.9994% of worker_turn volume) + ~2,000 no-op DB queries/sec at 5,000 clients/process. Real work is ~monthly (30-day rotation; deletion ~1 day after a KP is superseded).

How

  • xmtp_common::time::sleep is now wasm-safe — it chunks long durations internally so a multi-day sleep doesn't overflow gloo-timers (which casts ms to i32 for JS setTimeout; a 30-day sleep would otherwise wrap and fire immediately). Benefits every long sleep in the codebase.
  • The worker sleeps to next_key_package_rotation_ns. A pure WakePlan { RunNow, SleepUntil(deadline) } decides; on RunNow (NULL/past deadline) it runs maintenance, on SleepUntil it select!s a re-arm-channel recompute vs sleep(deadline - now) → maintain(). Existing code already rolls the rotation deadline 30 days forward after each rotation, so the worker self-perpetuates. Deletion of expired KPs is opportunistic in the work pass (latency-tolerant; local-only with a grace).
  • The ~5s welcome-queued rotation still fires promptly: Client::queue_key_rotation (welcome + public) lowers the deadline to now+5s and nudges the worker via a wake_key_package_worker() facade → the parked worker recomputes and rotates ~5s later.
  • Span-gated maintain() — one worker_turn span only when there's real work, wrapping it so failures are recorded.

WorkerKind::KeyPackageCleaner and registration are kept (no bindings break); supervisor restart is the failure backstop.
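
For readers who want to see the shape of the wake decision, here is a minimal sketch of a pure `plan()` function. The `WakePlan` variants and the three arguments follow the description and the review snippet; the `Option<i64>` types and the rule of sleeping until the sooner of the rotation and deletion deadlines are assumptions for illustration, not the PR's confirmed logic.

```rust
/// What the worker should do next (variant names taken from the PR description).
#[derive(Debug, PartialEq, Eq)]
enum WakePlan {
    RunNow,
    SleepUntil(i64),
}

/// ASSUMPTION: the signature and the "take the sooner deadline" rule are
/// guesses made for this sketch; the real function may differ.
fn plan(next_rotation_ns: Option<i64>, min_delete_at_ns: Option<i64>, now_ns: i64) -> WakePlan {
    // No rotation deadline recorded (NULL in the DB): run maintenance now.
    let Some(rotation) = next_rotation_ns else {
        return WakePlan::RunNow;
    };
    // Fold in the earliest pending deletion, if any.
    let deadline = min_delete_at_ns.map_or(rotation, |delete_at| rotation.min(delete_at));
    if deadline <= now_ns {
        WakePlan::RunNow
    } else {
        WakePlan::SleepUntil(deadline)
    }
}

fn main() {
    let now = 1_000;
    println!("{:?}", plan(None, None, now));
    println!("{:?}", plan(Some(500), None, now));
    println!("{:?}", plan(Some(9_000), Some(4_000), now));
}
```

Because the function takes `now_ns` as an argument instead of reading the clock, every branch can be unit-tested without timers or a database.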

Relationship to #3788

#3788 is the 1-hour-fallback version of this worker; this is the exact-deadline version (no fallback poll). If #3788 lands first, this becomes a small follow-up swapping the interval for the deadline sleep.

MLS safety

Last-resort reusable KPs; on-network expiry ~90d vs 30d rotation = 60-day window; deletion local-only where late deletion is strictly safe. A worker waking at the rotation deadline (or any time in the window) is safe.

🤖 Generated with Claude Code

Note

Replace fixed-interval polling in KeyPackageCleaner with deadline-driven sleep and rearm channel

  • The KeyPackageCleaner worker now queries the soonest rotation and deletion deadlines from the DB and sleeps precisely until that deadline, replacing a fixed-interval poll loop.
  • A new RearmChannel (capacity-1 mpsc) allows callers to wake the worker early; queue_key_rotation and process_new_welcome now signal it immediately after queuing work.
  • A WakePlan enum and pure plan() function encapsulate the deadline-selection logic, making it independently testable.
  • On WASM, long sleeps (>JS i32::MAX ms) are chunked into 1-day pieces via a new sleep_chunks helper to avoid overflow-induced immediate timeouts.
  • Two new DB queries are added: next_key_package_rotation_ns() on identity and min_key_package_delete_at_ns() on key_package_history.
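
The capacity-1 re-arm behaviour can be illustrated with a small std-only sketch. It swaps in `std::sync::mpsc::sync_channel` for the async mpsc channel the PR uses, and the `nudge` and `drain` method names are hypothetical. The point it shows is that repeated nudges coalesce into a single pending wake-up.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Std-only stand-in for the PR's RearmChannel (ASSUMPTION: the real type
/// wraps an async mpsc channel and has a different API).
struct RearmChannel {
    sender: SyncSender<()>,
    receiver: Receiver<()>,
}

impl RearmChannel {
    fn new() -> Self {
        // Buffer of exactly one pending signal.
        let (sender, receiver) = sync_channel(1);
        Self { sender, receiver }
    }

    /// Returns true if a new signal was queued, false if one was already
    /// pending (the extra nudge is dropped, which is the lossy part).
    fn nudge(&self) -> bool {
        self.sender.try_send(()).is_ok()
    }

    /// Drains pending signals and returns how many were waiting.
    fn drain(&self) -> usize {
        let mut drained = 0;
        while self.receiver.try_recv().is_ok() {
            drained += 1;
        }
        drained
    }
}

fn main() {
    let channel = RearmChannel::new();
    println!("first nudge queued: {}", channel.nudge());
    println!("second nudge queued: {}", channel.nudge());
    println!("signals drained: {}", channel.drain());
}
```

Dropping the extra nudges is safe in this design because a single wake-up is enough: the worker recomputes its plan from the database rather than from the message contents.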

Macroscope summarized 8bbf20a.

claude Bot commented Jun 27, 2026

Claude finished @insipx's task —— View job


Code Review Complete

Review date: 2026-06-30 15:50 UTC

Summary

This is a solid architectural improvement that eliminates ~82M unnecessary wake cycles per day. The implementation is well-structured with good abstractions (RearmChannel, WakePlan), comprehensive test coverage, and proper handling of both native and WASM environments. I found a few minor issues worth addressing before merge.


Issues Found

1. Race condition in receiver lock acquisition (key_package_cleaner.rs:140-141)

The worker acquires the receiver lock once at startup and holds it for the entire lifetime of the loop. If another task (e.g., a test or monitoring code) tries to lock the receiver, it will block indefinitely even though the worker only needs the lock during recv() operations.

Fix: Acquire the lock per-iteration instead of holding it across the loop:

loop {
    let mut receiver = self.context.key_package_channels().receiver.lock().await;
    while receiver.try_recv().is_ok() {}
    drop(receiver); // Release lock before doing work
    
    let db = self.context.db();
    // ... rest of logic
}

2. Potential integer underflow in sleep duration (key_package_cleaner.rs:161-163)

If the system clock moves backward or there's a delay between computing the deadline and calling now_ns(), the subtraction could produce a negative value. While .max(0) catches this, casting a negative i64 to u64 first will wrap to a huge value before the max.

Fix: Do the comparison before the cast:

let dur = std::time::Duration::from_nanos(
    std::cmp::max(0, deadline - xmtp_common::time::now_ns()) as u64
);
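
As a sanity check on the ordering concern in issue 2: in Rust, a method call binds more tightly than `as`, so an expression of the form `(a - b).max(0) as u64` (the form shown in the review thread on this PR) clamps before it casts. The sketch below compares that form with the suggested `std::cmp::max` form; the function names are hypothetical. The third variant uses `saturating_sub` as an optional hardening against `i64` overflow in the subtraction itself, which neither of the other two forms addresses.

```rust
/// Form quoted in the review thread: `.max(0)` runs first, then the cast.
fn remaining_ns_method(deadline: i64, now: i64) -> u64 {
    (deadline - now).max(0) as u64
}

/// Form suggested in the review: same clamp-then-cast order.
fn remaining_ns_cmp(deadline: i64, now: i64) -> u64 {
    std::cmp::max(0, deadline - now) as u64
}

/// Optional hardening (not from the PR): the subtraction cannot overflow.
fn remaining_ns_saturating(deadline: i64, now: i64) -> u64 {
    deadline.saturating_sub(now).max(0) as u64
}

fn main() {
    // Deadline already in the past: both forms clamp to zero.
    println!("{}", remaining_ns_method(5, 10));
    println!("{}", remaining_ns_cmp(5, 10));
    // Deadline in the future: both forms return the gap.
    println!("{}", remaining_ns_method(10, 5));
    // Extreme inputs: only the saturating form is safe to call here.
    println!("{}", remaining_ns_saturating(i64::MIN, 1));
}
```

If the worker code matches the quoted snippet, the two forms behave identically, so the rewrite would be a readability choice and not a bug fix.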

3. Missing error context in maintain() (key_package_cleaner.rs:184-217)

The maintain() function creates a span but doesn't log when it exits early with Ok(()) at line 198 (no work needed). This makes it hard to distinguish between "worker woke but nothing to do" vs "worker didn't wake" when debugging.

Suggestion: Add a trace-level log when returning early:

if expired.is_empty() && !rotate_due {
    tracing::trace!("No maintenance needed (no expired packages, rotation not due)");
    return Ok(());
}

Minor Observations

RearmChannel design (rearm_channel.rs)

  • The capacity-1 lossy behavior is appropriate for this use case and well-documented
  • Consider making the receiver field private and adding a lock_receiver() method to enforce encapsulation, but current design is acceptable for internal use

WASM sleep chunking (time.rs:88-110)

  • Excellent fix for the JS setTimeout i32 overflow issue
  • The 1-day chunk size is a good balance (small enough to avoid i32 overflow, large enough to minimize loop overhead)
  • Test coverage for sleep_chunks() is thorough
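
To make the overflow concrete: `i32::MAX` milliseconds is about 24.86 days, so a 30-day delay (2,592,000,000 ms) wraps to a negative number when cast to `i32`. The sketch below shows that arithmetic together with one possible shape for a chunking helper. The name `sleep_chunks` comes from the PR summary, but its real signature is not shown on this page; returning a `Vec<Duration>` is an assumption made so the result is easy to inspect.

```rust
use std::time::Duration;

/// Largest single timer the sketch will schedule (1 day, per the PR summary).
const MAX_CHUNK: Duration = Duration::from_secs(24 * 60 * 60);

/// ASSUMPTION: hypothetical signature. Splits `total` into pieces of at most
/// one day each; the pieces sum to exactly `total`.
fn sleep_chunks(total: Duration) -> Vec<Duration> {
    let mut chunks = Vec::new();
    let mut remaining = total;
    while remaining > MAX_CHUNK {
        chunks.push(MAX_CHUNK);
        remaining -= MAX_CHUNK;
    }
    if !remaining.is_zero() {
        chunks.push(remaining);
    }
    chunks
}

fn main() {
    let thirty_days_ms: u64 = 30 * 24 * 60 * 60 * 1000;
    // The truncating cast that a 32-bit millisecond timer argument implies.
    println!("30 days as i32 ms: {}", thirty_days_ms as i32);
    println!("i32::MAX ms in days: {:.2}", i32::MAX as f64 / 86_400_000.0);

    let chunks = sleep_chunks(Duration::from_millis(thirty_days_ms));
    println!("chunks for 30 days: {}", chunks.len());
}
```

Each chunk is 86,400,000 ms, far below the `i32` limit, so every individual timer is safe regardless of the total duration.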

Test coverage

  • Integration test queue_key_rotation_wakes_worker_and_rotates() provides good end-to-end validation
  • Unit tests for plan() function cover all edge cases well
  • 94.03% patch coverage is strong; the 12 missing lines appear to be error paths and edge cases

Performance & Security

Performance: Eliminates 2,000 DB queries/sec and 82M spans/day - excellent optimization
Security: No secrets exposed; key package deletion is local-only with appropriate grace period
Correctness: The 60-day window between rotation (30d) and on-network expiry (90d) provides ample safety margin


Recommendation

Approve after addressing issue #1 (race condition). Issues #2 and #3 are optional improvements that would increase robustness but aren't blockers.


Comment thread crates/xmtp_mls/src/worker/key_package_cleaner.rs
insipx added 2 commits June 26, 2026 23:42
…ge_rotation_ns, min_key_package_delete_at_ns)
…_package_worker; nudge after queue_key_rotation
@insipx insipx force-pushed the insipx/kp-exact-deadline-worker branch from 3eac7ed to 282a725 Compare June 27, 2026 03:42
codecov Bot commented Jun 27, 2026

Codecov Report

❌ Patch coverage is 93.96985% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.55%. Comparing base (82a7c3b) to head (8bbf20a).
⚠️ Report is 6 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/xmtp_db/src/encrypted_store/identity.rs | 72.72% | 3 Missing ⚠️ |
| ...xmtp_db/src/encrypted_store/key_package_history.rs | 75.00% | 3 Missing ⚠️ |
| crates/xmtp_mls/src/worker/key_package_cleaner.rs | 97.39% | 3 Missing ⚠️ |
| crates/xmtp_mls/src/worker/rearm_channel.rs | 88.88% | 3 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3791      +/-   ##
==========================================
+ Coverage   84.40%   84.55%   +0.15%     
==========================================
  Files         409      411       +2     
  Lines       60138    61146    +1008     
==========================================
+ Hits        50759    51702     +943     
- Misses       9379     9444      +65     

@insipx insipx marked this pull request as ready for review June 30, 2026 14:15
@insipx insipx requested a review from a team as a code owner June 30, 2026 14:15
macroscopeapp Bot commented Jun 30, 2026

Contributor

Approvability

Verdict: Needs human review

This PR fundamentally changes the KeyPackageCleaner worker from polling every 5 seconds to deadline-based sleeping (potentially for days). While the author owns all modified files and the code is well-tested, the significant change to this worker's execution model and the importance of key package rotation for messaging security warrants human review.

@insipx insipx force-pushed the insipx/kp-exact-deadline-worker branch from 282a725 to 8bbf20a Compare June 30, 2026 15:49
/// JS `setTimeout` (via gloo-timers) casts the millisecond value to `i32`,
/// so any duration > ~24.8 days overflows and fires immediately. Chunking into
/// at-most-1-day pieces lets callers sleep the full requested duration.
#[allow(dead_code)] // only called from wasm arm; also exercised by unit tests

Contributor

This should be gated for wasm if that's all it's used for
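
One way to act on this suggestion is to compile the helper only for wasm targets and for unit tests, instead of silencing the warning with `#[allow(dead_code)]`. The sketch below is an illustration under assumptions: `chunk_count` and `timers_needed` are hypothetical stand-ins for the real helper and its caller, and only the `cfg` attributes are the point.

```rust
use std::time::Duration;

const DAY_SECS: u64 = 24 * 60 * 60;

/// Compiled only for wasm targets and unit tests, so native non-test builds
/// never contain it and no dead-code allowance is needed.
/// Sketch simplification: counts whole seconds only, rounded up to 1-day timers.
#[cfg(any(target_arch = "wasm32", test))]
fn chunk_count(total: Duration) -> u64 {
    (total.as_secs() + DAY_SECS - 1) / DAY_SECS
}

/// wasm arm: one timer per chunk.
#[cfg(target_arch = "wasm32")]
fn timers_needed(total: Duration) -> u64 {
    chunk_count(total)
}

/// Native arm: a single timer covers the whole duration.
#[cfg(not(target_arch = "wasm32"))]
fn timers_needed(_total: Duration) -> u64 {
    1
}

fn main() {
    let thirty_days = Duration::from_secs(30 * DAY_SECS);
    println!("timers needed on this target: {}", timers_needed(thirty_days));
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn thirty_days_needs_thirty_chunks() {
        assert_eq!(chunk_count(Duration::from_secs(30 * DAY_SECS)), 30);
    }
}
```

Including `test` in the `cfg` keeps the helper reachable from native unit tests, which matches the note in the original attribute comment that the tests exercise it.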

Comment on lines 135 to +175
 impl<Context> KeyPackagesCleanerWorker<Context>
 where
     Context: XmtpSharedContext + 'static,
 {
     async fn run(&mut self) -> Result<(), KeyPackagesCleanerError> {
-        let (base, jitter) = self
-            .context
-            .worker_interval(WorkerKind::KeyPackageCleaner, INTERVAL_DURATION);
-        let mut intervals = xmtp_common::time::jittered_interval_stream(base, jitter);
-        while (intervals.next().await).is_some() {
-            self.tick().await?;
+        let receiver = self.context.key_package_channels().receiver.clone();
+        let mut receiver = receiver.lock().await;
+        loop {
+            // Drain any pending re-arm signals before computing the plan so we
+            // don't lose a wakeup that arrived while we were working.
+            while receiver.try_recv().is_ok() {}
+
+            let db = self.context.db();
+            let next_rotation = db
+                .next_key_package_rotation_ns()
+                .map_err(KeyPackagesCleanerError::Metadata)?;
+            match plan(
+                next_rotation,
+                db.min_key_package_delete_at_ns()
+                    .map_err(KeyPackagesCleanerError::Metadata)?,
+                xmtp_common::time::now_ns(),
+            ) {
+                WakePlan::RunNow => {
+                    self.maintain(next_rotation).await?;
+                }
+                WakePlan::SleepUntil(deadline) => {
+                    let dur = std::time::Duration::from_nanos(
+                        (deadline - xmtp_common::time::now_ns()).max(0) as u64,
+                    );
+                    tokio::select! {
+                        // Re-arm signal: recompute the deadline, do NOT run work.
+                        // `None` means every sender was dropped (context torn down) —
+                        // stop rather than busy-spin on a closed channel.
+                        msg = receiver.recv() => { if msg.is_none() { return Ok(()); } }
+                        // Deadline elapsed: time to do maintenance.
+                        () = xmtp_common::time::sleep(dur) => {
+                            self.maintain(next_rotation).await?;
+                        }
+                    }
+                }
+            }

Contributor

What's preventing this whole worker from being dropped and getting implemented as a generic task? Anytime the key package gets rotated it could get rescheduled and then when it actually kicks off it would end up rescheduling itself.

Contributor Author

I have an implementation that does this in #3795, but it touches more code, and there are some tricky edge cases in getting the worker to wake within X seconds of receiving a welcome while also juggling the deadlines for key package rotation and deletion.

@insipx insipx Jun 30, 2026

Contributor Author

It's also entirely possible I need to rethink this KP stuff altogether. It seems we actually want two isolated worker tasks, key package cleaning and key package rotation; they don't necessarily have to be coupled as they are now. Maybe that gives way to some other race conditions, though, but I'll investigate.

@insipx insipx marked this pull request as draft June 30, 2026 18:57

2 participants