
Fix multi-device clock synchronization (sanity-gate, discipline, consensus) #186

Merged: cboulay merged 9 commits into master from 185-multi-device-clock-sync on Jun 20, 2026


@cboulay cboulay commented Jun 20, 2026

Collaborator

Fixes #185.

Problem

On a multi-device Gemini rig (NSP + HUB1 + HUB2 sharing one PTP clock), the device (PTP) → host time mapping was unreliable. The NSP could intermittently latch a bad or stale peer offset and report host timestamps as the raw device/PTP clock (~1.77e18 ns). Even when it did not, devices sharing one clock disagreed on cold start, by up to 269 ms in the worst case measured.

Root causes (all in the offset-commit path):

  • An externally-borrowed peer offset was adopted with unconditional priority and no sanity bound, and once set was never re-evaluated (latched for the session).
  • The SDK borrowed from the first HUB that opened, once, and stopped refreshing after adoption.
  • probeSpreadOk() reported 1–2 probes as "reliable".
  • The committed offset jumped to whatever the latest sample selected, so jitter/transients skewed it.

Fix (per-device → cross-device)

  1. Sanity-gate borrowed offsets + require ≥3 probes: an external offset is a candidate, adopted only if consistent with the device's own evidence and re-checked every sample (can't latch); a stale/wrong-epoch value is rejected.
  2. Borrow from all HUBs, refreshed: lowest-uncertainty wins, retried each cycle.
  3. Commit discipline: deadband (snap <1 ms), slew (≤50 ms), step-with-persistence (large), plus a stepout backstop for a stuck offset with no peer to break the tie.
  4. Median consensus across the shared clock: each Gemini device votes its own independent estimate; the median outvotes a transiently-biased device. Restricted to Gemini devices (LegacyNSP/nPlay have independent clocks).
  5. CLIENT vs peer publish split: clock_offset_ns carries the consensus value for CLIENT readers; a new clock_raw_offset_ns carries the independent estimate for peer voting (so the median still tracks common-mode drift).
  6. pycbsdk re-calibration: device_to_monotonic re-measures the monotonic↔steady offset on drift (cheap 1 Hz probe, full recal >1 ms) to survive host sleep on macOS.

Results

Live NSP + HUB1 + HUB2, cold starts with STANDBY between iterations:

| Stage | Worst cold-start cross-device disagreement |
|---|---|
| Before | up to 269 ms, plateaus, raw-clock latch |
| Per-device discipline | mostly <0.1 ms, intermittent ~2.5 ms source bias |
| + consensus | 0.0000 ms (21/21 iterations across runs), 0 raw-clock leaks |

Tests

  • Unit (always-on): ClockSync invariants incl. RED→GREEN bug-pinning tests (no raw-clock leak, external-offset sanity, latch recovery, ≥3-probe reliability, discipline + stepout); cross-device consensus over the real ShmemSession exchange; shmem consensus/raw field separation; pycbsdk recalibration (drift / within-threshold / probe-gate). cbshm 101, cbdev 31, clock_sync 18, multidevice 6, pycbsdk recal 3 — all green.
  • Hardware-gated live canary (tests/integration/test_multidevice_clock_sync.cpp, CERELINK_HW_MULTIDEVICE=1): rotates bring-up order, samples across the cold-start transition, asserts no raw-clock leak at every sample and ≤1 ms cross-device agreement after settle.

Notes / known

  • Single-device nPlay clock integration tests flake on the marginal 3 s sync wait — pre-existing, not from this change (failure set varies between runs; nPlay's clock path is unchanged here).
  • Native shmem layout gains clock_raw_offset_ns/clock_raw_valid — internal NATIVE layout only.

🤖 Generated with Claude Code

cboulay and others added 8 commits June 19, 2026 18:04
A multi-device rig (Gemini NSP + HUBs) shares one PTP device clock. The NSP,
whose own probe timing is unreliable, borrows a peer HUB's offset via shared
memory. Previously that external offset took unconditional priority in
ClockSync and, once set, was never re-evaluated, so a stale/wrong-epoch peer
value latched and made the NSP report host timestamps as the raw device clock.

Treat an external offset as a candidate: adopt it only when it agrees with the
device's own probe/data evidence (within max(1s, 4x internal uncertainty)), and
re-check it on every probe/data sample so it cannot latch to a stale value. An
implausible external offset is rejected and the internal estimate is used.
Also require at least 3 probes before the probe path is reported reliable; a
single or wildly inconsistent pair of probes no longer counts as reliable.
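
A minimal Python sketch of the gate described above, for illustration only. The real logic lives in the C++ ClockSync class; the names Estimate, select_offset, probes_reliable and the spread_ok flag are hypothetical, and the inclusive comparison is an assumption. Only the max(1 s, 4x internal uncertainty) bound and the 3-probe minimum come from this change.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NS_PER_S = 1_000_000_000


@dataclass
class Estimate:
    """Hypothetical container for the device's own probe/data estimate."""
    offset_ns: int        # device -> host offset from the device's own evidence
    uncertainty_ns: int   # error bound on that offset


def select_offset(internal: Estimate, external_ns: Optional[int]) -> Tuple[int, str]:
    """Pick the offset to use for this sample.

    Meant to be called on every probe/data sample, so a peer value that
    stops agreeing with local evidence is dropped instead of latched.
    """
    if external_ns is None:
        return internal.offset_ns, "internal"
    # Bound stated in the commit message: max(1 s, 4x internal uncertainty).
    tolerance_ns = max(NS_PER_S, 4 * internal.uncertainty_ns)
    if abs(external_ns - internal.offset_ns) <= tolerance_ns:
        return external_ns, "external"
    # Implausible (stale or wrong-epoch) peer value: fall back to own evidence.
    return internal.offset_ns, "internal"


def probes_reliable(probe_count: int, spread_ok: bool) -> bool:
    """The probe path counts as reliable only with at least 3 consistent probes."""
    return probe_count >= 3 and spread_ok
```

With an internal offset near -1.77e18 ns, a peer value of 0 (the shape of a wrong-epoch offset) is roughly 1.77e18 ns away from local evidence, so the gate rejects it and the internal estimate is kept.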

Add unit tests for the device->host mapping invariants (no raw-clock leak,
data-fallback bounds, external-offset sanity, latch recovery, probe
reliability), cross-device consistency over the shared-clock topology driven
through the real ShmemSession exchange, and a hardware-gated live canary for
the NSP + HUB1 + HUB2 rig.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The NSP previously borrowed a peer HUB's clock offset from the first HUB
segment that opened, tried only once, and stopped refreshing once its own
reported uncertainty changed after adoption — so a HUB that came up later, or a
HUB offset that went stale, was never reconsidered.

Open every HUB config segment (retrying ones not yet available), read each
HUB's published offset every cycle, and inject the lowest-uncertainty one.
ClockSync sanity-checks the borrowed offset against the NSP's own data
evidence, so a stale or implausible HUB value is ignored instead of latched.
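
The per-cycle selection can be sketched roughly as follows. This is an illustration in Python, not the SDK code: PeerOffset and pick_peer are hypothetical names, and a segment that has not opened yet is modelled as None.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class PeerOffset:
    """Hypothetical snapshot of one HUB's published offset."""
    name: str
    offset_ns: int
    uncertainty_ns: int
    valid: bool


def pick_peer(peers: Iterable[Optional[PeerOffset]]) -> Optional[PeerOffset]:
    """Return the valid peer with the lowest uncertainty, or None.

    Intended to run every cycle over every HUB segment, so a HUB that
    opens late or improves its estimate is picked up on a later cycle.
    """
    candidates = [p for p in peers if p is not None and p.valid]
    if not candidates:
        return None
    return min(candidates, key=lambda p: p.uncertainty_ns)
```

The chosen value is still only a candidate: it is injected into ClockSync, which applies its own sanity check before adopting it.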

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The committed device->host offset previously jumped to whatever the latest
probe/data sample selected, so a noisy or transient estimate moved it by several
ms and devices sharing one PTP clock disagreed by up to ~270 ms on a cold
start.

Run the internal (probe/data) estimate through a commit discipline: a change
within a 1 ms deadband is applied as-is, a larger change up to 50 ms is slewed a
fraction per sample, and a change beyond that is treated as a step accepted only
after it persists, so a one-off outlier no longer moves the offset. A plausible
peer offset stays authoritative and is adopted directly; a change of source
(peer<->internal) or a detected clock wrap re-acquires immediately rather than
slewing.
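
A rough Python sketch of the three regimes, under stated assumptions: the 1 ms deadband and 50 ms slew limit come from this commit, while the slew fraction (0.25), the persistence count (3), and the class and field names are hypothetical placeholders. The peer-adoption and clock-wrap re-acquire paths are left out.

```python
from dataclasses import dataclass

NS_PER_MS = 1_000_000


@dataclass
class Discipline:
    deadband_ns: int = 1 * NS_PER_MS
    slew_limit_ns: int = 50 * NS_PER_MS
    slew_fraction: float = 0.25   # assumed value, not taken from the source
    step_persist: int = 3         # assumed value, not taken from the source
    committed_ns: int = 0
    _step_count: int = 0

    def update(self, estimate_ns: int) -> int:
        """Feed one internal (probe/data) estimate; return the committed offset."""
        error = estimate_ns - self.committed_ns
        if abs(error) <= self.deadband_ns:
            # Small change: apply as-is.
            self.committed_ns = estimate_ns
            self._step_count = 0
        elif abs(error) <= self.slew_limit_ns:
            # Medium change: move a fraction of the way per sample.
            self.committed_ns += int(error * self.slew_fraction)
            self._step_count = 0
        else:
            # Large change: accept only after it persists for several samples.
            self._step_count += 1
            if self._step_count >= self.step_persist:
                self.committed_ns = estimate_ns
                self._step_count = 0
        return self.committed_ns
```

In this sketch a single 200 ms outlier leaves the committed offset untouched, because the next in-range sample resets the persistence counter before it reaches the threshold.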

On the live NSP + HUB1 + HUB2 rig this collapses cold-start cross-device
disagreement from ~100-270 ms (with multi-ms spikes and plateaus) to under
0.1 ms across the whole convergence window. The live test now enforces the 1 ms
cross-device bound as a steady-state requirement (after an 8 s settle), while
the no-raw-clock-leak check still applies at every sample.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The commit discipline could hold a committed offset off-target indefinitely if
a large jump never reached the step-persistence count (e.g. an estimate
oscillating around the slew/step boundary) and no peer was available to break
the tie. Add a stepout: after the committed offset stays beyond the deadband
for stepout_samples consecutive samples, re-acquire to the current estimate.
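
The backstop amounts to a consecutive-sample counter. The Python sketch below isolates it from the rest of the discipline; the Stepout class is hypothetical, and only the stepout_samples parameter name and the "beyond the deadband" condition come from this commit.

```python
class Stepout:
    """Counts consecutive samples where the committed offset is off-target."""

    def __init__(self, deadband_ns: int, stepout_samples: int) -> None:
        self.deadband_ns = deadband_ns
        self.stepout_samples = stepout_samples
        self.count = 0

    def observe(self, committed_ns: int, estimate_ns: int) -> int:
        """Return the offset to commit after seeing one more estimate."""
        if abs(estimate_ns - committed_ns) > self.deadband_ns:
            self.count += 1
        else:
            self.count = 0
        if self.count >= self.stepout_samples:
            # Stuck off-target for too long: re-acquire to the current estimate.
            self.count = 0
            return estimate_ns
        return committed_ns
```

For example, with stepout_samples set to 4 and a committed offset held at 0, estimates alternating between 49 ms and 51 ms trigger a re-acquire on the fourth sample even though the step-persistence path never fires.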

Add unit tests for the discipline paths (slew damping, step-after-persistence,
and the stepout re-acquire), driven deterministically through a tailored Config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A single device's offset estimate can be transiently source-biased (e.g. a HUB
whose cold-start probe RTT asymmetry skews the midpoint by a few ms for several
seconds). The per-device discipline faithfully tracks that biased source, so
devices sharing one PTP clock could still disagree by multiple ms.

Each Gemini device now publishes its own independent estimate and, every cycle,
takes the median of its estimate plus every peer's published estimate as its
committed offset. With >=3 participants the median outvotes a single biased
device; all participants read the same set of estimates and converge on the
same value, so cross-device time conversion agrees. Voting on the independent
(not the consensus) estimate keeps the median able to track real common-mode
drift. With fewer than 3 participants an NSP still borrows a HUB's offset and
other devices keep their own. nPlay/custom devices have no PTP peers and skip
consensus entirely.
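
A small Python sketch of the vote, for illustration. consensus_offset is a hypothetical name, and using the lower median for an even number of votes is an assumption; the commit only describes the median with 3 or more participants.

```python
from statistics import median_low
from typing import Optional, Sequence


def consensus_offset(own_ns: int, peers_ns: Sequence[int],
                     min_participants: int = 3) -> Optional[int]:
    """Median of this device's independent estimate plus every peer's.

    Returns None with fewer than min_participants votes, in which case the
    caller falls back to its non-consensus behaviour (borrow or own estimate).
    """
    votes = [own_ns, *peers_ns]
    if len(votes) < min_participants:
        return None
    # median_low always returns one of the actual votes and stays an int.
    return median_low(votes)
```

Because every participant sees the same multiset of votes, the result does not depend on which device computes it: with votes of 1000 ns, 1010 ns and a biased 3.5 ms, each device commits 1010 ns.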

Expose ClockSync::getInternalOffsetNs (own-evidence estimate, pre-consensus)
through IDeviceSession for the independent vote.

On the live NSP + HUB1 + HUB2 rig this holds cross-device agreement at ~0 ms
across cold starts (15/15 iterations), eliminating the intermittent multi-ms
source-bias transients.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Only Gemini devices (NSP + HUBs) share one PTP clock. Legacy NSP and nPlay have
independent clocks, so their offsets must not be combined with a HUB's via
consensus or borrow. Drop LEGACY_NSP from the consensus gate and the borrow
fallback so only Gemini devices participate; legacy/nPlay/custom keep their own
estimate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CLIENT-mode consumers read clock_offset_ns for time conversion, but with cross-
device consensus that field carried each device's own (possibly transiently
biased) estimate, so a CLIENT could see a different offset than the device used.

Add a separate clock_raw_offset_ns field to the native config segment: peers
vote on it for consensus (so the median can still track common-mode drift),
while clock_offset_ns now publishes the committed (post-consensus) value for
CLIENT readers. Live NSP + HUB1 + HUB2 cross-device agreement is unchanged
(~0 ms across cold starts).
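
The split can be pictured with the Python sketch below. The field names clock_offset_ns, clock_raw_offset_ns and clock_raw_valid come from this change; the ConfigSegment class, the helper functions, and the exact meaning of clock_raw_valid (here: "a raw estimate has been published") are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ConfigSegment:
    """Hypothetical stand-in for the clock fields of the native config segment."""
    clock_offset_ns: int = 0        # committed (post-consensus) value for CLIENT readers
    clock_raw_offset_ns: int = 0    # independent own-evidence estimate for peer voting
    clock_raw_valid: bool = False   # assumed: True once a raw estimate is published


def publish(seg: ConfigSegment, raw_ns: Optional[int], committed_ns: int) -> None:
    """Write both values; the raw field is marked invalid when there is no estimate."""
    seg.clock_offset_ns = committed_ns
    seg.clock_raw_valid = raw_ns is not None
    if raw_ns is not None:
        seg.clock_raw_offset_ns = raw_ns


def client_offset(seg: ConfigSegment) -> int:
    """CLIENT-mode readers convert time with the committed value."""
    return seg.clock_offset_ns


def peer_votes(segments: Iterable[ConfigSegment]) -> List[int]:
    """Peers vote on the raw estimates only, never on the consensus output."""
    return [s.clock_raw_offset_ns for s in segments if s.clock_raw_valid]
```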

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
device_to_monotonic() chains the device->steady offset with a monotonic<->steady
offset that was measured only once at session creation. Those two host clocks
can diverge over a long session — most notably across host sleep on macOS, where
steady_clock (mach_continuous_time) advances during sleep but time.monotonic()
(mach_absolute_time) does not — leaving the conversion stale.

Elapsed monotonic time cannot detect sleep, so take a cheap one-sample probe of
the current offset at most once per second and run a full re-calibration only
when it has drifted past 1 ms. Add hardware-free unit tests for the drift,
within-threshold, and probe-gate paths.
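
A hedged Python sketch of the probe-then-recalibrate pattern, not the pycbsdk implementation. OffsetTracker and its methods are hypothetical, both clocks are injected as callables returning nanoseconds (Python has no direct handle on the C++ steady clock), and taking the median of 5 samples as the "full" calibration is an assumption. The 1 s probe gate and 1 ms drift threshold are the values stated above.

```python
from typing import Callable


class OffsetTracker:
    """Tracks monotonic - steady, re-measuring only when a cheap probe shows drift."""

    def __init__(self, steady_ns: Callable[[], int], monotonic_ns: Callable[[], int],
                 probe_interval_ns: int = 1_000_000_000,
                 threshold_ns: int = 1_000_000) -> None:
        self._steady_ns = steady_ns
        self._monotonic_ns = monotonic_ns
        self._probe_interval_ns = probe_interval_ns
        self._threshold_ns = threshold_ns
        self.recal_count = 0
        self._offset_ns = self._full_calibration()
        self._last_probe_ns = self._monotonic_ns()

    def _sample(self) -> int:
        # One cheap reading of the current offset between the two host clocks.
        return self._monotonic_ns() - self._steady_ns()

    def _full_calibration(self, n: int = 5) -> int:
        # Assumed strategy: median of several samples to reject scheduling noise.
        samples = sorted(self._sample() for _ in range(n))
        self.recal_count += 1
        return samples[n // 2]

    def offset_ns(self) -> int:
        now = self._monotonic_ns()
        # Probe gate: at most one probe per interval.
        if now - self._last_probe_ns >= self._probe_interval_ns:
            self._last_probe_ns = now
            # Recalibrate only when the probe has drifted past the threshold.
            if abs(self._sample() - self._offset_ns) > self._threshold_ns:
                self._offset_ns = self._full_calibration()
        return self._offset_ns
```

Injecting the clocks is also what makes the drift, within-threshold and probe-gate paths testable without hardware: a test can advance the fake steady clock by 5 s without advancing the monotonic one to simulate host sleep.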

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The peer raw-offset reader was added only to the POSIX branch, breaking the
Windows build (PeerClockReader has no getRawOffsetNs there). Add the matching
no-op stub.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cboulay cboulay merged commit cd14a57 into master Jun 20, 2026
24 of 26 checks passed
@cboulay cboulay deleted the 185-multi-device-clock-sync branch June 20, 2026 02:40

Development

Successfully merging this pull request may close these issues.

Multi-device clock sync: unbounded/glitchy device→host offset (raw-clock latch + cross-device instability)
