Fix multi-device clock synchronization (sanity-gate, discipline, consensus) by cboulay · Pull Request #186 · CerebusOSS/CereLink

cboulay · 2026-06-20T00:35:15Z

Fixes #185.

Problem

On a multi-device Gemini rig (NSP + HUB1 + HUB2 sharing one PTP clock) the device(PTP)→host time mapping was unreliable. The NSP could intermittently latch a bad or stale peer offset and report host timestamps as the raw device/PTP clock (~1.77e18 ns). Even when it did not, devices sharing one clock disagreed by multiple ms on cold start.

Root causes (all in the offset-commit path):

- An externally-borrowed peer offset was adopted with unconditional priority and no sanity bound, and once set was never re-evaluated (latched for the session).
- The SDK borrowed from the first HUB that opened, once, and stopped refreshing after adoption.
- probeSpreadOk() reported 1–2 probes as "reliable".
- The committed offset jumped to whatever the latest sample selected, so jitter/transients skewed it.

Fix (per-device → cross-device)

1. Sanity-gate borrowed offsets + require ≥3 probes: an external offset is a candidate, adopted only if consistent with the device's own evidence and re-checked every sample (can't latch); a stale/wrong-epoch value is rejected.
2. Borrow from all HUBs, refreshed: lowest-uncertainty wins, retried each cycle.
3. Commit discipline: deadband (snap <1 ms), slew (≤50 ms), step-with-persistence (large), plus a stepout backstop for a stuck offset with no peer to break the tie.
4. Median consensus across the shared clock: each Gemini device votes its own independent estimate; the median outvotes a transiently-biased device. Restricted to Gemini devices (LegacyNSP/nPlay have independent clocks).
5. CLIENT vs peer publish split: clock_offset_ns carries the consensus value for CLIENT readers; a new clock_raw_offset_ns carries the independent estimate for peer voting (so the median still tracks common-mode drift).
6. pycbsdk re-calibration: device_to_monotonic re-measures the monotonic↔steady offset on drift (cheap 1 Hz probe, full recal >1 ms) to survive host sleep on macOS.

Results

Live NSP + HUB1 + HUB2, cold starts with STANDBY between iterations:

Stage	Worst cold-start cross-device disagreement
Before	up to 269 ms, plateaus, raw-clock latch
Per-device discipline	mostly <0.1 ms, intermittent ~2.5 ms source bias
+ consensus	0.0000 ms (21/21 iterations across runs), 0 raw-clock leaks

Tests

Unit (always-on): ClockSync invariants incl. RED→GREEN bug-pinning tests (no raw-clock leak, external-offset sanity, latch recovery, ≥3-probe reliability, discipline + stepout); cross-device consensus over the real ShmemSession exchange; shmem consensus/raw field separation; pycbsdk recalibration (drift / within-threshold / probe-gate). cbshm 101, cbdev 31, clock_sync 18, multidevice 6, pycbsdk recal 3 — all green.
Hardware-gated live canary (tests/integration/test_multidevice_clock_sync.cpp, CERELINK_HW_MULTIDEVICE=1): rotates bring-up order, samples across the cold-start transition, asserts no raw-clock leak at every sample and ≤1 ms cross-device agreement after settle.

Notes / known

Single-device nPlay clock integration tests flake on the marginal 3 s sync wait — pre-existing, not from this change (failure set varies between runs; nPlay's clock path is unchanged here).
Native shmem layout gains clock_raw_offset_ns/clock_raw_valid — internal NATIVE layout only.

🤖 Generated with Claude Code

A multi-device rig (Gemini NSP + HUBs) shares one PTP device clock. The NSP, whose own probe timing is unreliable, borrows a peer HUB's offset via shared memory. Previously that external offset took unconditional priority in ClockSync and, once set, was never re-evaluated, so a stale/wrong-epoch peer value latched and made the NSP report host timestamps as the raw device clock.

Treat an external offset as a candidate: adopt it only when it agrees with the device's own probe/data evidence (within max(1 s, 4x internal uncertainty)), and re-check it on every probe/data sample so it cannot latch to a stale value. An implausible external offset is rejected and the internal estimate is used.

Also require at least 3 probes before the probe path is reported reliable; a single probe, or a wildly inconsistent pair of probes, no longer counts as reliable.

Add unit tests for the device->host mapping invariants (no raw-clock leak, data-fallback bounds, external-offset sanity, latch recovery, probe reliability), cross-device consistency over the shared-clock topology driven through the real ShmemSession exchange, and a hardware-gated live canary for the NSP + HUB1 + HUB2 rig.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
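The sanity gate described in this commit can be sketched in Python as follows. This is an illustration only, not the ClockSync code: the function names are hypothetical, the inclusive comparison is an assumption, and the probe check here counts probes only, ignoring whatever spread test probeSpreadOk() applies.

```python
NS_PER_S = 1_000_000_000
MIN_RELIABLE_PROBES = 3  # from the commit text: at least 3 probes


def external_offset_plausible(external_ns, internal_ns, internal_uncertainty_ns):
    """True when a borrowed offset agrees with the device's own estimate."""
    # Tolerance from the commit text: max(1 s, 4x internal uncertainty).
    # Whether the bound is inclusive is an assumption of this sketch.
    tolerance_ns = max(NS_PER_S, 4 * internal_uncertainty_ns)
    return abs(external_ns - internal_ns) <= tolerance_ns


def select_offset(external_ns, internal_ns, internal_uncertainty_ns):
    """Called on every sample, so a stale external value cannot latch."""
    if external_ns is not None and external_offset_plausible(
        external_ns, internal_ns, internal_uncertainty_ns
    ):
        return external_ns
    # Implausible or missing external offset: fall back to own evidence.
    return internal_ns


def probe_path_reliable(probe_count):
    """Hypothetical count-only check; one or two probes are not enough."""
    return probe_count >= MIN_RELIABLE_PROBES
```

Because the gate is a pure function of the current sample, there is no adopted state to go stale: a peer value that was plausible a moment ago is dropped as soon as it stops agreeing with the internal estimate.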

The NSP previously borrowed a peer HUB's clock offset from the first HUB segment that opened, tried only once, and stopped refreshing once its own reported uncertainty changed after adoption, so a HUB that came up later, or a HUB offset that went stale, was never reconsidered.

Open every HUB config segment (retrying ones not yet available), read each HUB's published offset every cycle, and inject the lowest-uncertainty one. ClockSync sanity-checks the borrowed offset against the NSP's own data evidence, so a stale or implausible HUB value is ignored instead of latched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
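A minimal sketch of the per-cycle selection, assuming each HUB segment yields either a published offset with an uncertainty or nothing yet. `PeerOffset` and `pick_lowest_uncertainty` are hypothetical names for illustration, not SDK types.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class PeerOffset:
    name: str
    offset_ns: int
    uncertainty_ns: int


def pick_lowest_uncertainty(
    peers: Iterable[Optional[PeerOffset]],
) -> Optional[PeerOffset]:
    """Return the peer with the smallest uncertainty, or None if none is readable.

    A None entry stands for a HUB segment that could not be opened this cycle;
    it is simply skipped and retried on the next call.
    """
    available = [p for p in peers if p is not None]
    return min(available, key=lambda p: p.uncertainty_ns, default=None)
```

Running the selection every cycle over all HUBs means a HUB that comes up late is picked up as soon as its segment opens, and the winner can change whenever the uncertainties do.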

The committed device->host offset previously jumped to whatever the latest probe/data sample selected, so a noisy or transient estimate moved it by several ms and devices sharing one PTP clock disagreed by up to tens of ms on a cold start.

Run the internal (probe/data) estimate through a commit discipline: a change within a 1 ms deadband is applied as-is, a larger change up to 50 ms is slewed a fraction per sample, and a change beyond that is treated as a step accepted only after it persists, so a one-off outlier no longer moves the offset. A plausible peer offset stays authoritative and is adopted directly; a change of source (peer<->internal) or a detected clock wrap re-acquires immediately rather than slewing.

On the live NSP + HUB1 + HUB2 rig this collapses cold-start cross-device disagreement from ~100-270 ms (with multi-ms spikes and plateaus) to under 0.1 ms across the whole convergence window. The live test now enforces the 1 ms cross-device bound as a steady-state requirement (after an 8 s settle), while the no-raw-clock-leak check still applies at every sample.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
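The three regimes of the commit discipline can be sketched as a small state machine. The 1 ms and 50 ms thresholds come from the text; the slew fraction and the persistence count are assumptions chosen for illustration, as is counting any run of consecutive large deltas as "persisting". Peer adoption and the source-change/wrap re-acquire path are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Discipline:
    deadband_ns: int = 1_000_000       # from the text: snap below 1 ms
    slew_limit_ns: int = 50_000_000    # from the text: slew up to 50 ms
    slew_fraction: float = 0.25        # assumption: fraction applied per sample
    step_persistence: int = 3          # assumption: samples before a step is accepted
    committed_ns: Optional[int] = None
    step_count: int = 0

    def update(self, estimate_ns: int) -> int:
        if self.committed_ns is None:
            # First sample: nothing to discipline against, so acquire directly.
            self.committed_ns = estimate_ns
            return self.committed_ns
        delta = estimate_ns - self.committed_ns
        if abs(delta) < self.deadband_ns:
            # Small change: apply as-is.
            self.committed_ns = estimate_ns
            self.step_count = 0
        elif abs(delta) <= self.slew_limit_ns:
            # Medium change: move only a fraction of the way per sample.
            self.committed_ns += int(delta * self.slew_fraction)
            self.step_count = 0
        else:
            # Large change: hold until it has persisted long enough.
            self.step_count += 1
            if self.step_count >= self.step_persistence:
                self.committed_ns = estimate_ns
                self.step_count = 0
        return self.committed_ns
```

With these settings a single 300 ms outlier leaves the committed offset untouched, while the same value seen on three consecutive samples is accepted as a genuine step.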

The commit discipline could hold a committed offset off-target indefinitely if a large jump never reached the step-persistence count (e.g. an estimate oscillating around the slew/step boundary) and no peer was available to break the tie.

Add a stepout: after the committed offset stays beyond the deadband for stepout_samples consecutive samples, re-acquire to the current estimate.

Add unit tests for the discipline paths (slew damping, step-after-persistence, and the stepout re-acquire), driven deterministically through a tailored Config.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
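The stepout backstop amounts to a consecutive-sample counter. The sketch below assumes it can be modelled separately from the discipline itself; `StepoutGuard` and `should_reacquire` are hypothetical names, and the default `stepout_samples` value is a placeholder rather than the value used in the Config.

```python
class StepoutGuard:
    """Signals a re-acquire once the offset has been off-target for too long."""

    def __init__(self, deadband_ns: int = 1_000_000, stepout_samples: int = 5):
        self.deadband_ns = deadband_ns
        self.stepout_samples = stepout_samples  # placeholder default
        self.off_target = 0

    def should_reacquire(self, committed_ns: int, estimate_ns: int) -> bool:
        if abs(estimate_ns - committed_ns) >= self.deadband_ns:
            # Still beyond the deadband: extend the off-target run.
            self.off_target += 1
        else:
            # Back on target: the run is broken.
            self.off_target = 0
        if self.off_target >= self.stepout_samples:
            self.off_target = 0
            return True
        return False
```

An estimate that alternates between 49 ms and 51 ms away never satisfies a step-persistence count that resets on each slew sample, but it does keep the off-target counter climbing, which is the case the stepout is meant to catch.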

A single device's offset estimate can be transiently source-biased (e.g. a HUB whose cold-start probe RTT asymmetry skews the midpoint by a few ms for several seconds). The per-device discipline faithfully tracks that biased source, so devices sharing one PTP clock could still disagree by multiple ms.

Each Gemini device now publishes its own independent estimate and, every cycle, takes the median of its estimate plus every peer's published estimate as its committed offset. With >=3 participants the median outvotes a single biased device; all participants read the same set of estimates and converge on the same value, so cross-device time conversion agrees. Voting on the independent (not the consensus) estimate keeps the median able to track real common-mode drift. With fewer than 3 participants an NSP still borrows a HUB's offset and other devices keep their own. nPlay/custom devices have no PTP peers and skip consensus entirely.

Expose ClockSync::getInternalOffsetNs (own-evidence estimate, pre-consensus) through IDeviceSession for the independent vote.

On the live NSP + HUB1 + HUB2 rig this holds cross-device agreement at ~0 ms across cold starts (15/15 iterations), eliminating the intermittent multi-ms source-bias transients.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
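The voting step can be illustrated with a few lines of Python. This is a sketch under stated assumptions: `consensus_offset` is a hypothetical name, and `median_low` is used so that an even number of voters still yields one of the published values, which the source does not specify. The fallback for fewer than three participants simply returns the device's own estimate; the NSP borrow path is not modelled.

```python
from statistics import median_low
from typing import Sequence

MIN_PARTICIPANTS = 3  # from the text: the median needs at least 3 voters


def consensus_offset(own_raw_ns: int, peer_raw_ns: Sequence[int]) -> int:
    """Median of this device's independent estimate and every peer's."""
    votes = [own_raw_ns, *peer_raw_ns]
    if len(votes) < MIN_PARTICIPANTS:
        # Too few voters for the median to outvote anyone.
        return own_raw_ns
    return median_low(votes)
```

Since every device computes the median over the same set of raw estimates, each arrives at the same committed value regardless of which estimate is its own. For example, with raw estimates of 100 ns, 2,500,100 ns (a HUB biased by 2.5 ms) and 110 ns, all three devices commit 110 ns.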

Only Gemini devices (NSP + HUBs) share one PTP clock. Legacy NSP and nPlay have independent clocks, so their offsets must not be combined with a HUB's via consensus or borrow. Drop LEGACY_NSP from the consensus gate and the borrow fallback so only Gemini devices participate; legacy/nPlay/custom keep their own estimate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

CLIENT-mode consumers read clock_offset_ns for time conversion, but with cross-device consensus that field carried each device's own (possibly transiently biased) estimate, so a CLIENT could see a different offset than the device used.

Add a separate clock_raw_offset_ns field to the native config segment: peers vote on it for consensus (so the median can still track common-mode drift), while clock_offset_ns now publishes the committed (post-consensus) value for CLIENT readers.

Live NSP + HUB1 + HUB2 cross-device agreement is unchanged (~0 ms across cold starts).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
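The split between the two published fields can be modelled as below. The field names clock_offset_ns, clock_raw_offset_ns and clock_raw_valid come from the PR; the `ClockFields` container and the three helper functions are hypothetical stand-ins for the shared-memory segment and its accessors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClockFields:
    clock_offset_ns: int = 0        # committed (post-consensus), for CLIENT readers
    clock_raw_offset_ns: int = 0    # independent estimate, for peer voting
    clock_raw_valid: bool = False   # raw field is meaningful only when set


def publish(fields: ClockFields, raw_ns: int, committed_ns: int) -> None:
    """Writer side: publish both values each cycle."""
    fields.clock_raw_offset_ns = raw_ns
    fields.clock_raw_valid = True
    fields.clock_offset_ns = committed_ns


def read_for_client(fields: ClockFields) -> int:
    """CLIENT readers convert time with the same value the device committed."""
    return fields.clock_offset_ns


def read_for_vote(fields: ClockFields) -> Optional[int]:
    """Peers vote on the raw estimate, or skip this device if it is not valid."""
    return fields.clock_raw_offset_ns if fields.clock_raw_valid else None
```

Keeping the vote on the raw field avoids a feedback loop in which devices would take the median of values that are themselves medians.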

device_to_monotonic() chains the device->steady offset with a monotonic<->steady offset that was measured only once at session creation. Those two host clocks can diverge over a long session, most notably across host sleep on macOS, where steady_clock (mach_continuous_time) advances during sleep but time.monotonic() (mach_absolute_time) does not, which leaves the conversion stale.

Elapsed monotonic time cannot detect sleep, so take a cheap one-sample probe of the current offset at most once per second and run a full re-calibration only when it has drifted past 1 ms.

Add hardware-free unit tests for the drift, within-threshold, and probe-gate paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
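The probe-then-recalibrate pattern can be sketched with injectable clock sources so it runs without hardware. Everything here is an assumption for illustration: `MonotonicSteadyOffset` is a hypothetical class, `steady_ns` stands in for however the session reads the steady clock, and the full calibration is modelled as a median of several probes, which may differ from what pycbsdk does.

```python
import time
from statistics import median

PROBE_INTERVAL_NS = 1_000_000_000  # from the text: probe at most once per second
DRIFT_THRESHOLD_NS = 1_000_000     # from the text: recalibrate past 1 ms of drift


class MonotonicSteadyOffset:
    def __init__(self, steady_ns, monotonic_ns=time.monotonic_ns, samples=5):
        self._steady_ns = steady_ns        # callable returning steady-clock ns
        self._monotonic_ns = monotonic_ns  # callable returning monotonic ns
        self._samples = samples            # odd, so the median is a real sample
        self.recal_count = 0
        self.offset_ns = self._calibrate()
        self._last_probe_ns = self._monotonic_ns()

    def _probe(self):
        # Cheap one-sample measurement of monotonic minus steady.
        return self._monotonic_ns() - self._steady_ns()

    def _calibrate(self):
        # Assumed "full" calibration: median of several probes.
        return median(self._probe() for _ in range(self._samples))

    def current_offset_ns(self):
        now = self._monotonic_ns()
        if now - self._last_probe_ns >= PROBE_INTERVAL_NS:
            self._last_probe_ns = now
            if abs(self._probe() - self.offset_ns) > DRIFT_THRESHOLD_NS:
                self.offset_ns = self._calibrate()
                self.recal_count += 1
        return self.offset_ns
```

The gate only limits how often the probe runs; the drift decision itself compares the two clocks directly, which is what lets it notice a sleep that the monotonic clock alone would not register.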

The peer raw-offset reader was added only to the POSIX branch, breaking the Windows build (PeerClockReader has no getRawOffsetNs there). Add the matching no-op stub.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

cboulay and others added 8 commits June 19, 2026 18:04

cboulay linked an issue Jun 20, 2026 that may be closed by this pull request

Multi-device clock sync: unbounded/glitchy device→host offset (raw-clock latch + cross-device instability) #185

Closed

cboulay merged commit cd14a57 into master Jun 20, 2026
24 of 26 checks passed

cboulay deleted the 185-multi-device-clock-sync branch June 20, 2026 02:40

cboulay mentioned this pull request Jun 20, 2026

Fix non-Gemini (nPlay) clock sync: convert tick timestamps to ns in the data-packet fallback #187

Merged

cboulay merged 9 commits into master from 185-multi-device-clock-sync
