Fix multi-device clock synchronization (sanity-gate, discipline, consensus) #186
Merged
A multi-device rig (Gemini NSP + HUBs) shares one PTP device clock. The NSP, whose own probe timing is unreliable, borrows a peer HUB's offset via shared memory. Previously that external offset took unconditional priority in ClockSync and, once set, was never re-evaluated, so a stale/wrong-epoch peer value latched and made the NSP report host timestamps as the raw device clock.

Treat an external offset as a candidate: adopt it only when it agrees with the device's own probe/data evidence (within max(1 s, 4x internal uncertainty)), and re-check it on every probe/data sample so it cannot latch to a stale value. An implausible external offset is rejected and the internal estimate is used.

Also require at least 3 probes before the probe path is reported reliable; a single probe or a wildly inconsistent pair of probes no longer counts as reliable.

Add unit tests for the device->host mapping invariants (no raw-clock leak, data-fallback bounds, external-offset sanity, latch recovery, probe reliability), cross-device consistency over the shared-clock topology driven through the real ShmemSession exchange, and a hardware-gated live canary for the NSP + HUB1 + HUB2 rig.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
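The adoption rule described in this commit can be sketched as follows. This is a minimal illustration, not the actual ClockSync code: the `OffsetEstimate` struct and the `externalIsPlausible` / `selectOffset` names are hypothetical, and only the max(1 s, 4x internal uncertainty) tolerance comes from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical type: the device's own (probe/data) estimate.
struct OffsetEstimate {
    int64_t offset_ns;       // device->host offset
    int64_t uncertainty_ns;  // error bound of that estimate, >= 0
};

constexpr int64_t kOneSecondNs = 1'000'000'000;

// True when the external (peer) offset agrees with the internal estimate
// within max(1 s, 4 x internal uncertainty).
inline bool externalIsPlausible(int64_t external_ns, const OffsetEstimate& internal) {
    assert(internal.uncertainty_ns >= 0);
    const int64_t tolerance = std::max(kOneSecondNs, 4 * internal.uncertainty_ns);
    const int64_t diff = external_ns > internal.offset_ns
                             ? external_ns - internal.offset_ns
                             : internal.offset_ns - external_ns;
    return diff <= tolerance;
}

// Meant to be called on every probe/data sample, so a peer value that has
// become implausible is dropped again instead of staying latched.
inline int64_t selectOffset(std::optional<int64_t> external_ns,
                            const OffsetEstimate& internal) {
    if (external_ns && externalIsPlausible(*external_ns, internal)) {
        return *external_ns;
    }
    return internal.offset_ns;
}
```

Because the gate is a pure function of the current inputs, re-running it per sample is what prevents latching: no state records that an external offset was accepted earlier.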
The NSP previously borrowed a peer HUB's clock offset from the first HUB segment that opened, tried only once, and stopped refreshing once its own reported uncertainty changed after adoption — so a HUB that came up later, or a HUB offset that went stale, was never reconsidered. Open every HUB config segment (retrying ones not yet available), read each HUB's published offset every cycle, and inject the lowest-uncertainty one. ClockSync sanity-checks the borrowed offset against the NSP's own data evidence, so a stale or implausible HUB value is ignored instead of latched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
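A rough sketch of the "read every HUB each cycle, inject the lowest-uncertainty one" selection follows. The `PeerOffset` struct and `pickBestPeer` function are hypothetical names for illustration; how the real code opens and retries segments is not shown.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical snapshot of what one HUB published this cycle.
struct PeerOffset {
    bool valid;              // false if the segment is not open yet or has no offset
    int64_t offset_ns;       // published device->host offset
    int64_t uncertainty_ns;  // published uncertainty, >= 0
};

// Returns the valid peer with the lowest uncertainty, or nullopt if none is usable.
inline std::optional<PeerOffset> pickBestPeer(const std::vector<PeerOffset>& peers) {
    std::optional<PeerOffset> best;
    for (const auto& p : peers) {
        if (!p.valid) continue;  // retried on the next cycle
        assert(p.uncertainty_ns >= 0);
        if (!best || p.uncertainty_ns < best->uncertainty_ns) best = p;
    }
    return best;
}
```

Running the selection every cycle means a HUB that comes up late is considered as soon as its segment opens, which is the behavior the commit describes.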
The committed device->host offset previously jumped to whatever the latest probe/data sample selected, so a noisy or transient estimate moved it by several ms and devices sharing one PTP clock disagreed by up to tens of ms on a cold start.

Run the internal (probe/data) estimate through a commit discipline: a change within a 1 ms deadband is applied as-is, a larger change up to 50 ms is slewed a fraction per sample, and a change beyond that is treated as a step accepted only after it persists, so a one-off outlier no longer moves the offset. A plausible peer offset stays authoritative and is adopted directly; a change of source (peer<->internal) or a detected clock wrap re-acquires immediately rather than slewing.

On the live NSP + HUB1 + HUB2 rig this collapses cold-start cross-device disagreement from ~100-270 ms (with multi-ms spikes and plateaus) to under 0.1 ms across the whole convergence window.

The live test now enforces the 1 ms cross-device bound as a steady-state requirement (after an 8 s settle), while the no-raw-clock-leak check still applies at every sample.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
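The deadband / slew / step rule can be illustrated with a small state machine. This is a sketch under assumptions: the 1 ms and 50 ms thresholds come from the commit message, but the slew fraction (0.25), the persistence count (3), and all type and member names are made up for the example. Peer adoption, source changes, and clock-wrap handling are left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical configuration; only the two thresholds are taken from the text.
struct DisciplineConfig {
    int64_t deadband_ns = 1'000'000;     // 1 ms: apply the change as-is
    int64_t slew_limit_ns = 50'000'000;  // 50 ms: slew up to here
    double slew_fraction = 0.25;         // assumed fraction applied per sample
    int step_persist_samples = 3;        // assumed persistence before a step
};

class OffsetDiscipline {
public:
    explicit OffsetDiscipline(DisciplineConfig cfg = {}) : cfg_(cfg) {
        assert(cfg_.slew_fraction > 0.0 && cfg_.slew_fraction <= 1.0);
    }

    // Feed one internal estimate; returns the committed offset.
    int64_t update(int64_t estimate_ns) {
        if (!initialized_) {  // first sample: acquire directly
            committed_ = estimate_ns;
            initialized_ = true;
            return committed_;
        }
        const int64_t delta = estimate_ns - committed_;
        const int64_t mag = delta < 0 ? -delta : delta;
        if (mag <= cfg_.deadband_ns) {
            committed_ = estimate_ns;  // small change: take it
            step_count_ = 0;
        } else if (mag <= cfg_.slew_limit_ns) {
            committed_ += static_cast<int64_t>(delta * cfg_.slew_fraction);
            step_count_ = 0;
        } else if (++step_count_ >= cfg_.step_persist_samples) {
            committed_ = estimate_ns;  // large change persisted: step
            step_count_ = 0;
        }  // otherwise: hold, a one-off outlier does not move the offset
        return committed_;
    }

private:
    DisciplineConfig cfg_;
    int64_t committed_ = 0;
    int step_count_ = 0;
    bool initialized_ = false;
};
```

Note that this sketch only counts consecutive large deltas; it does not check that successive outliers agree with each other, which a production step detector might also want.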
The commit discipline could hold a committed offset off-target indefinitely if a large jump never reached the step-persistence count (e.g. an estimate oscillating around the slew/step boundary) and no peer was available to break the tie. Add a stepout: after the committed offset stays beyond the deadband for stepout_samples consecutive samples, re-acquire to the current estimate. Add unit tests for the discipline paths (slew damping, step-after-persistence, and the stepout re-acquire), driven deterministically through a tailored Config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
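The stepout can be pictured as a separate counter sitting next to the discipline. In this sketch the `StepoutGuard` class and its interface are hypothetical; only the rule (re-acquire after the committed offset stays beyond the deadband for `stepout_samples` consecutive samples) comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: signals when the committed offset has been off-target
// for too long and should be re-acquired to the current estimate.
class StepoutGuard {
public:
    StepoutGuard(int64_t deadband_ns, int stepout_samples)
        : deadband_ns_(deadband_ns), stepout_samples_(stepout_samples) {
        assert(deadband_ns_ >= 0 && stepout_samples_ > 0);
    }

    // Call once per sample, after the discipline has produced committed_ns.
    // Returns true when the caller should set committed = estimate.
    bool observe(int64_t committed_ns, int64_t estimate_ns) {
        const int64_t delta = estimate_ns - committed_ns;
        const int64_t mag = delta < 0 ? -delta : delta;
        if (mag <= deadband_ns_) {
            count_ = 0;  // back on target: reset the run
            return false;
        }
        if (++count_ >= stepout_samples_) {
            count_ = 0;
            return true;  // off-target too long: re-acquire
        }
        return false;
    }

private:
    int64_t deadband_ns_;
    int stepout_samples_;
    int count_ = 0;
};
```

Unlike the step-persistence count, this counter does not care which branch (slew or step) the discipline took, so an estimate oscillating around the slew/step boundary still trips it.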
A single device's offset estimate can be transiently source-biased (e.g. a HUB whose cold-start probe RTT asymmetry skews the midpoint by a few ms for several seconds). The per-device discipline faithfully tracks that biased source, so devices sharing one PTP clock could still disagree by multiple ms.

Each Gemini device now publishes its own independent estimate and, every cycle, takes the median of its estimate plus every peer's published estimate as its committed offset. With >=3 participants the median outvotes a single biased device; all participants read the same set of estimates and converge on the same value, so cross-device time conversion agrees. Voting on the independent (not the consensus) estimate keeps the median able to track real common-mode drift.

With fewer than 3 participants an NSP still borrows a HUB's offset and other devices keep their own. nPlay/custom devices have no PTP peers and skip consensus entirely.

Expose ClockSync::getInternalOffsetNs (own-evidence estimate, pre-consensus) through IDeviceSession for the independent vote.

On the live NSP + HUB1 + HUB2 rig this holds cross-device agreement at ~0 ms across cold starts (15/15 iterations), eliminating the intermittent multi-ms source-bias transients.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
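The median vote can be sketched in a few lines. The function names are hypothetical, the lower-median choice for an even participant count is an assumption (the commit message does not say how ties are handled), and the NSP borrow path for fewer than 3 participants is omitted: the sketch simply keeps the device's own estimate in that case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Median of a non-empty set; for an even count this returns the lower median
// (an assumption, chosen so every participant picks the same element).
inline int64_t medianOffset(std::vector<int64_t> estimates) {
    assert(!estimates.empty());
    const auto mid = estimates.begin() +
                     static_cast<std::ptrdiff_t>((estimates.size() - 1) / 2);
    std::nth_element(estimates.begin(), mid, estimates.end());
    return *mid;
}

// own_ns: this device's independent (pre-consensus) estimate.
// peers_ns: every peer's published independent estimate.
inline int64_t consensusOffset(int64_t own_ns, const std::vector<int64_t>& peers_ns) {
    if (peers_ns.size() + 1 < 3) return own_ns;  // too few voters: no consensus
    std::vector<int64_t> all(peers_ns);
    all.push_back(own_ns);
    return medianOffset(std::move(all));
}
```

Since each device feeds the same multiset of independent estimates into the same function, all three devices commit the same value even when one of them is biased.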
Only Gemini devices (NSP + HUBs) share one PTP clock. Legacy NSP and nPlay have independent clocks, so their offsets must not be combined with a HUB's via consensus or borrow. Drop LEGACY_NSP from the consensus gate and the borrow fallback so only Gemini devices participate; legacy/nPlay/custom keep their own estimate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CLIENT-mode consumers read clock_offset_ns for time conversion, but with cross-device consensus that field carried each device's own (possibly transiently biased) estimate, so a CLIENT could see a different offset than the device used. Add a separate clock_raw_offset_ns field to the native config segment: peers vote on it for consensus (so the median can still track common-mode drift), while clock_offset_ns now publishes the committed (post-consensus) value for CLIENT readers. Live NSP + HUB1 + HUB2 cross-device agreement is unchanged (~0 ms across cold starts). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
device_to_monotonic() chains the device->steady offset with a monotonic<->steady offset that was measured only once at session creation. Those two host clocks can diverge over a long session — most notably across host sleep on macOS, where steady_clock (mach_continuous_time) advances during sleep but time.monotonic() (mach_absolute_time) does not — leaving the conversion stale. Elapsed monotonic time cannot detect sleep, so take a cheap one-sample probe of the current offset at most once per second and run a full re-calibration only when it has drifted past 1 ms. Add hardware-free unit tests for the drift, within-threshold, and probe-gate paths. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
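The probe gate and drift threshold can be sketched as below. The actual change lives in pycbsdk (Python); this C++ version only illustrates the logic, and the `DriftMonitor` class, its methods, and the injected `probe` callable are hypothetical. The once-per-second gate and the 1 ms threshold are taken from the commit message.

```cpp
#include <cassert>
#include <cstdint>

class DriftMonitor {
public:
    explicit DriftMonitor(int64_t calibrated_offset_ns)
        : calibrated_offset_ns_(calibrated_offset_ns) {}

    // now_ns: current monotonic reading.
    // probe: callable returning a one-sample monotonic<->steady offset in ns.
    // Returns true when a full re-calibration should be run.
    template <class ProbeFn>
    bool check(int64_t now_ns, ProbeFn&& probe) {
        if (has_probed_) {
            assert(now_ns >= last_probe_ns_);  // monotonic input expected
            if (now_ns - last_probe_ns_ < kProbeIntervalNs) return false;  // gate
        }
        last_probe_ns_ = now_ns;
        has_probed_ = true;
        const int64_t delta = probe() - calibrated_offset_ns_;
        const int64_t drift = delta < 0 ? -delta : delta;
        return drift > kDriftThresholdNs;
    }

    // Record the result of a full re-calibration.
    void recalibrated(int64_t new_offset_ns) { calibrated_offset_ns_ = new_offset_ns; }

private:
    static constexpr int64_t kProbeIntervalNs = 1'000'000'000;  // at most 1 probe/s
    static constexpr int64_t kDriftThresholdNs = 1'000'000;     // 1 ms
    int64_t calibrated_offset_ns_;
    int64_t last_probe_ns_ = 0;
    bool has_probed_ = false;
};
```

The probe has to compare the two clocks directly: elapsed monotonic time alone looks normal after a host sleep, which is exactly the case the commit is trying to catch.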
The peer raw-offset reader was added only to the POSIX branch, breaking the Windows build (PeerClockReader has no getRawOffsetNs there). Add the matching no-op stub. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fixes #185.
Problem
On a multi-device Gemini rig (NSP + HUB1 + HUB2 sharing one PTP clock) the device(PTP)→host time mapping was unreliable: the NSP could intermittently latch a bad/stale peer offset and report host timestamps as the raw device/PTP clock (~1.77e18 ns), and even when it didn't, devices sharing one clock disagreed by multiple ms on cold start.
Root causes (all in the offset-commit path):
probeSpreadOk() reported 1–2 probes as "reliable".
Fix (per-device → cross-device)
clock_offset_ns carries the consensus value for CLIENT readers; a new clock_raw_offset_ns carries the independent estimate for peer voting (so the median still tracks common-mode drift).
device_to_monotonic re-measures the monotonic↔steady offset on drift (cheap 1 Hz probe, full recal >1 ms) to survive host sleep on macOS.
Results
Live NSP + HUB1 + HUB2, cold starts with STANDBY between iterations:
Tests
ShmemSession exchange; shmem consensus/raw field separation; pycbsdk recalibration (drift / within-threshold / probe-gate). cbshm 101, cbdev 31, clock_sync 18, multidevice 6, pycbsdk recal 3; all green.
tests/integration/test_multidevice_clock_sync.cpp, CERELINK_HW_MULTIDEVICE=1): rotates bring-up order, samples across the cold-start transition, asserts no raw-clock leak at every sample and ≤1 ms cross-device agreement after settle.
Notes / known
clock_raw_offset_ns / clock_raw_valid: internal NATIVE layout only.
🤖 Generated with Claude Code