Skip to content

perf: device-task hot-path cleanups and dedicated rclrs spin thread#13

Merged
gsokoll merged 5 commits into
masterfrom
perf/rclrs-spin-thread
May 13, 2026
Merged

perf: device-task hot-path cleanups and dedicated rclrs spin thread#13
gsokoll merged 5 commits into
masterfrom
perf/rclrs-spin-thread

Conversation

@gsokoll

@gsokoll gsokoll commented May 13, 2026

Copy link
Copy Markdown
Collaborator

Four perf commits plus a CHANGELOG entry for v0.2.0. Reduces release binary size by ~22% and steady-state CPU by 35–50% across all five user-facing modes. No public API or topic-content changes.

Commits

  • dafab66 perf: tactical CPU/memory optimisations (release profile, atomics, buffers)
  • ba01b89 perf: drop internal-only UBX payloads from the device channel
  • c3f7a05 perf: eliminate remaining hot-path clones in the device task
  • d52a97b perf: run the rclrs executor on a dedicated thread instead of polling
  • 642ce62 docs: v0.2.0 CHANGELOG entry

What changed

Device-task hot path (dafab66, ba01b89, c3f7a05). An explicit [profile.release] block (fat LTO, codegen-units = 1, strip = "debuginfo") accounts for the binary size reduction. The clone-elimination commits remove ~9 per-epoch heap operations at 5 Hz worst-case load: internal-only UBX variants (NAV-COV, NAV-POSECEF, SEC-SIGLOG, RXM-COR, MON-COMMS/HW/RF, FixTypeChanged) no longer cross the mpsc channel; the remaining heavy payloads (SatInfo, SecSig, RelPosNed, SurveyIn) are moved into the channel send rather than cloned; seven write-only retained-message fields in IntegrityAggregator were removed. Other tactical: Vec::with_capacity in the UBX parsers, AtomicU64/U32 counter relocation out from under the async Mutex in DeviceTask/NtripTask, serial read buffer 1 KiB → 4 KiB, tokio features trimmed from "full" to ["rt-multi-thread", "macros", "sync", "time", "signal"].

Dedicated rclrs spin thread (d52a97b). Replaces the polling loop driving the rclrs executor (executor.spin(SpinOptions::spin_once().timeout(100ms)) + sleep(10ms) on the main tokio thread) with a dedicated OS thread running spin(SpinOptions::default()), which blocks in rcl_wait and returns the instant a callback is ready. Shutdown uses SpinOptions::until_promise_resolved (the rclrs mechanism that flips halt_spinning and triggers the wait set's guard conditions to wake), wired to a tokio oneshot dropped on shutdown. Effect: the main tokio thread no longer wakes ~9 times per second to drain an empty wait set, and ROS control-plane callbacks (parameter events, service calls, timers) are no longer bounded by the polling cycle. The publish path is unaffected — publisher.publish() is called directly from the tokio side — so data-plane fix-to-subscriber latency is unchanged.

Headline numbers (vs v0.1.0)

Measured across 30–60 min runs at 5 Hz worst-case load (all four constellations dual-band, every feature each mode allows), live hardware A/B on the same antennas:

  • Binary size: −22.3%
  • Process memory footprint (peak resident set size): −1.8% to −3.6% (smallest in moving_base, largest in mb_rover base / rover_ntrip)
  • Steady-state CPU: combining both halves of the work, −35% to −50% across modes; standalone is now ~1% CPU
  • RSS slope after warmup: 2.9–14 kB/min in every run, candidate and baseline within ~2× of each other — heap noise, not growth
  • /fix e2e p50: unchanged in standalone/moving_base; rover_ntrip improved (some of which is rmw_zenoh delivery variance run-to-run); tail (p99/max) trends favourable
  • Output: topic counts match within ±1 message over 30–60 min; no new /rosout WARN/ERROR

All 183 library tests pass unchanged.

Modes covered

standalone, rover_ntrip, moving_base, moving_base_rover, rover_radio. static_base was not exercised — none of these commits touch the survey-in or RTCM-output code paths.

gsokoll added 4 commits May 13, 2026 20:48
…ffers)

Bundle of low-risk changes:

- Cargo.toml: add [profile.release] with fat LTO, codegen-units = 1,
  strip = "debuginfo" for ~5–15% runtime and ~30–60% binary size wins.
  Trim tokio features from "full" to only the set actually used
  (rt-multi-thread, macros, sync, time, signal).
- UBX parsers: add Vec::with_capacity in NAV-SAT, SEC-SIG, SEC-SIGLOG and
  MON-COMMS parse paths to avoid the usual 4→8→16→32→64 growth cascade
  per message.
- Device/NTRIP task state: split internal state into a Mutex-guarded
  struct plus sibling AtomicU64/AtomicU32 counters. Hot-path counter
  increments (messages_received, rtcm_bytes_injected, bytes_received,
  messages_forwarded, connection_count) become lock-free fetch_add.
  The public snapshot type is unchanged.
- Device task: bump the serial read buffer from 1 KiB to 4 KiB so that
  a single USB-CDC frame carrying NAV-SAT + RTCM doesn't force multiple
  read syscalls and partial-frame parser state.

All 183 lib tests pass under cargo test --features ros2.
The device task's UbxProcessResult contains messages that are consumed
purely by the integrity aggregator (NAV-COV, NAV-POSECEF, SEC-SIGLOG,
RXM-COR, MON-COMMS, MON-HW, MON-RF, and fix-type transitions). They were
being cloned, enum-wrapped, sent across the mpsc channel, and then
dropped by the forwarding bridge because into_gnss_message() returned
None for those variants.

Handle them in place: call integrity.update_*() and skip the channel
send. This removes 7-8 clones and the corresponding channel sends per
GNSS epoch (~70-160 heap operations/sec at 10 Hz) and shrinks both
DeviceMessage and the main.rs forwarding bridge.

into_gnss_message() now returns GnssMessage directly. The forwarding
bridge becomes unconditional. No ROS topic content changes because
none of the removed variants was ever published.
After filter-before-send (previous commit), the four heavy payloads
that still crossed mpsc::channel — SatInfo, SecSig, RelPosNed, SurveyIn
— were being cloned by the device task before being sent. Move
ownership instead: update the integrity aggregator with a borrow, then
send the value by move.

While auditing: the IntegrityAggregator kept eight Option<T> "last_*"
fields, but only `last_nav_pl` is ever read back (by compute()). The
other seven were write-only dead retention, each adding a per-epoch
clone on top of the one already being sent through the channel. Drop
those fields and their assignments; the update_*() functions still take
the struct by reference to read the specific metric they care about.

Net effect per GNSS epoch: zero deep clones in process_serial_data for
the four heavy message types, and zero retention clones in the
integrity aggregator for the seven unused snapshots.
The old main loop polled `executor.spin(spin_once().timeout(100ms))` with a
`sleep(10ms)` between iterations — waking the main tokio thread ~9×/sec and
churning the rcl wait set for no reason. Move the executor onto its own OS
thread running `spin(SpinOptions::default())`: it blocks in `rcl_wait` and
wakes the instant a callback is ready. Shutdown goes through
`SpinOptions::until_promise_resolved` (the mechanism that flips the
executor's `halt_spinning` flag *and* triggers its guard conditions to wake
a blocked wait set), wired to a tokio oneshot that's dropped on shutdown.

Effect: at low message rates the polling loop was costing roughly half the
steady-state CPU — that goes away. Control-plane callbacks (parameter
events, service calls, timers) are serviced immediately rather than within
the ~110ms polling cycle. The data path is publish-only and never went
through the executor, so publish→subscribe latency is unchanged.

All 183 lib tests pass; clean startup/spin/publish/shutdown verified against
a live F9P in standalone mode.
@gsokoll gsokoll force-pushed the perf/rclrs-spin-thread branch from 4f199f5 to 7f9a29b Compare May 13, 2026 10:48
@gsokoll gsokoll force-pushed the perf/rclrs-spin-thread branch from 4451074 to 9327023 Compare May 13, 2026 10:57
@gsokoll gsokoll merged commit 69232c9 into master May 13, 2026
7 checks passed
@gsokoll gsokoll deleted the perf/rclrs-spin-thread branch May 13, 2026 11:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant