Skip to content

Release Candidate v6.14.0#1261

Merged
ruck314 merged 85 commits into
mainfrom
pre-release
Jun 15, 2026
Merged

Release Candidate v6.14.0#1261
ruck314 merged 85 commits into
mainfrom
pre-release

Conversation

@ruck314

@ruck314 ruck314 commented Jun 6, 2026

Copy link
Copy Markdown
Contributor

ruck314 and others added 30 commits April 18, 2026 19:55
…erage

Adds RoCEv2 (RDMA over Converged Ethernet v2) as a first-class Rogue
protocol and lifts CI line coverage on both the C++ core
(src/rogue/protocols/rocev2/) and the Python control layer
(python/pyrogue/protocols/_RoCEv2.py) to >=80% by exercising real
libibverbs against a Linux Soft-RoCE (rdma_rxe) device — not mocks.

RoCEv2 module
- C++ Core / Server in include/rogue/protocols/rocev2/ and
  src/rogue/protocols/rocev2/, wired into CMake, conda, and the Python
  module graph
- Python RoCEv2Server device (python/pyrogue/protocols/_RoCEv2.py) with
  15 LocalVariables, getChannel() (0..255 range validated), stream
  property, teardownFpgaQp(), and automatic FPGA-side QP teardown on
  _stop()
- Replaced opt-in -DROCEV2=ON with ROGUE_LINUX_BUILD; RoCEv2 ships
  unconditionally on Linux, macOS unaffected, libibverbs discovery
  gated to Linux only

Hardening
- Partial-setup rollback: PD/MR/QP torn down if any stage of
  _roce_setup_connection fails, then the original error re-raised
- Random PSN seeding via std::mt19937 + std::random_device for RC QPs
- pmtu validation in Server::completeConnection and Python _start()
- FpgaQpn / FpgaLkey LocalVariables cleared on teardown
- runThread slot-index bounds guarded; MR slab-overflow guarded;
  stale completion tags cleaned
- min_rnr_timer passed on RTR transition; RNR parameters documented
- _wait_resp zero-pulse polling window shortened from 200 ms to 20 ms
- Core/Server ctor initializer list order fixed; -Wunused-parameter
  silenced in Server::acceptFrame; Server POD members explicitly
  initialised
- Core/Server ownership and destructor lifecycle documented in Sphinx
  docs, C++ headers, and Python docstrings; Server::log_ shadow and
  inheritance chain documented
- ImportError text aligned with the new Linux build
- All-zero GID rejection: ibv_query_gid returns rc=0 on rdma_rxe for
  any gidIndex within gid_tbl_len (1024 on RXE), silently writing an
  all-zero GID for unpopulated slots, which then fails opaquely at
  QP->RTR. Validate non-zero GID and throw with the gidIndex and
  device name pinned in the message
- File-header Description: lines trimmed to slaclab convention
  (short headline), matching the surrounding codebase; full narrative
  lives in Sphinx docs and class-level comments

CI coverage uplift
- full_build_test pinned to ubuntu-24.04 brings up rdma_rxe / rxe0 so
  hardware-gated tests run against real ibverbs
- ROGUE_COVERAGE CMake option applies --coverage PRIVATE to rogue-core
  and rogue-core-static, producing gcov data for the rocev2 subtree
- gcovr Cobertura report filtered to the rocev2 subtree, uploaded via
  codecov-action@v5 alongside the Python pytest-cov report
- Reusable CI shell scripts extracted from rogue_ci.yml;
  check_whitespace.sh scoped to tracked files only

Tests
- tests/protocols/test_rocev2.py: 25 hardware-free + 7 hardware-gated
  tests covering encoders/decoders, metadata-bus round-trips, channel
  wiring, teardown, and live-rxe0 setup/teardown flows
- tests/protocols/test_rocev2_corner_cases.py: 79 MagicMock corner-case
  tests covering decoder/encoder completeness (resp_type 0..3, QP
  state nibble 0..15, _mk_bus 301-bit boundary, field masking and
  ordering, CREATE/MODIFY/DESTROY req_type), partial-setup rollback
  (PD, MR resp-type mismatch, QP-create, QP->INIT ok=false and
  wrong-state, QP->RTR, QP->RTS, hardware-free happy path), teardown
  step-timeout parametrics plus all-timeout pathological case,
  _wait_resp edges, server lifecycle (pmtu=1/5 boundary,
  teardownFpgaQp qpn=0 no-op, exception swallow+log, _stop call
  order, .stream identity), GID edges (0.0.0.0, 255.255.255.255,
  invalid IPv4), getChannel distinct-filter semantics, illegal QP
  transitions, length-bound edges, channel-id bounds
- tests/protocols/conftest.py: shared metadata-bus response-word
  builders (_pack_*_ok / _pack_*_fail / _pack_unknown_type),
  session-scoped rxe_server fixture, wait_for / MemoryRoot re-exports;
  mocked_rocev2_cpp fixture uses pytest.importorskip so the six
  fixture-consuming tests skip cleanly on macOS runners where the C++
  submodule is gated behind ROGUE_LINUX_BUILD
- tests/cpp/protocols/rocev2/: 4 doctest cases for Core/Server with
  RAII wrappers, including 2 live-rxe0 cases and 2 GeneralError /
  out-of-range cases
- 1 test is currently @pytest.mark.skip(...) pending follow-up (the
  live-server init flow); the 3 earlier corner-case deferrals were
  resolved

Known limitations (pinned by tests, not fixed here)
- 32-bit silent truncation in _mk_bus / _encode_alloc_mr: MR length
  >= 2**32 is masked to zero by the (1 << width) - 1 field mask. The
  FPGA-side RoceEngine protocol is 32-bit and the production path
  stays four orders of magnitude below the 4 GiB boundary.
- Python does not enforce the QP state machine: illegal transitions
  are detected only via the FPGA's decoded response (success=False
  plus the current state). The firmware is the authoritative state
  machine; a Python mirror would duplicate its enforcement and risk
  drifting out of sync.

Stability fixes (surfaced while stabilising CI, unrelated to RoCEv2)
- UDP packetizer integration test: enlarged SO_RCVBUF and widened the
  stall window for macOS arm64
- Stream-variable test: settled the enable-frame burst before sampling
  initial_count
- RSSI retran timeout bumped on the loopback clean-path and
  UDP-packetizer loopback tests to absorb GHA runner jitter

Sphinx docs
- 8 new .rst files integrated into the existing protocol toctrees
  alongside RSSI/SRP/Packetizer
- make html builds clean with zero RoCEv2-related warnings
- Jumbo-frame MTU, SoftRoCE rdma_rxe workflow, and surf RoceEngine
  dependency documented

No existing protocol APIs, stream/memory interfaces, or PyRogue device
tree are changed — RoCEv2 itself is purely additive; the only
behavioural change to existing code is enabling --coverage when
ROGUE_COVERAGE=ON.

C++ and Python source ported from FilMarini/rogue:rocev2. Attribution
comments preserved in each ported file pointing back to
FilMarini/rogue:rocev2. Slaclab standards take precedence over API
compatibility; deltas documented in module docstrings.
Corner case #1 (MR length overflow): add tests for the new _start()
guard (test_start_rejects_mr_length_overflow, parametric accept-side
boundary), update encoder-level test docstrings to note the upstream
guard, update resource-bounds section comment with Part D.

Corner case #2 (QP state machine): update comment block with full
design rationale — FPGA is single source of truth, no Python-side
mirror added.

Add "Reprogram the FPGA" assertions to PD/MR/QP failure tests to pin
the new stale-resource error messages.

Add Troubleshooting section to RoCEv2 Sphinx docs covering
stale-resource errors, MR length guard, and QP state-machine
enforcement.
Non-jumbo MTU (1500) splits each 2048-byte frame into 2 RSSI segments
vs 1 for jumbo, doubling the segment traffic through the out-of-order
stage. With period=10, this overwhelms RSSI's retransmit machinery on
macOS arm64 CI runners under pytest-xdist contention, exhausting the
retransmit budget and killing the session before any application frames
are delivered (stall at 0/200).

Use period=50 for non-jumbo links (8 reordered segments out of 400)
which still validates out-of-order handling without exceeding the RSSI
window budget. Jumbo links keep period=10 (20 reordered out of 200).
Override XDG_SESSION_TYPE to x11 before Qt initialization so Qt does not
emit the noisy "Ignoring XDG_SESSION_TYPE=wayland on Gnome" warning.
The override is conditional and only fires on Wayland sessions.
…mpatibility

SqlReader used deprecated 1.x APIs (bound MetaData, autoload=True,
list-wrapped select) and referenced an undefined self._conn attribute.
Replace with 2.x-compatible patterns (unbound MetaData, autoload_with,
engine.connect() context managers) that also work on 1.4+.

Add round-trip tests that write via SqlLogger and read via SqlReader.
…1/V2 SSI ordering

- Application / Transport: raw std::thread* member -> std::unique_ptr<std::thread>;
  remove manual new / delete.
- setController(): wrap std::thread spawn in try/catch; roll back threadEn_ and
  cntl_ if construction throws so the instance stays in its retryable
  pre-call state. Idempotent guard so a second call is a no-op (without it,
  replacing a still-joinable unique_ptr<std::thread> would invoke
  std::terminate via ~thread()).
- ControllerV1: move SSI setError(0x80) to BEFORE pushFrame() and
  tranFrame_[0].reset(), and apply it to tranFrame_[0] (the assembled
  frame) instead of tranFrame_[tmpDest]. Original code dereferenced a
  null shared_ptr for any tmpDest != 0, and silently dropped the SSI
  flag for tmpDest == 0.
- ControllerV2: same setError ordering fix - apply before push + reset
  so the downstream slave observes the error flag instead of the call
  hitting a just-reset (null) slot.
…s atomic

Concurrent first-callers could observe partially-initialised state.
Declares Version::init()'s statics atomic and guards the one-time
initialisation with std::call_once.
Remove tests/protocols/test_batcher_audit_repros.py — the file walked
InverterV1.cpp / InverterV2.cpp source text looking for bounds-check
tokens before the memcpy, rather than functionally exercising the
patched code paths. Per maintainer feedback that style of test is too
brittle to be useful as CI; the audit/guardrail intent is better
expressed as an AI agent directive at PR-review time.
…for transaction completion

pr.startTransaction() accepts a `check` keyword argument; when True, the
underlying rim::Block::startTransaction calls checkTransaction() to wait
for the slave to acknowledge the transaction before returning. Several
callers in pyrogue passed `checkEach=True` instead, which is silently
dropped into the function's **kwargs (the docstring marks them
"unused"). The wait was therefore disabled, allowing back-to-back
operations (notably BaseCommand.toggle which issues set(1); set(0)) to
race the shared blockData_ buffer: the second caller mutates the buffer
before the slave's worker has consumed the data pointer for the first
transaction, the slave observes the post-mutation value for both, and
the hardware never sees the rising edge. In the field this manifests
as "status counter reset never fires" on PGPv4 and any other
RemoteCommand that uses BaseCommand.toggle against an asynchronous
memory slave (SRP-over-UDP/TCP, network bridges, DMA-backed AXI
masters).

The typo was introduced by cff241e (2020-08-11, "Fix command and poll
queue") on RemoteCommand.set/get, and the same pattern was cloned into
RemoteVariable.post and Device.{write,read,verify}Blocks. The block-list
helpers in _Block.py (writeBlocks, readBlocks, verifyBlocks,
writeAndVerifyBlocks, readAndCheckBlocks) correctly translate
`checkEach=` to `check=` and are not affected.

Fixed call sites (all switched from `checkEach=` to `check=` when calling
pr.startTransaction):

  _Command.py:557  RemoteCommand.set
  _Command.py:583  RemoteCommand.get
  _Variable.py:1636  RemoteVariable.post
  _Device.py:534    Device.writeBlocks variable path
  _Device.py:539    Device.writeBlocks block-iteration path
  _Device.py:569    Device.verifyBlocks variable path
  _Device.py:574    Device.verifyBlocks block-iteration path
  _Device.py:607    Device.readBlocks variable path
  _Device.py:612    Device.readBlocks block-iteration path

Regression coverage: tests/core/test_transaction_check_kwarg_race.py
adds 9 contract-level tests (one per fixed site, asserting the correct
`check=` kwarg propagates to pr.startTransaction) and one end-to-end
behavioral test that exercises BaseCommand.toggle against an
asynchronous in-memory slave. The behavioral test mimics real
network/DMA slaves by deferring transaction.getData() to a worker
thread; with the fix the slave records [(addr, 1), (addr, 0)], without
the fix the slave records [(addr, 0), (addr, 0)].

An existing test in tests/core/test_core_structure_edge_cases.py was
reading `kwargs.get("checkEach")` in its monkeypatch of
pr.startTransaction, which silently captured None regardless of the
user-supplied checkEach value (it was implicitly validating the bug).
Updated to read `kwargs.get("check")` so it correctly verifies that
Device.forceCheckEach=True propagates to the transaction layer.
bengineerd and others added 23 commits June 1, 2026 12:43
…with `wait`

- Introduced `_warn_deprecated` and `_resolve_deprecated_bool` helper functions to manage deprecated argument warnings and resolve conflicts between new and old argument names.
- Updated `BaseVariable`, `RemoteVariable`, and `LocalVariable` classes to replace `check` with `wait` in method signatures and implementations.
- Modified transaction-related methods in `PollQueue`, `Root`, and other classes to use `wait` instead of `check`.
- Updated tests to ensure compatibility with the new `wait` argument and verify that deprecated `check` arguments trigger appropriate warnings.
- Ensured that all relevant documentation and comments reflect the changes from `check` to `wait`.
ci: bump GitHub Actions to latest majors (Node 20 deprecation)
Add RoCEv2 protocol module with Soft-RoCE CI coverage and hardening
PR #1232 (ESROGUE-711) fixed _varDone() pickling the live _updateList dict
by wrapping _varUpdate/_varDone in a threading.Lock plus a snapshot-and-swap.
The per-variable lock on _varUpdate is the hot path (one call per variable
per update group, on the already-serializing ZMQ flush), which drew an
"overly aggressive locking" concern on the merge.

Replace the shared dict + lock with a collections.deque:
- _varUpdate appends (path, value) -- a single GIL-atomic, lock-free op.
- _varDone drains the deque into a thread-private dict, then pickles that
  private snapshot unlocked.

pickle.dumps only ever iterates the frame-local snapshot, so a concurrent
_varUpdate (appending to the deque) can no longer mutate the object being
serialized -- "dictionary changed size during iteration" becomes structurally
impossible rather than lock-guarded. Published payloads are unchanged
(last-write-wins per path, first-occurrence key order). The existing
multithread race tests remain valid and pass; the slow-pickle reproducer is
deterministically green.

stream.interfaces.Variable is left untouched (PyYAML's represent_mapping
snapshots via list(mapping.items()), so it was never exploitable).
…imeout(warnTime, failTime)

Collapse the three coupled recv-retry knobs introduced in #1237 into a
warn/fail time pair that mirrors the memory Transaction model:

  void setTimeout(uint32_t warnTime, uint32_t failTime = 0);

- warnTime drives ZMQ_RCVTIMEO and the periodic "still waiting" warning.
- failTime is an absolute wait deadline (ms); the recv loops in send()/
  sendString() throw once the accumulated wait reaches it. failTime == 0
  (default) installs no deadline and retries forever, preserving the
  historic #1236 contract.
- stop() now sets an atomic stopping_ sentinel to break an in-flight
  forever loop, replacing the old waitRetry_ = false unblock.

The Python binding rejects bool arguments with a TypeError so stale
(msecs, waitRetry) callers fail loudly instead of silently coercing
True to failTime = 1.

_Virtual.py: connection test (1000, False) -> (1000, 1000); operational
(1000, True) -> (1000, 0). Tests updated to the warn/fail contract.
…ock_

Address PR review feedback:

- setTimeout() now rejects warnTime == 0. A 0 ms warn period sets
  ZMQ_RCVTIMEO to 0 (non-blocking), so the send()/sendString() recv loop
  would busy-spin without ever advancing toward failTime.
- stop() acquires reqLock_ before closing zmqReq_ and destroying the
  context, serializing teardown against an in-flight send()/sendString().
  The stopping_ sentinel (set first) guarantees an in-flight loop releases
  reqLock_ within one RCVTIMEO period, so this cannot deadlock.

Add a regression test for the zero warnTime rejection.
Add deleted setTimeout(uint32_t, bool) and
setTimeout(uint32_t, bool, uint32_t) overloads so legacy C++ calls like
setTimeout(1000, true) fail to compile instead of silently converting the
bool to uint32_t and setting failTime = 1 ms. The Python binding already
rejects bool arguments at runtime via the setTimeoutPy wrapper.
…y CI discovery

Build every p4p client bound directly to the running EpicsPvServer
instead of a bare Context('pva') that relies on UDP broadcast/search
discovery. That discovery path is unreliable over loopback on CI
runners (the same limitation that hard-skips this test on macOS ARM64),
causing intermittent 'last=unavailable' timeouts.

make_pva_client() reads the server's actual bound TCP port from
Server.conf() and points the client at it via EPICS_PVA_NAME_SERVERS,
which performs the channel search over a direct TCP connection with no
UDP involved. AUTO_ADDR_LIST=NO, an empty ADDR_LIST, and useenv=False
keep ambient EPICS_* settings from re-introducing UDP discovery.
Reading the real port also tolerates the EPICS default ports being
unbindable on CI, where the server falls back to an ephemeral port.

A fresh client is built per server instance; the main client and both
RPC clients use context managers so each is closed before its server is
torn down. The helper raises a clear RuntimeError (converted to a
pytest.skip) if the server has not started or exposes no server port.

Test-only change; no production code, public API, or documentation is
affected.
fix(tests): connect epicsV4 PVA client to server TCP port to fix flaky CI discovery
ZmqClient: replace setTimeout(msecs, waitRetry, maxRetries) with setTimeout(warnTime, failTime)
fix(zmq): replace ZmqServer update lock with a lock-free deque
Fix RemoteCommand, RemoteVariable.post, and Device block ops to wait for transaction completion
Document agent build environment guidance
@ruck314 ruck314 requested review from bengineerd and slacrherbst June 6, 2026 19:38
)

Block::getBytes and Block::setBytes released the Python GIL unconditionally on
every call. During a bulk update-queue drain (~42k staged variable updates)
that is ~42k PyEval_SaveThread/PyEval_RestoreThread cycles; under GIL
contention each hand-off costs milliseconds, stalling the drain for tens of
seconds -- the v6 regression reported in #1262.

The method bodies are pure in-memory work (memcpy/copyBits/byte-reverse) with
no Python calls, so the GIL only needs to be released while waiting on a
contended block mutex -- the case that could otherwise deadlock against a
memory-transaction worker holding the mutex while needing the GIL for a Python
callback. Acquire the mutex with try_to_lock and drop the GIL only on the
contended fallback path, keeping the GIL on the uncontended fast path.

Adds a perf reproduction (tests/perf) driving a 42k staged set/get drain under
GIL contention, and a deterministic round-trip correctness test (tests/core)
covering the scalar, list, and byte-reversed paths plus the try_to_lock
fallback branch.
@ruck314 ruck314 marked this pull request as ready for review June 15, 2026 15:59
fix(memory): make Block getBytes/setBytes GIL release conditional
@ruck314 ruck314 merged commit 263dace into main Jun 15, 2026
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants