Release Candidate v6.14.0 by ruck314 · Pull Request #1261 · slaclab/rogue

ruck314 · 2026-06-06T19:38:06Z

Description

…erage Adds RoCEv2 (RDMA over Converged Ethernet v2) as a first-class Rogue protocol and lifts CI line coverage on both the C++ core (src/rogue/protocols/rocev2/) and the Python control layer (python/pyrogue/protocols/_RoCEv2.py) to >=80% by exercising real libibverbs against a Linux Soft-RoCE (rdma_rxe) device — not mocks. RoCEv2 module - C++ Core / Server in include/rogue/protocols/rocev2/ and src/rogue/protocols/rocev2/, wired into CMake, conda, and the Python module graph - Python RoCEv2Server device (python/pyrogue/protocols/_RoCEv2.py) with 15 LocalVariables, getChannel() (0..255 range validated), stream property, teardownFpgaQp(), and automatic FPGA-side QP teardown on _stop() - Replaced opt-in -DROCEV2=ON with ROGUE_LINUX_BUILD; RoCEv2 ships unconditionally on Linux, macOS unaffected, libibverbs discovery gated to Linux only Hardening - Partial-setup rollback: PD/MR/QP torn down if any stage of _roce_setup_connection fails, then the original error re-raised - Random PSN seeding via std::mt19937 + std::random_device for RC QPs - pmtu validation in Server::completeConnection and Python _start() - FpgaQpn / FpgaLkey LocalVariables cleared on teardown - runThread slot-index bounds guarded; MR slab-overflow guarded; stale completion tags cleaned - min_rnr_timer passed on RTR transition; RNR parameters documented - _wait_resp zero-pulse polling window shortened from 200 ms to 20 ms - Core/Server ctor initializer list order fixed; -Wunused-parameter silenced in Server::acceptFrame; Server POD members explicitly initialised - Core/Server ownership and destructor lifecycle documented in Sphinx docs, C++ headers, and Python docstrings; Server::log_ shadow and inheritance chain documented - ImportError text aligned with the new Linux build - All-zero GID rejection: ibv_query_gid returns rc=0 on rdma_rxe for any gidIndex within gid_tbl_len (1024 on RXE), silently writing an all-zero GID for unpopulated slots, which then fails opaquely at QP->RTR. Validate non-zero GID and throw with the gidIndex and device name pinned in the message - File-header Description: lines trimmed to slaclab convention (short headline), matching the surrounding codebase; full narrative lives in Sphinx docs and class-level comments CI coverage uplift - full_build_test pinned to ubuntu-24.04 brings up rdma_rxe / rxe0 so hardware-gated tests run against real ibverbs - ROGUE_COVERAGE CMake option applies --coverage PRIVATE to rogue-core and rogue-core-static, producing gcov data for the rocev2 subtree - gcovr Cobertura report filtered to the rocev2 subtree, uploaded via codecov-action@v5 alongside the Python pytest-cov report - Reusable CI shell scripts extracted from rogue_ci.yml; check_whitespace.sh scoped to tracked files only Tests - tests/protocols/test_rocev2.py: 25 hardware-free + 7 hardware-gated tests covering encoders/decoders, metadata-bus round-trips, channel wiring, teardown, and live-rxe0 setup/teardown flows - tests/protocols/test_rocev2_corner_cases.py: 79 MagicMock corner-case tests covering decoder/encoder completeness (resp_type 0..3, QP state nibble 0..15, _mk_bus 301-bit boundary, field masking and ordering, CREATE/MODIFY/DESTROY req_type), partial-setup rollback (PD, MR resp-type mismatch, QP-create, QP->INIT ok=false and wrong-state, QP->RTR, QP->RTS, hardware-free happy path), teardown step-timeout parametrics plus all-timeout pathological case, _wait_resp edges, server lifecycle (pmtu=1/5 boundary, teardownFpgaQp qpn=0 no-op, exception swallow+log, _stop call order, .stream identity), GID edges (0.0.0.0, 255.255.255.255, invalid IPv4), getChannel distinct-filter semantics, illegal QP transitions, length-bound edges, channel-id bounds - tests/protocols/conftest.py: shared metadata-bus response-word builders (_pack_*_ok / _pack_*_fail / _pack_unknown_type), session-scoped rxe_server fixture, wait_for / MemoryRoot re-exports; mocked_rocev2_cpp fixture uses pytest.importorskip so the six fixture-consuming tests skip cleanly on macOS runners where the C++ submodule is gated behind ROGUE_LINUX_BUILD - tests/cpp/protocols/rocev2/: 4 doctest cases for Core/Server with RAII wrappers, including 2 live-rxe0 cases and 2 GeneralError / out-of-range cases - 1 test is currently @pytest.mark.skip(...) pending follow-up (the live-server init flow); the 3 earlier corner-case deferrals were resolved Known limitations (pinned by tests, not fixed here) - 32-bit silent truncation in _mk_bus / _encode_alloc_mr: MR length >= 2**32 is masked to zero by the (1 << width) - 1 field mask. The FPGA-side RoceEngine protocol is 32-bit and the production path stays four orders of magnitude below the 4 GiB boundary. - Python does not enforce the QP state machine: illegal transitions are detected only via the FPGA's decoded response (success=False plus the current state). The firmware is the authoritative state machine; a Python mirror would duplicate its enforcement and risk drifting out of sync. Stability fixes (surfaced while stabilising CI, unrelated to RoCEv2) - UDP packetizer integration test: enlarged SO_RCVBUF and widened the stall window for macOS arm64 - Stream-variable test: settled the enable-frame burst before sampling initial_count - RSSI retran timeout bumped on the loopback clean-path and UDP-packetizer loopback tests to absorb GHA runner jitter Sphinx docs - 8 new .rst files integrated into the existing protocol toctrees alongside RSSI/SRP/Packetizer - make html builds clean with zero RoCEv2-related warnings - Jumbo-frame MTU, SoftRoCE rdma_rxe workflow, and surf RoceEngine dependency documented No existing protocol APIs, stream/memory interfaces, or PyRogue device tree are changed — RoCEv2 itself is purely additive; the only behavioural change to existing code is enabling --coverage when ROGUE_COVERAGE=ON. C++ and Python source ported from FilMarini/rogue:rocev2. Attribution comments preserved in each ported file pointing back to FilMarini/rogue:rocev2. Slaclab standards take precedence over API compatibility; deltas documented in module docstrings.

Corner case #1 (MR length overflow): add tests for the new _start() guard (test_start_rejects_mr_length_overflow, parametric accept-side boundary), update encoder-level test docstrings to note the upstream guard, update resource-bounds section comment with Part D. Corner case #2 (QP state machine): update comment block with full design rationale — FPGA is single source of truth, no Python-side mirror added. Add "Reprogram the FPGA" assertions to PD/MR/QP failure tests to pin the new stale-resource error messages. Add Troubleshooting section to RoCEv2 Sphinx docs covering stale-resource errors, MR length guard, and QP state-machine enforcement.

Non-jumbo MTU (1500) splits each 2048-byte frame into 2 RSSI segments vs 1 for jumbo, doubling the segment traffic through the out-of-order stage. With period=10, this overwhelms RSSI's retransmit machinery on macOS arm64 CI runners under pytest-xdist contention, exhausting the retransmit budget and killing the session before any application frames are delivered (stall at 0/200). Use period=50 for non-jumbo links (8 reordered segments out of 400) which still validates out-of-order handling without exceeding the RSSI window budget. Jumbo links keep period=10 (20 reordered out of 200).

…Frame

Override XDG_SESSION_TYPE to x11 before Qt initialization so Qt does not emit the noisy "Ignoring XDG_SESSION_TYPE=wayland on Gnome" warning. The override is conditional and only fires on Wayland sessions.

…mpatibility SqlReader used deprecated 1.x APIs (bound MetaData, autoload=True, list-wrapped select) and referenced an undefined self._conn attribute. Replace with 2.x-compatible patterns (unbound MetaData, autoload_with, engine.connect() context managers) that also work on 1.4+. Add round-trip tests that write via SqlLogger and read via SqlReader.

…ram in SqlReader

…1/V2 SSI ordering - Application / Transport: raw std::thread* member -> std::unique_ptr<std::thread>; remove manual new / delete. - setController(): wrap std::thread spawn in try/catch; roll back threadEn_ and cntl_ if construction throws so the instance stays in its retryable pre-call state. Idempotent guard so a second call is a no-op (without it, replacing a still-joinable unique_ptr<std::thread> would invoke std::terminate via ~thread()). - ControllerV1: move SSI setError(0x80) to BEFORE pushFrame() and tranFrame_[0].reset(), and apply it to tranFrame_[0] (the assembled frame) instead of tranFrame_[tmpDest]. Original code dereferenced a null shared_ptr for any tmpDest != 0, and silently dropped the SSI flag for tmpDest == 0. - ControllerV2: same setError ordering fix - apply before push + reset so the downstream slave observes the error flag instead of the call hitting a just-reset (null) slot.

…s atomic Concurrent first-callers could observe partially-initialised state. Declares Version::init()'s statics atomic and guards the one-time initialisation with std::call_once.

Remove tests/protocols/test_batcher_audit_repros.py — the file walked InverterV1.cpp / InverterV2.cpp source text looking for bounds-check tokens before the memcpy, rather than functionally exercising the patched code paths. Per maintainer feedback that style of test is too brittle to be useful as CI; the audit/guardrail intent is better expressed as an AI agent directive at PR-review time.

…for transaction completion pr.startTransaction() accepts a `check` keyword argument; when True, the underlying rim::Block::startTransaction calls checkTransaction() to wait for the slave to acknowledge the transaction before returning. Several callers in pyrogue passed `checkEach=True` instead, which is silently dropped into the function's **kwargs (the docstring marks them "unused"). The wait was therefore disabled, allowing back-to-back operations (notably BaseCommand.toggle which issues set(1); set(0)) to race the shared blockData_ buffer: the second caller mutates the buffer before the slave's worker has consumed the data pointer for the first transaction, the slave observes the post-mutation value for both, and the hardware never sees the rising edge. In the field this manifests as "status counter reset never fires" on PGPv4 and any other RemoteCommand that uses BaseCommand.toggle against an asynchronous memory slave (SRP-over-UDP/TCP, network bridges, DMA-backed AXI masters). The typo was introduced by cff241e (2020-08-11, "Fix command and poll queue") on RemoteCommand.set/get, and the same pattern was cloned into RemoteVariable.post and Device.{write,read,verify}Blocks. The block-list helpers in _Block.py (writeBlocks, readBlocks, verifyBlocks, writeAndVerifyBlocks, readAndCheckBlocks) correctly translate `checkEach=` to `check=` and are not affected. Fixed call sites (all switched from `checkEach=` to `check=` when calling pr.startTransaction): _Command.py:557 RemoteCommand.set _Command.py:583 RemoteCommand.get _Variable.py:1636 RemoteVariable.post _Device.py:534 Device.writeBlocks variable path _Device.py:539 Device.writeBlocks block-iteration path _Device.py:569 Device.verifyBlocks variable path _Device.py:574 Device.verifyBlocks block-iteration path _Device.py:607 Device.readBlocks variable path _Device.py:612 Device.readBlocks block-iteration path Regression coverage: tests/core/test_transaction_check_kwarg_race.py adds 9 contract-level tests (one per fixed site, asserting the correct `check=` kwarg propagates to pr.startTransaction) and one end-to-end behavioral test that exercises BaseCommand.toggle against an asynchronous in-memory slave. The behavioral test mimics real network/DMA slaves by deferring transaction.getData() to a worker thread; with the fix the slave records [(addr, 1), (addr, 0)], without the fix the slave records [(addr, 0), (addr, 0)]. An existing test in tests/core/test_core_structure_edge_cases.py was reading `kwargs.get("checkEach")` in its monkeypatch of pr.startTransaction, which silently captured None regardless of the user-supplied checkEach value (it was implicitly validating the bug). Updated to read `kwargs.get("check")` so it correctly verifies that Device.forceCheckEach=True propagates to the transaction layer.

…ranch

…with `wait` - Introduced `_warn_deprecated` and `_resolve_deprecated_bool` helper functions to manage deprecated argument warnings and resolve conflicts between new and old argument names. - Updated `BaseVariable`, `RemoteVariable`, and `LocalVariable` classes to replace `check` with `wait` in method signatures and implementations. - Modified transaction-related methods in `PollQueue`, `Root`, and other classes to use `wait` instead of `check`. - Updated tests to ensure compatibility with the new `wait` argument and verify that deprecated `check` arguments trigger appropriate warnings. - Ensured that all relevant documentation and comments reflect the changes from `check` to `wait`.

ci: bump GitHub Actions to latest majors (Node 20 deprecation)

… and Block operations

…entation and code comments

Add RoCEv2 protocol module with Soft-RoCE CI coverage and hardening

…lab/rogue into RemoteCommand-checkEach-fix

PR #1232 (ESROGUE-711) fixed _varDone() pickling the live _updateList dict by wrapping _varUpdate/_varDone in a threading.Lock plus a snapshot-and-swap. The per-variable lock on _varUpdate is the hot path (one call per variable per update group, on the already-serializing ZMQ flush), which drew an "overly aggressive locking" concern on the merge. Replace the shared dict + lock with a collections.deque: - _varUpdate appends (path, value) -- a single GIL-atomic, lock-free op. - _varDone drains the deque into a thread-private dict, then pickles that private snapshot unlocked. pickle.dumps only ever iterates the frame-local snapshot, so a concurrent _varUpdate (appending to the deque) can no longer mutate the object being serialized -- "dictionary changed size during iteration" becomes structurally impossible rather than lock-guarded. Published payloads are unchanged (last-write-wins per path, first-occurrence key order). The existing multithread race tests remain valid and pass; the slow-pickle reproducer is deterministically green. stream.interfaces.Variable is left untouched (PyYAML's represent_mapping snapshots via list(mapping.items()), so it was never exploitable).

…imeout(warnTime, failTime) Collapse the three coupled recv-retry knobs introduced in #1237 into a warn/fail time pair that mirrors the memory Transaction model: void setTimeout(uint32_t warnTime, uint32_t failTime = 0); - warnTime drives ZMQ_RCVTIMEO and the periodic "still waiting" warning. - failTime is an absolute wait deadline (ms); the recv loops in send()/ sendString() throw once the accumulated wait reaches it. failTime == 0 (default) installs no deadline and retries forever, preserving the historic #1236 contract. - stop() now sets an atomic stopping_ sentinel to break an in-flight forever loop, replacing the old waitRetry_ = false unblock. The Python binding rejects bool arguments with a TypeError so stale (msecs, waitRetry) callers fail loudly instead of silently coercing True to failTime = 1. _Virtual.py: connection test (1000, False) -> (1000, 1000); operational (1000, True) -> (1000, 0). Tests updated to the warn/fail contract.

…ock_ Address PR review feedback: - setTimeout() now rejects warnTime == 0. A 0 ms warn period sets ZMQ_RCVTIMEO to 0 (non-blocking), so the send()/sendString() recv loop would busy-spin without ever advancing toward failTime. - stop() acquires reqLock_ before closing zmqReq_ and destroying the context, serializing teardown against an in-flight send()/sendString(). The stopping_ sentinel (set first) guarantees an in-flight loop releases reqLock_ within one RCVTIMEO period, so this cannot deadlock. Add a regression test for the zero warnTime rejection.

Add deleted setTimeout(uint32_t, bool) and setTimeout(uint32_t, bool, uint32_t) overloads so legacy C++ calls like setTimeout(1000, true) fail to compile instead of silently converting the bool to uint32_t and setting failTime = 1 ms. The Python binding already rejects bool arguments at runtime via the setTimeoutPy wrapper.

…y CI discovery Build every p4p client bound directly to the running EpicsPvServer instead of a bare Context('pva') that relies on UDP broadcast/search discovery. That discovery path is unreliable over loopback on CI runners (the same limitation that hard-skips this test on macOS ARM64), causing intermittent 'last=unavailable' timeouts. make_pva_client() reads the server's actual bound TCP port from Server.conf() and points the client at it via EPICS_PVA_NAME_SERVERS, which performs the channel search over a direct TCP connection with no UDP involved. AUTO_ADDR_LIST=NO, an empty ADDR_LIST, and useenv=False keep ambient EPICS_* settings from re-introducing UDP discovery. Reading the real port also tolerates the EPICS default ports being unbindable on CI, where the server falls back to an ephemeral port. A fresh client is built per server instance; the main client and both RPC clients use context managers so each is closed before its server is torn down. The helper raises a clear RuntimeError (converted to a pytest.skip) if the server has not started or exposes no server port. Test-only change; no production code, public API, or documentation is affected.

fix(tests): connect epicsV4 PVA client to server TCP port to fix flaky CI discovery

ZmqClient: replace setTimeout(msecs, waitRetry, maxRetries) with setTimeout(warnTime, failTime)

fix(zmq): replace ZmqServer update lock with a lock-free deque

Fix RemoteCommand, RemoteVariable.post, and Device block ops to wait for transaction completion

Document agent build environment guidance

) Block::getBytes and Block::setBytes released the Python GIL unconditionally on every call. During a bulk update-queue drain (~42k staged variable updates) that is ~42k PyEval_SaveThread/PyEval_RestoreThread cycles; under GIL contention each hand-off costs milliseconds, stalling the drain for tens of seconds -- the v6 regression reported in #1262. The method bodies are pure in-memory work (memcpy/copyBits/byte-reverse) with no Python calls, so the GIL only needs to be released while waiting on a contended block mutex -- the case that could otherwise deadlock against a memory-transaction worker holding the mutex while needing the GIL for a Python callback. Acquire the mutex with try_to_lock and drop the GIL only on the contended fallback path, keeping the GIL on the uncontended fast path. Adds a perf reproduction (tests/perf) driving a 42k staged set/get drain under GIL contention, and a deterministic round-trip correctness test (tests/core) covering the scalar, list, and byte-reversed paths plus the try_to_lock fallback branch.

fix(memory): make Block getBytes/setBytes GIL release conditional

ruck314 and others added 30 commits April 18, 2026 19:55

Merge branch 'pre-release' into RoCEv2-dev

d279027

Merge branch 'pre-release' into RoCEv2-dev

a9c0ac0

corner case #1 and #2

8a1da55

git co origin/pre-release -- tests/interfaces/test_stream_variable.py

4b5d47e

Merge branch 'pre-release' into RoCEv2-dev

d385712

Merge branch 'pre-release' into RoCEv2-dev

e238ba0

fix(batcher): add frame-payload bounds checks to InverterV1/V2 accept…

b17f9e9

…Frame

Suppress Qt Wayland warning in PyDM launcher

f2cf2f4

Override XDG_SESSION_TYPE to x11 before Qt initialization so Qt does not emit the noisy "Ignoring XDG_SESSION_TYPE=wayland on Gnome" warning. The override is conditional and only fires on Wayland sessions.

fix(pyrogue::interfaces): add engine-None guards and remove unused pa…

33d251e

…ram in SqlReader

fix(version): guard Version::init() with call_once and declare static…

c585057

…s atomic Concurrent first-callers could observe partially-initialised state. Declares Version::init()'s statics atomic and guards the one-time initialisation with std::call_once.

Merge branch 'pre-release' into bug-audit/packetizer

61d097b

Merge branch 'pre-release' into bug-audit/batcher

958f7d8

Merge branch 'pre-release' into bug-audit/version

2b9f7f1

Merge branch 'pre-release' into XDG_SESSION_TYPE

3879558

Merge branch 'pre-release' into SQLAlchemy

28ba628

Update meta.yaml - resolving merge conflict with latest pre-release b…

0de1a51

…ranch

Merge branch 'pre-release' into RoCEv2-dev

8517445

Merge branch 'pre-release' into RemoteCommand-checkEach-fix

314d100

Merge branch 'pre-release' into SQLAlchemy

7e6c6d4

Merge branch 'pre-release' into XDG_SESSION_TYPE

fa40435

Merge branch 'pre-release' into bug-audit/version

1838b30

Merge branch 'pre-release' into bug-audit/batcher

6468ab9

Merge branch 'pre-release' into bug-audit/packetizer

8ad14b6

bengineerd and others added 23 commits June 1, 2026 12:43

Merge pull request #1241 from slaclab/ci/bump-actions-node20

ec07d69

ci: bump GitHub Actions to latest majors (Node 20 deprecation)

Merge branch 'pre-release' into RemoteCommand-checkEach-fix

3fb5315

Merge branch 'pre-release' into RoCEv2-dev

6f9c2cc

Update documentation to clarify memory transaction handling in Device…

53a9061

… and Block operations

Refactor transaction handling to replace check with wait in docum…

e6cd170

…entation and code comments

Merge pull request #1186 from slaclab/RoCEv2-dev

bf6bec6

Add RoCEv2 protocol module with Soft-RoCE CI coverage and hardening

Merge branch 'RemoteCommand-checkEach-fix' of https://github.com/slac…

4dfe9eb

…lab/rogue into RemoteCommand-checkEach-fix

Merge branch 'pre-release' into RemoteCommand-checkEach-fix

fe638c5

Merge pull request #1258 from slaclab/tests/protocols/test_epicsV4/patch

36a2cd9

fix(tests): connect epicsV4 PVA client to server TCP port to fix flaky CI discovery

Merge branch 'pre-release' into zmq-settimeout-warn-fail

4023086

Merge pull request #1257 from slaclab/zmq-settimeout-warn-fail

6d90531

ZmqClient: replace setTimeout(msecs, waitRetry, maxRetries) with setTimeout(warnTime, failTime)

Merge branch 'pre-release' into zmq-updatelist-deque

a7ec8b1

Merge pull request #1256 from slaclab/zmq-updatelist-deque

897e802

fix(zmq): replace ZmqServer update lock with a lock-free deque

Merge branch 'pre-release' into RemoteCommand-checkEach-fix

cb781c1

Merge pull request #1233 from slaclab/RemoteCommand-checkEach-fix

f4d1ccf

Fix RemoteCommand, RemoteVariable.post, and Device block ops to wait for transaction completion

Document agent build environment guidance

285c177

Merge pull request #1260 from slaclab/codex/agent-build-env-guidance

f1f2b3a

Document agent build environment guidance

ruck314 requested review from bengineerd and slacrherbst June 6, 2026 19:38

ruck314 marked this pull request as ready for review June 15, 2026 15:59

Merge pull request #1263 from /issues/1262

b838c82

fix(memory): make Block getBytes/setBytes GIL release conditional

slacrherbst approved these changes Jun 15, 2026

View reviewed changes

ruck314 merged commit 263dace into main Jun 15, 2026
8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Release Candidate v6.14.0#1261

Release Candidate v6.14.0#1261
ruck314 merged 85 commits into
mainfrom
pre-release

ruck314 commented Jun 6, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

ruck314 commented Jun 6, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

ruck314 commented Jun 6, 2026 •

edited

Loading