Release Candidate v6.14.0#1261
Merged
Merged
Conversation
…erage Adds RoCEv2 (RDMA over Converged Ethernet v2) as a first-class Rogue protocol and lifts CI line coverage on both the C++ core (src/rogue/protocols/rocev2/) and the Python control layer (python/pyrogue/protocols/_RoCEv2.py) to >=80% by exercising real libibverbs against a Linux Soft-RoCE (rdma_rxe) device — not mocks. RoCEv2 module - C++ Core / Server in include/rogue/protocols/rocev2/ and src/rogue/protocols/rocev2/, wired into CMake, conda, and the Python module graph - Python RoCEv2Server device (python/pyrogue/protocols/_RoCEv2.py) with 15 LocalVariables, getChannel() (0..255 range validated), stream property, teardownFpgaQp(), and automatic FPGA-side QP teardown on _stop() - Replaced opt-in -DROCEV2=ON with ROGUE_LINUX_BUILD; RoCEv2 ships unconditionally on Linux, macOS unaffected, libibverbs discovery gated to Linux only Hardening - Partial-setup rollback: PD/MR/QP torn down if any stage of _roce_setup_connection fails, then the original error re-raised - Random PSN seeding via std::mt19937 + std::random_device for RC QPs - pmtu validation in Server::completeConnection and Python _start() - FpgaQpn / FpgaLkey LocalVariables cleared on teardown - runThread slot-index bounds guarded; MR slab-overflow guarded; stale completion tags cleaned - min_rnr_timer passed on RTR transition; RNR parameters documented - _wait_resp zero-pulse polling window shortened from 200 ms to 20 ms - Core/Server ctor initializer list order fixed; -Wunused-parameter silenced in Server::acceptFrame; Server POD members explicitly initialised - Core/Server ownership and destructor lifecycle documented in Sphinx docs, C++ headers, and Python docstrings; Server::log_ shadow and inheritance chain documented - ImportError text aligned with the new Linux build - All-zero GID rejection: ibv_query_gid returns rc=0 on rdma_rxe for any gidIndex within gid_tbl_len (1024 on RXE), silently writing an all-zero GID for unpopulated slots, which then fails opaquely at QP->RTR. Validate non-zero GID and throw with the gidIndex and device name pinned in the message - File-header Description: lines trimmed to slaclab convention (short headline), matching the surrounding codebase; full narrative lives in Sphinx docs and class-level comments CI coverage uplift - full_build_test pinned to ubuntu-24.04 brings up rdma_rxe / rxe0 so hardware-gated tests run against real ibverbs - ROGUE_COVERAGE CMake option applies --coverage PRIVATE to rogue-core and rogue-core-static, producing gcov data for the rocev2 subtree - gcovr Cobertura report filtered to the rocev2 subtree, uploaded via codecov-action@v5 alongside the Python pytest-cov report - Reusable CI shell scripts extracted from rogue_ci.yml; check_whitespace.sh scoped to tracked files only Tests - tests/protocols/test_rocev2.py: 25 hardware-free + 7 hardware-gated tests covering encoders/decoders, metadata-bus round-trips, channel wiring, teardown, and live-rxe0 setup/teardown flows - tests/protocols/test_rocev2_corner_cases.py: 79 MagicMock corner-case tests covering decoder/encoder completeness (resp_type 0..3, QP state nibble 0..15, _mk_bus 301-bit boundary, field masking and ordering, CREATE/MODIFY/DESTROY req_type), partial-setup rollback (PD, MR resp-type mismatch, QP-create, QP->INIT ok=false and wrong-state, QP->RTR, QP->RTS, hardware-free happy path), teardown step-timeout parametrics plus all-timeout pathological case, _wait_resp edges, server lifecycle (pmtu=1/5 boundary, teardownFpgaQp qpn=0 no-op, exception swallow+log, _stop call order, .stream identity), GID edges (0.0.0.0, 255.255.255.255, invalid IPv4), getChannel distinct-filter semantics, illegal QP transitions, length-bound edges, channel-id bounds - tests/protocols/conftest.py: shared metadata-bus response-word builders (_pack_*_ok / _pack_*_fail / _pack_unknown_type), session-scoped rxe_server fixture, wait_for / MemoryRoot re-exports; mocked_rocev2_cpp fixture uses pytest.importorskip so the six fixture-consuming tests skip cleanly on macOS runners where the C++ submodule is gated behind ROGUE_LINUX_BUILD - tests/cpp/protocols/rocev2/: 4 doctest cases for Core/Server with RAII wrappers, including 2 live-rxe0 cases and 2 GeneralError / out-of-range cases - 1 test is currently @pytest.mark.skip(...) pending follow-up (the live-server init flow); the 3 earlier corner-case deferrals were resolved Known limitations (pinned by tests, not fixed here) - 32-bit silent truncation in _mk_bus / _encode_alloc_mr: MR length >= 2**32 is masked to zero by the (1 << width) - 1 field mask. The FPGA-side RoceEngine protocol is 32-bit and the production path stays four orders of magnitude below the 4 GiB boundary. - Python does not enforce the QP state machine: illegal transitions are detected only via the FPGA's decoded response (success=False plus the current state). The firmware is the authoritative state machine; a Python mirror would duplicate its enforcement and risk drifting out of sync. Stability fixes (surfaced while stabilising CI, unrelated to RoCEv2) - UDP packetizer integration test: enlarged SO_RCVBUF and widened the stall window for macOS arm64 - Stream-variable test: settled the enable-frame burst before sampling initial_count - RSSI retran timeout bumped on the loopback clean-path and UDP-packetizer loopback tests to absorb GHA runner jitter Sphinx docs - 8 new .rst files integrated into the existing protocol toctrees alongside RSSI/SRP/Packetizer - make html builds clean with zero RoCEv2-related warnings - Jumbo-frame MTU, SoftRoCE rdma_rxe workflow, and surf RoceEngine dependency documented No existing protocol APIs, stream/memory interfaces, or PyRogue device tree are changed — RoCEv2 itself is purely additive; the only behavioural change to existing code is enabling --coverage when ROGUE_COVERAGE=ON. C++ and Python source ported from FilMarini/rogue:rocev2. Attribution comments preserved in each ported file pointing back to FilMarini/rogue:rocev2. Slaclab standards take precedence over API compatibility; deltas documented in module docstrings.
Corner case #1 (MR length overflow): add tests for the new _start() guard (test_start_rejects_mr_length_overflow, parametric accept-side boundary), update encoder-level test docstrings to note the upstream guard, update resource-bounds section comment with Part D. Corner case #2 (QP state machine): update comment block with full design rationale — FPGA is single source of truth, no Python-side mirror added. Add "Reprogram the FPGA" assertions to PD/MR/QP failure tests to pin the new stale-resource error messages. Add Troubleshooting section to RoCEv2 Sphinx docs covering stale-resource errors, MR length guard, and QP state-machine enforcement.
Non-jumbo MTU (1500) splits each 2048-byte frame into 2 RSSI segments vs 1 for jumbo, doubling the segment traffic through the out-of-order stage. With period=10, this overwhelms RSSI's retransmit machinery on macOS arm64 CI runners under pytest-xdist contention, exhausting the retransmit budget and killing the session before any application frames are delivered (stall at 0/200). Use period=50 for non-jumbo links (8 reordered segments out of 400) which still validates out-of-order handling without exceeding the RSSI window budget. Jumbo links keep period=10 (20 reordered out of 200).
Override XDG_SESSION_TYPE to x11 before Qt initialization so Qt does not emit the noisy "Ignoring XDG_SESSION_TYPE=wayland on Gnome" warning. The override is conditional and only fires on Wayland sessions.
…mpatibility SqlReader used deprecated 1.x APIs (bound MetaData, autoload=True, list-wrapped select) and referenced an undefined self._conn attribute. Replace with 2.x-compatible patterns (unbound MetaData, autoload_with, engine.connect() context managers) that also work on 1.4+. Add round-trip tests that write via SqlLogger and read via SqlReader.
…1/V2 SSI ordering - Application / Transport: raw std::thread* member -> std::unique_ptr<std::thread>; remove manual new / delete. - setController(): wrap std::thread spawn in try/catch; roll back threadEn_ and cntl_ if construction throws so the instance stays in its retryable pre-call state. Idempotent guard so a second call is a no-op (without it, replacing a still-joinable unique_ptr<std::thread> would invoke std::terminate via ~thread()). - ControllerV1: move SSI setError(0x80) to BEFORE pushFrame() and tranFrame_[0].reset(), and apply it to tranFrame_[0] (the assembled frame) instead of tranFrame_[tmpDest]. Original code dereferenced a null shared_ptr for any tmpDest != 0, and silently dropped the SSI flag for tmpDest == 0. - ControllerV2: same setError ordering fix - apply before push + reset so the downstream slave observes the error flag instead of the call hitting a just-reset (null) slot.
…s atomic Concurrent first-callers could observe partially-initialised state. Declares Version::init()'s statics atomic and guards the one-time initialisation with std::call_once.
Remove tests/protocols/test_batcher_audit_repros.py — the file walked InverterV1.cpp / InverterV2.cpp source text looking for bounds-check tokens before the memcpy, rather than functionally exercising the patched code paths. Per maintainer feedback that style of test is too brittle to be useful as CI; the audit/guardrail intent is better expressed as an AI agent directive at PR-review time.
…for transaction completion pr.startTransaction() accepts a `check` keyword argument; when True, the underlying rim::Block::startTransaction calls checkTransaction() to wait for the slave to acknowledge the transaction before returning. Several callers in pyrogue passed `checkEach=True` instead, which is silently dropped into the function's **kwargs (the docstring marks them "unused"). The wait was therefore disabled, allowing back-to-back operations (notably BaseCommand.toggle which issues set(1); set(0)) to race the shared blockData_ buffer: the second caller mutates the buffer before the slave's worker has consumed the data pointer for the first transaction, the slave observes the post-mutation value for both, and the hardware never sees the rising edge. In the field this manifests as "status counter reset never fires" on PGPv4 and any other RemoteCommand that uses BaseCommand.toggle against an asynchronous memory slave (SRP-over-UDP/TCP, network bridges, DMA-backed AXI masters). The typo was introduced by cff241e (2020-08-11, "Fix command and poll queue") on RemoteCommand.set/get, and the same pattern was cloned into RemoteVariable.post and Device.{write,read,verify}Blocks. The block-list helpers in _Block.py (writeBlocks, readBlocks, verifyBlocks, writeAndVerifyBlocks, readAndCheckBlocks) correctly translate `checkEach=` to `check=` and are not affected. Fixed call sites (all switched from `checkEach=` to `check=` when calling pr.startTransaction): _Command.py:557 RemoteCommand.set _Command.py:583 RemoteCommand.get _Variable.py:1636 RemoteVariable.post _Device.py:534 Device.writeBlocks variable path _Device.py:539 Device.writeBlocks block-iteration path _Device.py:569 Device.verifyBlocks variable path _Device.py:574 Device.verifyBlocks block-iteration path _Device.py:607 Device.readBlocks variable path _Device.py:612 Device.readBlocks block-iteration path Regression coverage: tests/core/test_transaction_check_kwarg_race.py adds 9 contract-level tests (one per fixed site, asserting the correct `check=` kwarg propagates to pr.startTransaction) and one end-to-end behavioral test that exercises BaseCommand.toggle against an asynchronous in-memory slave. The behavioral test mimics real network/DMA slaves by deferring transaction.getData() to a worker thread; with the fix the slave records [(addr, 1), (addr, 0)], without the fix the slave records [(addr, 0), (addr, 0)]. An existing test in tests/core/test_core_structure_edge_cases.py was reading `kwargs.get("checkEach")` in its monkeypatch of pr.startTransaction, which silently captured None regardless of the user-supplied checkEach value (it was implicitly validating the bug). Updated to read `kwargs.get("check")` so it correctly verifies that Device.forceCheckEach=True propagates to the transaction layer.
…with `wait` - Introduced `_warn_deprecated` and `_resolve_deprecated_bool` helper functions to manage deprecated argument warnings and resolve conflicts between new and old argument names. - Updated `BaseVariable`, `RemoteVariable`, and `LocalVariable` classes to replace `check` with `wait` in method signatures and implementations. - Modified transaction-related methods in `PollQueue`, `Root`, and other classes to use `wait` instead of `check`. - Updated tests to ensure compatibility with the new `wait` argument and verify that deprecated `check` arguments trigger appropriate warnings. - Ensured that all relevant documentation and comments reflect the changes from `check` to `wait`.
ci: bump GitHub Actions to latest majors (Node 20 deprecation)
… and Block operations
…entation and code comments
Add RoCEv2 protocol module with Soft-RoCE CI coverage and hardening
…lab/rogue into RemoteCommand-checkEach-fix
PR #1232 (ESROGUE-711) fixed _varDone() pickling the live _updateList dict by wrapping _varUpdate/_varDone in a threading.Lock plus a snapshot-and-swap. The per-variable lock on _varUpdate is the hot path (one call per variable per update group, on the already-serializing ZMQ flush), which drew an "overly aggressive locking" concern on the merge. Replace the shared dict + lock with a collections.deque: - _varUpdate appends (path, value) -- a single GIL-atomic, lock-free op. - _varDone drains the deque into a thread-private dict, then pickles that private snapshot unlocked. pickle.dumps only ever iterates the frame-local snapshot, so a concurrent _varUpdate (appending to the deque) can no longer mutate the object being serialized -- "dictionary changed size during iteration" becomes structurally impossible rather than lock-guarded. Published payloads are unchanged (last-write-wins per path, first-occurrence key order). The existing multithread race tests remain valid and pass; the slow-pickle reproducer is deterministically green. stream.interfaces.Variable is left untouched (PyYAML's represent_mapping snapshots via list(mapping.items()), so it was never exploitable).
…imeout(warnTime, failTime) Collapse the three coupled recv-retry knobs introduced in #1237 into a warn/fail time pair that mirrors the memory Transaction model: void setTimeout(uint32_t warnTime, uint32_t failTime = 0); - warnTime drives ZMQ_RCVTIMEO and the periodic "still waiting" warning. - failTime is an absolute wait deadline (ms); the recv loops in send()/ sendString() throw once the accumulated wait reaches it. failTime == 0 (default) installs no deadline and retries forever, preserving the historic #1236 contract. - stop() now sets an atomic stopping_ sentinel to break an in-flight forever loop, replacing the old waitRetry_ = false unblock. The Python binding rejects bool arguments with a TypeError so stale (msecs, waitRetry) callers fail loudly instead of silently coercing True to failTime = 1. _Virtual.py: connection test (1000, False) -> (1000, 1000); operational (1000, True) -> (1000, 0). Tests updated to the warn/fail contract.
…ock_ Address PR review feedback: - setTimeout() now rejects warnTime == 0. A 0 ms warn period sets ZMQ_RCVTIMEO to 0 (non-blocking), so the send()/sendString() recv loop would busy-spin without ever advancing toward failTime. - stop() acquires reqLock_ before closing zmqReq_ and destroying the context, serializing teardown against an in-flight send()/sendString(). The stopping_ sentinel (set first) guarantees an in-flight loop releases reqLock_ within one RCVTIMEO period, so this cannot deadlock. Add a regression test for the zero warnTime rejection.
Add deleted setTimeout(uint32_t, bool) and setTimeout(uint32_t, bool, uint32_t) overloads so legacy C++ calls like setTimeout(1000, true) fail to compile instead of silently converting the bool to uint32_t and setting failTime = 1 ms. The Python binding already rejects bool arguments at runtime via the setTimeoutPy wrapper.
…y CI discovery
Build every p4p client bound directly to the running EpicsPvServer
instead of a bare Context('pva') that relies on UDP broadcast/search
discovery. That discovery path is unreliable over loopback on CI
runners (the same limitation that hard-skips this test on macOS ARM64),
causing intermittent 'last=unavailable' timeouts.
make_pva_client() reads the server's actual bound TCP port from
Server.conf() and points the client at it via EPICS_PVA_NAME_SERVERS,
which performs the channel search over a direct TCP connection with no
UDP involved. AUTO_ADDR_LIST=NO, an empty ADDR_LIST, and useenv=False
keep ambient EPICS_* settings from re-introducing UDP discovery.
Reading the real port also tolerates the EPICS default ports being
unbindable on CI, where the server falls back to an ephemeral port.
A fresh client is built per server instance; the main client and both
RPC clients use context managers so each is closed before its server is
torn down. The helper raises a clear RuntimeError (converted to a
pytest.skip) if the server has not started or exposes no server port.
Test-only change; no production code, public API, or documentation is
affected.
fix(tests): connect epicsV4 PVA client to server TCP port to fix flaky CI discovery
ZmqClient: replace setTimeout(msecs, waitRetry, maxRetries) with setTimeout(warnTime, failTime)
fix(zmq): replace ZmqServer update lock with a lock-free deque
Fix RemoteCommand, RemoteVariable.post, and Device block ops to wait for transaction completion
Document agent build environment guidance
) Block::getBytes and Block::setBytes released the Python GIL unconditionally on every call. During a bulk update-queue drain (~42k staged variable updates) that is ~42k PyEval_SaveThread/PyEval_RestoreThread cycles; under GIL contention each hand-off costs milliseconds, stalling the drain for tens of seconds -- the v6 regression reported in #1262. The method bodies are pure in-memory work (memcpy/copyBits/byte-reverse) with no Python calls, so the GIL only needs to be released while waiting on a contended block mutex -- the case that could otherwise deadlock against a memory-transaction worker holding the mutex while needing the GIL for a Python callback. Acquire the mutex with try_to_lock and drop the GIL only on the contended fallback path, keeping the GIL on the uncontended fast path. Adds a perf reproduction (tests/perf) driving a 42k staged set/get drain under GIL contention, and a deterministic round-trip correctness test (tests/core) covering the scalar, list, and byte-reversed paths plus the try_to_lock fallback branch.
fix(memory): make Block getBytes/setBytes GIL release conditional
slacrherbst
approved these changes
Jun 15, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description