Skip to content

Release Candidate v6.13.1#1240

Closed
ruck314 wants to merge 28 commits into
mainfrom
pre-release
Closed

Release Candidate v6.13.1#1240
ruck314 wants to merge 28 commits into
mainfrom
pre-release

Conversation

ruck314 and others added 26 commits May 6, 2026 09:18
Override XDG_SESSION_TYPE to x11 before Qt initialization so Qt does not
emit the noisy "Ignoring XDG_SESSION_TYPE=wayland on Gnome" warning.
The override is conditional and only fires on Wayland sessions.
…mpatibility

SqlReader used deprecated 1.x APIs (bound MetaData, autoload=True,
list-wrapped select) and referenced an undefined self._conn attribute.
Replace with 2.x-compatible patterns (unbound MetaData, autoload_with,
engine.connect() context managers) that also work on 1.4+.

Add round-trip tests that write via SqlLogger and read via SqlReader.
…1/V2 SSI ordering

- Application / Transport: raw std::thread* member -> std::unique_ptr<std::thread>;
  remove manual new / delete.
- setController(): wrap std::thread spawn in try/catch; roll back threadEn_ and
  cntl_ if construction throws so the instance stays in its retryable
  pre-call state. Idempotent guard so a second call is a no-op (without it,
  replacing a still-joinable unique_ptr<std::thread> would invoke
  std::terminate via ~thread()).
- ControllerV1: move SSI setError(0x80) to BEFORE pushFrame() and
  tranFrame_[0].reset(), and apply it to tranFrame_[0] (the assembled
  frame) instead of tranFrame_[tmpDest]. Original code dereferenced a
  null shared_ptr for any tmpDest != 0, and silently dropped the SSI
  flag for tmpDest == 0.
- ControllerV2: same setError ordering fix - apply before push + reset
  so the downstream slave observes the error flag instead of the call
  hitting a just-reset (null) slot.
…s atomic

Concurrent first-callers could observe partially-initialised state.
Declares Version::init()'s statics atomic and guards the one-time
initialisation with std::call_once.
Remove tests/protocols/test_batcher_audit_repros.py — the file walked
InverterV1.cpp / InverterV2.cpp source text looking for bounds-check
tokens before the memcpy, rather than functionally exercising the
patched code paths. Per maintainer feedback that style of test is too
brittle to be useful as CI; the audit/guardrail intent is better
expressed as an AI agent directive at PR-review time.
fix(version): Version::init() race fix
fix(pyrogue::interfaces): SQLAlchemy 1.4+/2.x compatibility for SqlReader
Suppress Qt Wayland warning in PyDM launcher
fix(batcher): InverterV1, InverterV2 bounds-check audit fixes + repros
fix(packetizer): RAII worker threads, exception-safe setController, V1/V2 SSI ordering
ruck314 added a commit that referenced this pull request May 28, 2026
Add an opt-in `nonBlocking=True` kwarg on `BaseCommand` so long-running
commands invoked over ZMQ (`SimpleClient.exec` / `VirtualClient`) no
longer tie the client up past `linkTimeout` / `requestStallTimeout`.
`pr.Process` and `Command(nonBlocking=True)` are now the two canonical
patterns for long-running work, documented side-by-side.

Core API (python/pyrogue/_Command.py, python/pyrogue/_Device.py)

- BaseCommand(nonBlocking=True) opt-in kwarg.  Default behaviour
  unchanged -- sync call() still returns the function's value.
- RemoteCommand explicitly forwards nonBlocking to BaseCommand.__init__
  only; it is not in **kwargs and therefore cannot reach
  RemoteVariable.__init__.
- Worker thread per call (daemon=False, name "{path}.worker").
  Auto-injects three sibling LocalVariables on the parent Device:
  {name}Running (bool), {name}Result (last return value, sticky on
  success), {name}Error (last exception string, cleared at dispatch).
- Re-entry guard uses live thread aliveness (self._thread.is_alive()),
  not the _runEn flag, closing the race where _stopWorker cleared the
  flag before joining.
- Sync-path parity: when the callback accepts an arg and this is a
  LocalCommand, the resolved argument is cached into self._default and
  an update is queued before worker dispatch, so cmd.get() reflects
  the last invocation.
- Sibling-injection collision guard: pre-checks parent._nodes for
  {name}Running / {name}Result / {name}Error before any mutation;
  raises pr.NodeError on conflict so a third-sibling collision never
  leaves a half-injected state.
- Device._stop() joins non-blocking command workers BEFORE stopping
  managed interfaces, mirroring Process._stop calling _stopProcess
  first.  Worker functions that touch interfaces or memory during
  teardown stay safe.
- Device._rootAttached snapshots _nodes before iterating so the
  sibling-variable injection done by child BaseCommand._rootAttached
  cannot raise RuntimeError: dictionary changed size during iteration.
- Worker lifecycle (_runWorker, _stopWorker, _stop) mirrors pr.Process,
  including the self-thread guard that prevents deadlock if the user
  function triggers Root.stop() from within itself.

Cookbook example (python/pyrogue/examples/_NonBlockingCommandExample.py)

- ApplyConfig non-blocking LocalCommand plus SimpleClient hand-rolled
  Running / Result / Error poll loop.
- main() raises TimeoutError on deadline expiry to match its documented
  behaviour.
- Exported from the pyrogue.examples barrel (examples/__init__.py).
- Runnable headlessly (python -m
  pyrogue.examples._NonBlockingCommandExample); also used as the
  literalinclude source for the docs polling-pattern snippet.

Documentation (docs/src/)

- New Sphinx page
  pyrogue_tree/builtin_devices/long_running_operations.rst -- decision
  rule, side-by-side comparison table (pr.Process vs
  Command(nonBlocking=True)), parallel recipes, client-side polling
  pattern, and an explicit "Interaction with linkTimeout and
  requestStallTimeout" section explaining that linkTimeout is an idle
  timeout and that requestStallTimeout defaults to disabled.
- versionadded:: 6.14.0 on the new page and on the nonBlocking
  parameter in BaseCommand's docstring.
- :ref: cross-references inside BaseCommand.nonBlocking (_Command.py)
  and VirtualClient.linkTimeout / requestStallTimeout (_Virtual.py) so
  autoclass and help() surface the link next to the relevant parameter.
- Related Topics / What To Explore Next cross-links added to
  process.rst, core/command.rst, core/local_command.rst,
  client_interfaces/virtual.rst, client_interfaces/simple.rst, and
  builtin_devices/index.rst so a #1243-style reader lands on the new
  page from any entry point.

Tests

- tests/core/test_core_command_nonblocking_construction.py --
  backward-compat construction, RemoteCommand(nonBlocking=True)
  attaches cleanly, sibling-injection collision raises pr.NodeError.
- tests/interfaces/test_interfaces_command_nonblocking.py --
  happy-path, re-entry rejection, Root.stop() clean-join (no .worker
  thread survives shutdown), error path (str(exc) in Error,
  Running=False, Result unchanged).
- tests/interfaces/test_interfaces_simple_client_nonblocking.py --
  SimpleClient.exec returns in well under 1 s for a multi-second
  non-blocking command; polls ApplyConfigRunning via full Top.Dev.*
  paths.

Compatibility

- Default BaseCommand behaviour unchanged.  nonBlocking=True is opt-in.
- No new C++ dependencies.  Pure-Python worker, mirroring pr.Process.
- No changes to existing examples or downstream consumers
  (python/pyrogue/examples/, python/pyrogue/pydm/) unless explicitly
  migrated.
- Targeted at v6.14.0 -- does not block RC #1240 / v6.13.1.
ruck314 added 2 commits May 28, 2026 00:13
Root.waitOnUpdate() previously called Queue.join(), which only returns
when the unfinished-task counter reaches zero. updateGroup() is
per-thread, so under sustained producer load (poll thread,
listener-fired sets, UDP listeners) the counter never settles and the
call could silently take minutes - long enough to trip the ZMQ client
linkTimeout. Reported by Tristan on the LAT/SO rogue 6 deployments,
where `with updateGroup(): waitOnUpdate()` occasionally took >2 min
between the metadata-frame setup and the open_g3stream trigger.

Replace Queue.join() with a barrier: push a threading.Event sentinel
into the update queue and wait for the worker to set it. The worker
recognizes the sentinel, signals it, calls task_done(), and continues.
Items queued during the wait are not included - the wait is bounded
by queue contents at the moment of the call, which is exactly the
flush-before-next-set semantics callers actually need.

Race guard: stop() can flip _running and enqueue the None stop
sentinel between the _running check and the barrier put(). If the
barrier ends up behind the stop sentinel the worker exits before
reaching it, leaving barrier.wait() blocked forever. Wait with a
short timeout and re-check _running so the call returns after stop().

Adds tests/core/test_wait_on_update_bounded.py (5 cases) covering
idle queue, concurrent producer thread, the with updateGroup() user
pattern, listener self-storm, and the listener-callback-drains
contract. The listener-self-storm case re-fires the same variable
from its own listener (gated by a stop event) so each callback queues
another update from the worker thread, exercising the unbounded
feedback loop the bounded barrier is meant to absorb. The two
producer-thread cases hang (pytest 15s timeout) on unfixed code and
pass in ~5s after the fix; full tests/core/ (273 tests) and
tests/interfaces/ (116 + 5 skipped) green.
fix(pyrogue): bound waitOnUpdate() with a barrier sentinel
@ruck314

ruck314 commented May 31, 2026

Copy link
Copy Markdown
Contributor Author

Closing this for now (reviewing whether the feature is needed before moving forward)

@ruck314 ruck314 closed this May 31, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant