Skip to content

test(mesh): pin runtime domain-tag isolation for resume replay cache (#264)#340

Open
cagataycali wants to merge 2 commits into
strands-labs:mainfrom
cagataycali:fix/issue-264-pin
Open

test(mesh): pin runtime domain-tag isolation for resume replay cache (#264)#340
cagataycali wants to merge 2 commits into
strands-labs:mainfrom
cagataycali:fix/issue-264-pin

Conversation

@cagataycali

@cagataycali cagataycali commented Jun 5, 2026

Copy link
Copy Markdown
Member

Summary

Issue #264 asks the resume replay cache key to be domain-tagged so a Zenoh wire_zid (hex, TLS-bound) and a bridge-transport issuer_id (app metadata) that happen to share the same string cannot collide into one cache slot.

The fix is already in strands_robots/mesh/core.py (_on_safety_resume):

issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)
cache_key = (issuer_key, proof_nonce)

But the existing pins (TestResumeCacheKeyNamespaceIsolation) were source-introspection only. This PR adds two runtime behavioral pins.

Tests added (tests/mesh/test_resume_replay.py)

  • test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot: a bridge resume with peer_id='f00b' does not pre-occupy the slot of a later Zenoh resume with wire_zid='f00b' sharing the same proof_nonce; the Zenoh resume clears its lockout normally. Verified to FAIL pre-fix (conflating wire_zid or issuer_id key) and PASS post-fix.
  • test_zenoh_wire_zid_resume_replay_within_domain_still_rejected: in-domain replay of a Zenoh (wire_zid, proof_nonce) is still rejected (domain tag does not weaken replay defence).

Test output

hatch run python -m pytest tests/mesh/test_resume_replay.py -q
============================== 18 passed in 0.94s ==============================

Pre-fix verification (key reverted to old conflating shape):

hatch run python -m pytest tests/mesh/test_resume_replay.py -k bridge_peer_id_cannot_evict -q
======================= 1 failed, 17 deselected in 0.95s =======================

closes #264


§13 Review Rounds

Round Concern Fix Commit Pin Test
Rebase Conflict resolution: test_resume_replay.py hunk overlap with #342 (per-issuer fairness cap tests) -- both test sets kept, ruff-formatted 2bcf094 all 18 tests in tests/mesh/test_resume_replay.py

@yinsong1986 yinsong1986 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Tests-only PR (+107/-0 in tests/mesh/test_resume_replay.py) that converts the existing source-introspection pin for issue #264 (TestResumeCacheKeyNamespaceIsolation in test_pr6_review_pins.py) into two runtime behavioural pins. The first asserts that a bridge-transport resume keyed ("body", "f00b") cannot pre-occupy the slot of a Zenoh resume keyed ("wire", "f00b") sharing the same proof_nonce; the second asserts in-domain replay defence is preserved for (wire_zid, proof_nonce). Both tests' assertions match the production key shape on core.py:2076-2077 (issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)), and the helper _make_zenoh_envelope correctly mirrors the MAC field set the receiver re-derives when a sample carries source_id.zid (binds source_zid into the HMAC input alongside peer_id/t/lockout_elapsed_s/proof_nonce).

Verified locally on 8f6408bb: hatch run python -m pytest tests/mesh/test_resume_replay.py -q -> 16 passed.

What's good

  • Direct execution of the AGENTS.md > Review Learnings (PR #85) > Testing rule: "Pin regression tests for reviewed fixes -- every review fix gets a test that fails on pre-fix code." PR description shows the pre-fix verification (bridge_peer_id_cannot_evict fails when the key is reverted to the conflating shape), which is exactly the fail-on-pre-fix-code contract.
  • Scope discipline: tests-only diff, no production code changes, no public-API surface touched. Zero one-way-door risk for v0.4.0.
  • Tests assert on observable behaviour (_estop_lockout.is_set()) AND on the cache key shape ((("body", id), nonce) in m._resume_replay_cache), so the same test catches both a behavioural regression and a silent key-shape refactor. Belt-and-braces here is appropriate because the two existing source-introspection guards in test_pr6_review_pins.py would silently start passing if the cache was renamed or moved.
  • The Zenoh fixture (_zenoh_sample, _make_zenoh_envelope) matches the same MagicMock-based style as the pre-existing _sample/_make_envelope helpers; no new dependency footprint.

Must fix before merge

(none -- PR is ready to merge once any follow-ups are tracked)

Follow-up in v0.4.1

  • Helper hygiene: _make_zenoh_envelope and _make_envelope (pre-existing) duplicate the JSON-encoded MAC construction with the only difference being whether source_zid is bound. Worth collapsing into a single _make_envelope(..., wire_zid=None) helper in a follow-up cleanup PR -- not blocking, since the duplication is local to the test module and matches the asymmetry on the receiver side. (tests/mesh/test_resume_replay.py:41, :402).
  • Pin the pre-fix demonstration: PR description shows pytest -k bridge_peer_id_cannot_evict failing on a manually-reverted core.py, but that one-shot is not captured in CI. The existing source-introspection guards in tests/mesh/test_pr6_review_pins.py:357 (TestResumeCacheKeyNamespaceIsolation) already block re-introduction of the wire_zid or issuer_id pattern, so this is double-covered today; track as a low-priority issue if the structural guard is ever removed.

Verification suggestions

Standard CI is sufficient. Spot-check locally with:

hatch run python -m pytest tests/mesh/test_resume_replay.py tests/mesh/test_pr6_review_pins.py -q

Both files together cover the structural guard + the two new runtime pins.

@yinsong1986 yinsong1986 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Adds two runtime behavioural pins for the domain-tagged resume-replay cache key fix in Mesh._on_safety_resume (issue #264). Pre-fix the cache key was (wire_zid or issuer_id, proof_nonce), which conflated a Zenoh wire_zid and a bridge issuer_id that happened to share the same hex string into one cache slot. The fix in strands_robots/mesh/core.py:2103 keys on (("wire"|"body", id), proof_nonce). The new tests pin both the cross-domain isolation (test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot) and the within-domain replay defence (test_zenoh_wire_zid_resume_replay_within_domain_still_rejected).

Verified locally: 16/16 pass on HEAD; reverting the cache-key shape to the old (wire_zid or issuer_id, proof_nonce) form makes test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot fail as the PR description claims. Pre-fix breakage reproduces exactly.

What's good

  • Test-only diff (+109 / -0), scope-disciplined.
  • Pre-fix verification reproduced on this box; the test bites the documented bug.
  • Pins both halves of the contract: cross-domain MUST be isolated AND within-domain MUST still reject replays. A future regression that flattens the key in either direction will trip one of these.
  • Asserts on observable side-effects (_estop_lockout.is_set()) AND on the cache key tuple shape, so the AGENTS.md "pin regression tests for reviewed fixes" rule is satisfied: the next refactor cannot silently restore the conflation.
  • Fixture _make_zenoh_envelope mirrors the receiver-side MAC contract (source_zid in MAC inputs and body), so the test exercises the real verify path rather than short-circuiting it. The _zenoh_sample helper produces strings that pass _extract_sample_source_zid's ^[0-9a-f]{1,32}$ guard.
  • ASCII-only, no host paths, no emojis.

Must fix before merge

(none — PR is ready to merge once any follow-ups are tracked)

Follow-up in v0.4.1

  • Module docstring stale post-#264. tests/mesh/test_resume_replay.py lines 1–13 still describe the cache as keyed on ((issuer_peer_id, proof_nonce) tuple). After this PR's fix the production key is (("wire"|"body", id), proof_nonce). The PR didn't introduce the drift but is the natural moment to refresh the bullet. Pure docs polish; not blocking. Track as a small docs PR or fold into the next mesh-touching change.
  • Test suite reaches deeply into private mesh state. m._resume_replay_cache, m._estop_lockout, m._on_safety_resume(...) are all underscore-prefixed. The two new tests follow the file's existing convention so the diff is consistent, but the larger pattern is worth a tracking issue: AGENTS.md "Test behavior, not implementation" is in partial tension with assertions on internal cache-key tuple shape. Current shape is justified (the cache key is the contract being fixed) but if _resume_replay_cache evolves — per-receiver scoping, HMAC fingerprinting, etc. — every shape assertion across the file will need to be updated in lockstep.

Verification suggestions

Standard CI is sufficient. To independently spot-check the pre-fix verification on a local checkout:

# revert the domain-tagged key shape locally
python -c "import re,sys;p='strands_robots/mesh/core.py';s=open(p).read();open(p,'w').write(s.replace('issuer_key = (\"wire\", wire_zid) if wire_zid is not None else (\"body\", issuer_id)\n        cache_key = (issuer_key, proof_nonce)','cache_key = (wire_zid or issuer_id, proof_nonce)'))"
hatch run python -m pytest tests/mesh/test_resume_replay.py::test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot -q  # expect FAIL
git checkout -- strands_robots/mesh/core.py
hatch run python -m pytest tests/mesh/test_resume_replay.py -q  # expect 16 passed

Comment thread tests/mesh/test_resume_replay.py
Comment thread tests/mesh/test_resume_replay.py
@cagataycali cagataycali added mesh Zenoh mesh networking / fleet coordination security quality labels Jun 6, 2026

@yinsong1986 yinsong1986 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Tests-only PR (+109/-0 in tests/mesh/test_resume_replay.py) that pins runtime behavior of the domain-tagged resume replay cache key shipped in core.py:_on_safety_resume. Two new tests:

  1. test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot — the regression pin for #264. Same id string (f00b) on the bridge body-peer side and the Zenoh wire side with a shared proof_nonce must occupy distinct cache slots; the PR description verifies this fails pre-fix and passes post-fix.
  2. test_zenoh_wire_zid_resume_replay_within_domain_still_rejected — the symmetry pin: the new domain tag must not weaken in-domain replay defence.

Verified locally: python -m pytest tests/mesh/test_resume_replay.py -q -> 16 passed in 2.88s. Helpers _make_zenoh_envelope and _zenoh_sample correctly mirror the production MAC construction (source_zid included only when wire_zid is not None, matching core.py:2052-2058) and the _ZENOH_ZID_PATTERN = ^[0-9a-f]{1,32}$ constraint at core.py:229 (both "f00b" and "abcd1234" are valid hex digests of 1..32 chars).

What's good

  • Textbook regression-pin shape per AGENTS.md > Review Learnings (#85) > 'Pin regression tests for reviewed fixes' — both the bug-shape (cross-domain collision) and the symmetry case (in-domain replay still rejected) are pinned.
  • Pre-fix failure is explicitly demonstrated in the PR body (1 failed, 15 deselected after reverting the key shape) — this is the bar AGENTS.md asks for and PRs often skip.
  • Scope discipline: no production code touched, no public API / wire format / persisted state changes, no one-way doors.
  • Uses monkeypatch.setenv per AGENTS.md > Review Learnings (#86) > 'Testing Patterns'.
  • No host paths, no emojis, no orphan combining marks.
  • MAC-binding fields in _make_zenoh_envelope exactly mirror the receiver-side derivation at strands_robots/mesh/core.py:2046-2063 — this is the part most regression pins get wrong.

Must fix before merge

(none — PR is ready to merge once any follow-ups are tracked)

Follow-up in v0.4.1

  • tests/mesh/test_resume_replay.py:413import json as _json is a re-import of the module-level import json already at line 17. Mirrors the existing style in _make_envelope so harmless, but a one-line cleanup that drops the duplicate alias from both helpers would tidy the module. Pure style; track only if a broader test-helper pass happens.
  • tests/mesh/test_resume_replay.py:464,488 — both new tests assert on the precise tuple shape stored in the private m._resume_replay_cache dict. This is intentional (the key-shape IS the #264 fence) but it couples the test to an internal data structure; a follow-up could expose a small read-only inspector (Mesh._resume_cache_contains(issuer_key, nonce)) so future refactors of the cache representation don't force test churn. Not blocking.
  • _make_zenoh_envelope / _zenoh_sample helpers will likely be reused by future mesh tests covering wire-zid binding (e.g. #266 estop-cache symmetry, any future source_zid work). Promoting them to tests/mesh/conftest.py next time another test file needs them avoids a copy-paste fork.

Verification suggestions

Standard CI is sufficient. For paranoia, the pre-fix verification claim can be reproduced by reverting strands_robots/mesh/core.py:2083 to the conflating shape (issuer_key = wire_zid or issuer_id) and re-running:

hatch run python -m pytest tests/mesh/test_resume_replay.py::test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot -q

Must report 1 failed pre-fix, 1 passed post-fix. The PR description shows this run; spot-check is optional.

Comment thread tests/mesh/test_resume_replay.py
Comment thread tests/mesh/test_resume_replay.py
Comment thread tests/mesh/test_resume_replay.py
@cagataycali

Copy link
Copy Markdown
Member Author

Status: keeping this open — confirmed distinct value, not redundant.

Verified against strands_robots/mesh/core.py on main: the domain-tagged resume cache key this PR pins is already live in _on_safety_resume:

# Domain-tagged key prevents namespace collision between Zenoh
# wire_zid (hex, TLS-bound) and body issuer_id (app metadata).
issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)
cache_key = (issuer_key, proof_nonce)

That means this PR is tests-only regression coverage for a behavior already in main (wire/body key-collision fence, #264) — a genuinely different property from the per-issuer fairness cap that #342 merged (and which made the now-closed #341 redundant). The reviewer's ready-to-merge-once-follow-ups-tracked assessment still stands.

The CONFLICTING status is therefore almost certainly textual drift in tests/mesh/test_resume_replay.py (#342 added 79 lines to the same test file), not a behavioral conflict. The tests should pass against main once rebased.

Next action to unblock: rebase fix/issue-264-pin onto main, resolving the test_resume_replay.py hunk overlap (additive on both sides — keep both test sets), then re-request review. This is the highest-value follow-up; it converts an approved-modulo-tracking PR into a mergeable one with real coverage.

cagataycali and others added 2 commits June 9, 2026 15:57
…trands-labs#264)

The resume replay cache key is domain-tagged ("wire", wire_zid) vs
("body", issuer_id) in _on_safety_resume so a Zenoh wire_zid and a
bridge issuer_id sharing the same string never collide. Existing pins
were source-introspection only. Add two runtime behavioral pins:

- test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot: a bridge
  resume with peer_id=f00b does not pre-occupy the slot of a later
  Zenoh resume with wire_zid=f00b sharing the same proof_nonce; the
  Zenoh resume clears its lockout. Fails pre-fix (conflating
  wire_zid or issuer_id key), passes post-fix.
- test_zenoh_wire_zid_resume_replay_within_domain_still_rejected:
  in-domain replay of a Zenoh (wire_zid, proof_nonce) is still rejected.

closes strands-labs#264
Whitespace-only normalization to resolve the format --check failure.
No logic change.

@yinsong1986 yinsong1986 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Tests-only PR (+109/-0 in tests/mesh/test_resume_replay.py) that adds two runtime behavioural pins for the domain-tagged resume-replay cache key in Mesh._on_safety_resume (issue #264). The first asserts a bridge-transport resume keyed ("body", id) cannot pre-occupy the slot of a Zenoh resume keyed ("wire", id) sharing the same proof_nonce; the second asserts in-domain replay of a (wire_zid, proof_nonce) pair is still rejected. Author reports pre-fix FAIL / post-fix PASS, which is exactly the regression-pin shape AGENTS.md > Review Learnings (#85) > Testing > 'Pin regression tests for reviewed fixes' calls for.

What's good

  • Pre-fix verification documented in the PR body (one test fails on the conflating key, all 18 pass on the fixed key).
  • Helpers (_zenoh_sample, _make_zenoh_envelope) faithfully mirror what _extract_sample_source_zid reads and rebind the MAC to include source_zid, so the pins exercise the real receiver path rather than a stub.
  • Scope discipline: no production code touched; existing TestResumeCacheKeyNamespaceIsolation source-introspection pin is retained alongside the new runtime pins.

Must fix before merge

No blocking concerns found.

@cagataycali cagataycali moved this from Backlog to In review in Strands Labs - Robots Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

mesh Zenoh mesh networking / fleet coordination quality security

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

mesh(safety): prefix resume cache keys with domain tag to prevent wire_zid/issuer_id namespace collision

2 participants