test(mesh): pin runtime domain-tag isolation for resume replay cache (#264) #340
cagataycali wants to merge 2 commits into
yinsong1986
left a comment
Summary
Tests-only PR (+107/-0 in tests/mesh/test_resume_replay.py) that converts the existing source-introspection pin for issue #264 (TestResumeCacheKeyNamespaceIsolation in test_pr6_review_pins.py) into two runtime behavioural pins. The first asserts that a bridge-transport resume keyed ("body", "f00b") cannot pre-occupy the slot of a Zenoh resume keyed ("wire", "f00b") sharing the same proof_nonce; the second asserts in-domain replay defence is preserved for (wire_zid, proof_nonce). Both tests' assertions match the production key shape on core.py:2076-2077 (issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)), and the helper _make_zenoh_envelope correctly mirrors the MAC field set the receiver re-derives when a sample carries source_id.zid (binds source_zid into the HMAC input alongside peer_id/t/lockout_elapsed_s/proof_nonce).
Verified locally on 8f6408bb: hatch run python -m pytest tests/mesh/test_resume_replay.py -q -> 16 passed.
What's good
- Direct execution of the AGENTS.md > Review Learnings (PR #85) > Testing rule: "Pin regression tests for reviewed fixes -- every review fix gets a test that fails on pre-fix code." PR description shows the pre-fix verification (bridge_peer_id_cannot_evict fails when the key is reverted to the conflating shape), which is exactly the fail-on-pre-fix-code contract.
- Scope discipline: tests-only diff, no production code changes, no public-API surface touched. Zero one-way-door risk for v0.4.0.
- Tests assert on observable behaviour (_estop_lockout.is_set()) AND on the cache key shape ((("body", id), nonce) in m._resume_replay_cache), so the same test catches both a behavioural regression and a silent key-shape refactor. Belt-and-braces here is appropriate because the two existing source-introspection guards in test_pr6_review_pins.py would silently start passing if the cache was renamed or moved.
- The Zenoh fixture (_zenoh_sample, _make_zenoh_envelope) matches the same MagicMock-based style as the pre-existing _sample / _make_envelope helpers; no new dependency footprint.
Must fix before merge
(none -- PR is ready to merge once any follow-ups are tracked)
Follow-up in v0.4.1
- Helper hygiene: _make_zenoh_envelope and _make_envelope (pre-existing) duplicate the JSON-encoded MAC construction with the only difference being whether source_zid is bound. Worth collapsing into a single _make_envelope(..., wire_zid=None) helper in a follow-up cleanup PR -- not blocking, since the duplication is local to the test module and matches the asymmetry on the receiver side. (tests/mesh/test_resume_replay.py:41, :402).
- Pin the pre-fix demonstration: PR description shows pytest -k bridge_peer_id_cannot_evict failing on a manually-reverted core.py, but that one-shot is not captured in CI. The existing source-introspection guards in tests/mesh/test_pr6_review_pins.py:357 (TestResumeCacheKeyNamespaceIsolation) already block re-introduction of the wire_zid or issuer_id pattern, so this is double-covered today; track as a low-priority issue if the structural guard is ever removed.
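The helper-hygiene follow-up could look roughly like the sketch below. It is not the module's actual helper: the function names make_envelope and verify_envelope, the "mac" key, the sorted-key JSON encoding, and HMAC-SHA256 are all assumptions made for illustration. Only the field names (peer_id, t, lockout_elapsed_s, proof_nonce, source_zid) and the optional wire_zid parameter come from the review.

```python
import hashlib
import hmac
import json
from typing import Optional


def make_envelope(
    secret: bytes,
    peer_id: str,
    proof_nonce: str,
    t: float,
    lockout_elapsed_s: float = 1.0,
    wire_zid: Optional[str] = None,
) -> dict:
    """Hypothetical unified test helper: one code path for both transports."""
    fields = {
        "peer_id": peer_id,
        "t": t,
        "lockout_elapsed_s": lockout_elapsed_s,
        "proof_nonce": proof_nonce,
    }
    # Bind source_zid into the MAC input only for the Zenoh (wire) case.
    if wire_zid is not None:
        fields["source_zid"] = wire_zid
    # Assumed canonical encoding: compact JSON with sorted keys.
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":")).encode()
    mac = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return {**fields, "mac": mac}


def verify_envelope(secret: bytes, envelope: dict) -> bool:
    """Recompute the MAC over every field except the MAC itself."""
    fields = {k: v for k, v in envelope.items() if k != "mac"}
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":")).encode()
    expected = hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, envelope["mac"])


bridge_env = make_envelope(b"test-secret", "f00b", "nonce-1", t=100.0)
zenoh_env = make_envelope(b"test-secret", "f00b", "nonce-1", t=100.0, wire_zid="f00b")
```

With a single helper, the only difference between the two envelopes is whether source_zid appears in the MAC input, which mirrors the receiver-side asymmetry the review describes.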
Verification suggestions
Standard CI is sufficient. Spot-check locally with:
hatch run python -m pytest tests/mesh/test_resume_replay.py tests/mesh/test_pr6_review_pins.py -q
Both files together cover the structural guard + the two new runtime pins.
yinsong1986
left a comment
Summary
Adds two runtime behavioural pins for the domain-tagged resume-replay cache key fix in Mesh._on_safety_resume (issue #264). Pre-fix the cache key was (wire_zid or issuer_id, proof_nonce), which conflated a Zenoh wire_zid and a bridge issuer_id that happened to share the same hex string into one cache slot. The fix in strands_robots/mesh/core.py:2103 keys on (("wire"|"body", id), proof_nonce). The new tests pin both the cross-domain isolation (test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot) and the within-domain replay defence (test_zenoh_wire_zid_resume_replay_within_domain_still_rejected).
Verified locally: 16/16 pass on HEAD; reverting the cache-key shape to the old (wire_zid or issuer_id, proof_nonce) form makes test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot fail as the PR description claims. Pre-fix breakage reproduces exactly.
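The collision described above can be reproduced in isolation. The two key expressions below are the ones quoted in this review; the function names and the sample values (robot-a, nonce-1) are hypothetical and exist only to make the comparison runnable.

```python
from typing import Optional, Tuple


def old_cache_key(wire_zid: Optional[str], issuer_id: str, proof_nonce: str) -> Tuple:
    # Pre-fix shape: whichever id is present is used without a domain tag.
    return (wire_zid or issuer_id, proof_nonce)


def tagged_cache_key(wire_zid: Optional[str], issuer_id: str, proof_nonce: str) -> Tuple:
    # Post-fix shape: the id is tagged with the domain it came from.
    issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)
    return (issuer_key, proof_nonce)


# A bridge resume has no wire_zid; its body issuer_id happens to be "f00b".
bridge = dict(wire_zid=None, issuer_id="f00b", proof_nonce="nonce-1")
# A Zenoh resume carries wire_zid "f00b" and reuses the same proof_nonce.
zenoh = dict(wire_zid="f00b", issuer_id="robot-a", proof_nonce="nonce-1")

collides_pre_fix = old_cache_key(**bridge) == old_cache_key(**zenoh)
collides_post_fix = tagged_cache_key(**bridge) == tagged_cache_key(**zenoh)
```

Under the old shape both resumes map to ("f00b", "nonce-1"), so whichever arrives first occupies the slot; under the tagged shape the keys differ in their first element.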
What's good
- Test-only diff (+109 / -0), scope-disciplined.
- Pre-fix verification reproduced on this box; the test bites the documented bug.
- Pins both halves of the contract: cross-domain MUST be isolated AND within-domain MUST still reject replays. A future regression that flattens the key in either direction will trip one of these.
- Asserts on observable side-effects (_estop_lockout.is_set()) AND on the cache key tuple shape, so the AGENTS.md "pin regression tests for reviewed fixes" rule is satisfied: the next refactor cannot silently restore the conflation.
- Fixture _make_zenoh_envelope mirrors the receiver-side MAC contract (source_zid in MAC inputs and body), so the test exercises the real verify path rather than short-circuiting it. The _zenoh_sample helper produces strings that pass _extract_sample_source_zid's ^[0-9a-f]{1,32}$ guard.
- ASCII-only, no host paths, no emojis.
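The two halves of the contract (cross-domain isolation and in-domain replay rejection) can be illustrated with a toy cache. This is a simplified stand-in, not Mesh._on_safety_resume: the class name, the accept method, and the use of a plain set are assumptions; only the key expression is taken from the review.

```python
from typing import Optional


class ToyResumeReplayCache:
    """Hypothetical stand-in for the resume replay cache, key shape only."""

    def __init__(self) -> None:
        self._seen = set()

    def accept(self, wire_zid: Optional[str], issuer_id: str, proof_nonce: str) -> bool:
        """Return True for a first-seen resume, False for a replay."""
        issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)
        cache_key = (issuer_key, proof_nonce)
        if cache_key in self._seen:
            return False  # same domain, same id, same nonce: replay
        self._seen.add(cache_key)
        return True


cache = ToyResumeReplayCache()
# Bridge resume arrives first with body issuer_id "f00b".
bridge_first = cache.accept(None, "f00b", "nonce-1")
# Zenoh resume with wire_zid "f00b" and the same nonce must still be accepted.
zenoh_after_bridge = cache.accept("f00b", "robot-a", "nonce-1")
# Replaying the same Zenoh (wire_zid, proof_nonce) must be rejected.
zenoh_replay = cache.accept("f00b", "robot-a", "nonce-1")
```

Flattening the key in either direction breaks one of the last two results, which is the property the two new pins are meant to hold in place.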
Must fix before merge
(none — PR is ready to merge once any follow-ups are tracked)
Follow-up in v0.4.1
- Module docstring stale post-#264. tests/mesh/test_resume_replay.py lines 1–13 still describe the cache as keyed on an (issuer_peer_id, proof_nonce) tuple. After this PR's fix the production key is (("wire"|"body", id), proof_nonce). The PR didn't introduce the drift but is the natural moment to refresh the bullet. Pure docs polish; not blocking. Track as a small docs PR or fold into the next mesh-touching change.
- Test suite reaches deeply into private mesh state. m._resume_replay_cache, m._estop_lockout, m._on_safety_resume(...) are all underscore-prefixed. The two new tests follow the file's existing convention so the diff is consistent, but the larger pattern is worth a tracking issue: AGENTS.md "Test behavior, not implementation" is in partial tension with assertions on internal cache-key tuple shape. Current shape is justified (the cache key is the contract being fixed), but if _resume_replay_cache evolves (per-receiver scoping, HMAC fingerprinting, etc.), every shape assertion across the file will need to be updated in lockstep.
Verification suggestions
Standard CI is sufficient. To independently spot-check the pre-fix verification on a local checkout:
# revert the domain-tagged key shape locally
python -c "import re,sys;p='strands_robots/mesh/core.py';s=open(p).read();open(p,'w').write(s.replace('issuer_key = (\"wire\", wire_zid) if wire_zid is not None else (\"body\", issuer_id)\n cache_key = (issuer_key, proof_nonce)','cache_key = (wire_zid or issuer_id, proof_nonce)'))"
hatch run python -m pytest tests/mesh/test_resume_replay.py::test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot -q # expect FAIL
git checkout -- strands_robots/mesh/core.py
hatch run python -m pytest tests/mesh/test_resume_replay.py -q # expect 16 passed
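The one-liner above relies on str.replace, which only matches if the whitespace between the two source lines is exactly as written. A whitespace-tolerant variant might look like the sketch below; revert_key_shape is a hypothetical helper, and it assumes the two statements sit on adjacent lines as quoted in this review.

```python
import re

# Matches the tagged two-line form, tolerating any indentation on the second line.
TAGGED_KEY = re.compile(
    r'issuer_key = \("wire", wire_zid\) if wire_zid is not None '
    r'else \("body", issuer_id\)\n'
    r'[ \t]*cache_key = \(issuer_key, proof_nonce\)'
)


def revert_key_shape(source: str) -> str:
    """Return source with the tagged key replaced by the old conflating key."""
    # The match starts after the first line's indentation, so that
    # indentation is preserved for the replacement statement.
    return TAGGED_KEY.sub(
        "cache_key = (wire_zid or issuer_id, proof_nonce)", source, count=1
    )


sample = (
    '        issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)\n'
    "        cache_key = (issuer_key, proof_nonce)\n"
)
reverted = revert_key_shape(sample)
```

Applied to the real file, the result would be written back before running the single test and restored afterwards with git checkout, as in the commands above.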
yinsong1986
left a comment
Summary
Tests-only PR (+109/-0 in tests/mesh/test_resume_replay.py) that pins runtime behavior of the domain-tagged resume replay cache key shipped in core.py:_on_safety_resume. Two new tests:
- test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot: the regression pin for #264. Same id string (f00b) on the bridge body-peer side and the Zenoh wire side with a shared proof_nonce must occupy distinct cache slots; the PR description verifies this fails pre-fix and passes post-fix.
- test_zenoh_wire_zid_resume_replay_within_domain_still_rejected: the symmetry pin. The new domain tag must not weaken in-domain replay defence.
Verified locally: python -m pytest tests/mesh/test_resume_replay.py -q -> 16 passed in 2.88s. Helpers _make_zenoh_envelope and _zenoh_sample correctly mirror the production MAC construction (source_zid included only when wire_zid is not None, matching core.py:2052-2058) and the _ZENOH_ZID_PATTERN = ^[0-9a-f]{1,32}$ constraint at core.py:229 (both "f00b" and "abcd1234" are valid hex digests of 1..32 chars).
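The ZID constraint mentioned above is easy to check on its own. The pattern string is the one quoted from the review; the is_valid_zid wrapper and the use of fullmatch are assumptions made for this sketch, not the production _extract_sample_source_zid logic.

```python
import re

# Pattern as quoted in the review: 1 to 32 lowercase hex characters.
ZENOH_ZID_PATTERN = re.compile(r"^[0-9a-f]{1,32}$")


def is_valid_zid(value: object) -> bool:
    """Hypothetical wrapper: True only for strings the pattern fully matches."""
    return isinstance(value, str) and ZENOH_ZID_PATTERN.fullmatch(value) is not None


fixture_ids_ok = all(is_valid_zid(z) for z in ("f00b", "abcd1234"))
```

Uppercase hex, the empty string, and anything longer than 32 characters are rejected, so test fixtures need lowercase ids like the two used in this PR.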
What's good
- Textbook regression-pin shape per AGENTS.md > Review Learnings (#85) > 'Pin regression tests for reviewed fixes' — both the bug-shape (cross-domain collision) and the symmetry case (in-domain replay still rejected) are pinned.
- Pre-fix failure is explicitly demonstrated in the PR body (1 failed, 15 deselected after reverting the key shape); this is the bar AGENTS.md asks for and PRs often skip.
- Scope discipline: no production code touched, no public API / wire format / persisted state changes, no one-way doors.
- Uses monkeypatch.setenv per AGENTS.md > Review Learnings (#86) > 'Testing Patterns'.
- No host paths, no emojis, no orphan combining marks.
- MAC-binding fields in _make_zenoh_envelope exactly mirror the receiver-side derivation at strands_robots/mesh/core.py:2046-2063; this is the part most regression pins get wrong.
Must fix before merge
(none — PR is ready to merge once any follow-ups are tracked)
Follow-up in v0.4.1
- tests/mesh/test_resume_replay.py:413: import json as _json is a re-import of the module-level import json already at line 17. Mirrors the existing style in _make_envelope so harmless, but a one-line cleanup that drops the duplicate alias from both helpers would tidy the module. Pure style; track only if a broader test-helper pass happens.
- tests/mesh/test_resume_replay.py:464,488: both new tests assert on the precise tuple shape stored in the private m._resume_replay_cache dict. This is intentional (the key-shape IS the #264 fence) but it couples the test to an internal data structure; a follow-up could expose a small read-only inspector (Mesh._resume_cache_contains(issuer_key, nonce)) so future refactors of the cache representation don't force test churn. Not blocking.
- _make_zenoh_envelope / _zenoh_sample helpers will likely be reused by future mesh tests covering wire-zid binding (e.g. #266 estop-cache symmetry, any future source_zid work). Promoting them to tests/mesh/conftest.py next time another test file needs them avoids a copy-paste fork.
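The read-only inspector suggested in the second follow-up might be as small as the sketch below. MeshCacheSketch and _record_resume are hypothetical, and the timestamp stored as the dict value is an assumption; only the _resume_replay_cache name, the key shape, and the proposed _resume_cache_contains(issuer_key, nonce) signature come from the review.

```python
from typing import Dict, Optional, Tuple

CacheKey = Tuple[Tuple[str, str], str]


class MeshCacheSketch:
    """Hypothetical slice of Mesh showing only the replay-cache inspector."""

    def __init__(self) -> None:
        # Assumed representation: cache key -> time the resume was recorded.
        self._resume_replay_cache: Dict[CacheKey, float] = {}

    def _record_resume(
        self, wire_zid: Optional[str], issuer_id: str, proof_nonce: str, now: float
    ) -> None:
        issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)
        self._resume_replay_cache[(issuer_key, proof_nonce)] = now

    def _resume_cache_contains(self, issuer_key: Tuple[str, str], nonce: str) -> bool:
        # Read-only view: tests ask a question instead of touching the dict.
        return (issuer_key, nonce) in self._resume_replay_cache


m = MeshCacheSketch()
m._record_resume(None, "f00b", "nonce-1", now=0.0)
has_body_slot = m._resume_cache_contains(("body", "f00b"), "nonce-1")
has_wire_slot = m._resume_cache_contains(("wire", "f00b"), "nonce-1")
```

Tests written against the inspector would keep pinning the domain tag while leaving the underlying container free to change.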
Verification suggestions
Standard CI is sufficient. For paranoia, the pre-fix verification claim can be reproduced by reverting strands_robots/mesh/core.py:2083 to the conflating shape (issuer_key = wire_zid or issuer_id) and re-running:
hatch run python -m pytest tests/mesh/test_resume_replay.py::test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot -q
Must report 1 failed pre-fix, 1 passed post-fix. The PR description shows this run; spot-check is optional.
Status: keeping this open; confirmed distinct value, not redundant. Verified against

# Domain-tagged key prevents namespace collision between Zenoh
# wire_zid (hex, TLS-bound) and body issuer_id (app metadata).
issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)
cache_key = (issuer_key, proof_nonce)

That means this PR is tests-only regression coverage for a behavior already in The Next action to unblock: rebase
…trands-labs#264)
The resume replay cache key is domain-tagged ("wire", wire_zid) vs ("body", issuer_id) in _on_safety_resume so a Zenoh wire_zid and a bridge issuer_id sharing the same string never collide. Existing pins were source-introspection only. Add two runtime behavioral pins:
- test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot: a bridge resume with peer_id=f00b does not pre-occupy the slot of a later Zenoh resume with wire_zid=f00b sharing the same proof_nonce; the Zenoh resume clears its lockout. Fails pre-fix (conflating wire_zid or issuer_id key), passes post-fix.
- test_zenoh_wire_zid_resume_replay_within_domain_still_rejected: in-domain replay of a Zenoh (wire_zid, proof_nonce) is still rejected.
closes strands-labs#264
Whitespace-only normalization to resolve the format --check failure. No logic change.
7dfec5e to 2bcf094
yinsong1986
left a comment
Summary
Tests-only PR (+109/-0 in tests/mesh/test_resume_replay.py) that adds two runtime behavioural pins for the domain-tagged resume-replay cache key in Mesh._on_safety_resume (issue #264). The first asserts a bridge-transport resume keyed ("body", id) cannot pre-occupy the slot of a Zenoh resume keyed ("wire", id) sharing the same proof_nonce; the second asserts in-domain replay of a (wire_zid, proof_nonce) pair is still rejected. Author reports pre-fix FAIL / post-fix PASS, which is exactly the regression-pin shape AGENTS.md > Review Learnings (#85) > Testing > 'Pin regression tests for reviewed fixes' calls for.
What's good
- Pre-fix verification documented in the PR body (one test fails on the conflating key, all 18 pass on the fixed key).
- Helpers (_zenoh_sample, _make_zenoh_envelope) faithfully mirror what _extract_sample_source_zid reads and rebind the MAC to include source_zid, so the pins exercise the real receiver path rather than a stub.
- Scope discipline: no production code touched; existing TestResumeCacheKeyNamespaceIsolation source-introspection pin is retained alongside the new runtime pins.
Must fix before merge
No blocking concerns found.
Summary
Issue #264 asks the resume replay cache key to be domain-tagged so a Zenoh wire_zid (hex, TLS-bound) and a bridge-transport issuer_id (app metadata) that happen to share the same string cannot collide into one cache slot.
The fix is already in strands_robots/mesh/core.py (_on_safety_resume):
issuer_key = ("wire", wire_zid) if wire_zid is not None else ("body", issuer_id)
cache_key = (issuer_key, proof_nonce)
But the existing pins (TestResumeCacheKeyNamespaceIsolation) were source-introspection only. This PR adds two runtime behavioral pins.
Tests added (tests/mesh/test_resume_replay.py)
- test_bridge_peer_id_cannot_evict_zenoh_wire_zid_resume_slot: a bridge resume with peer_id='f00b' does not pre-occupy the slot of a later Zenoh resume with wire_zid='f00b' sharing the same proof_nonce; the Zenoh resume clears its lockout normally. Verified to FAIL pre-fix (conflating wire_zid or issuer_id key) and PASS post-fix.
- test_zenoh_wire_zid_resume_replay_within_domain_still_rejected: in-domain replay of a Zenoh (wire_zid, proof_nonce) is still rejected (domain tag does not weaken replay defence).
Test output
Pre-fix verification (key reverted to old conflating shape):
closes #264
§13 Review Rounds
test_resume_replay.py hunk overlap with #342 (per-issuer fairness cap tests) -- both test sets kept, ruff-formatted; 2bcf094; tests/mesh/test_resume_replay.py