Milestone 4 establishes infrastructure for offline decision evaluation. This guide explains how to use the current replay helpers to check recorded decision metadata and policy-version consistency against historical evidence.
The replay system allows you to:
- Load historical evidence from your Switchboard logs
- Check recorded decisions against a target policy version
- Compare policy-version alignment across sessions and turns
- Validate policy changes before deploying them
Note: The current replay helpers do not recompute routing decisions from raw inputs. The matches field only indicates whether the policy version recorded in the evidence equals the policyVersion you provide.
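To make the note concrete, the check behind `matches` can be pictured as a plain string comparison. This is a minimal sketch, not the helper's actual implementation: `versionMatches` is a hypothetical function, and it assumes each evidence record stores its version in a field named `policyVersion`, which the source does not confirm.

```javascript
// Hypothetical stand-in for the version check; NOT the real replayRoutingDecision.
// Assumes each evidence record carries a `policyVersion` string (assumed field name).
function versionMatches(evidence, policyVersion) {
  return evidence.policyVersion === policyVersion;
}

// Illustrative record; the field layout is an assumption.
const recorded = {
  selectedTargetId: 'anthropic-coder',
  policyVersion: '0.1.0-experimental'
};

console.log(versionMatches(recorded, '0.1.0-experimental')); // true
console.log(versionMatches(recorded, '0.2.0'));              // false
```

Under this reading, a mismatch says nothing about whether a different target would have been selected, only that the record was logged under another version.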
node --input-type=module -e "
import { loadSessionEvidence } from './src/switchboard/workflow.js';
const evidence = loadSessionEvidence({
logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
sessionId: 'my-session-123'
});
console.log('Loaded ' + evidence.length + ' decisions from session');
"
node --input-type=module -e "
import { loadSessionEvidence, replayRoutingDecision } from './src/switchboard/workflow.js';
const evidence = loadSessionEvidence({
logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
sessionId: 'my-session-123'
});
const result = replayRoutingDecision({
evidence: evidence[0],
policyVersion: '0.1.0-experimental'
});
console.log(result);
// {
// status: 'replayed',
// originalSelectedTargetId: 'anthropic-coder',
// matches: true,
// confidence: 0.92,
// ...
// }
"
node --input-type=module -e "
import { loadSessionEvidence, evaluatePolicyOnEvidence } from './src/switchboard/workflow.js';
const evidence = loadSessionEvidence({
logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
sessionId: 'my-session-123'
});
const evaluation = evaluatePolicyOnEvidence({
evidenceSet: evidence,
policyVersion: '0.1.0-experimental'
});
console.log(evaluation);
// {
// status: 'evaluated',
// totalDecisions: 15,
// matchCount: 13,
// matchRate: '86.7%',
// avgConfidence: '88.2%',
// switchingReasons: { no_switch: 10, continuity_cost: 3, escalation: 2 },
// ...
// }
"
You want to validate policy rollout consistency:
- Change: You introduced policy version 0.2.0 and want to check which historical decisions were logged under older versions.
node --input-type=module -e "
import { loadSessionEvidence, evaluatePolicyOnEvidence } from './src/switchboard/workflow.js';
const evidence = loadSessionEvidence({
logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
sessionId: 'my-session-123'
});
const baseline = evaluatePolicyOnEvidence({
evidenceSet: evidence,
policyVersion: '0.1.0-experimental'
});
console.log(JSON.stringify(baseline, null, 2));
"
node --input-type=module -e "
import { loadSessionEvidence, replayRoutingDecision } from './src/switchboard/workflow.js';
const evidence = loadSessionEvidence({
logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
sessionId: 'my-session-123'
});
const versionResults = evidence.map(e => replayRoutingDecision({ evidence: e, policyVersion: '0.2.0' }));
const mismatches = versionResults.filter(r => !r.matches);
console.log(\`Version mismatches: \${mismatches.length} / \${evidence.length}\`);
"
- If match rate is high: Most decisions were already logged under the target policy version
- If many mismatches: Evidence contains decisions logged under other policy versions
- If match rate is 0%: The requested policy version is absent from this session's evidence
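When mismatches appear, it can help to see which versions the evidence actually contains. The sketch below tallies records per recorded version. It assumes a `policyVersion` field on each evidence record (an assumed name, not confirmed by the source), and `countByPolicyVersion` is a hypothetical helper rather than part of the workflow module.

```javascript
// Count evidence records per recorded policy version.
// `policyVersion` is an assumed field name; adjust it to your log schema.
function countByPolicyVersion(evidenceSet) {
  const counts = {};
  for (const e of evidenceSet) {
    const version = e.policyVersion ?? 'unknown';
    counts[version] = (counts[version] ?? 0) + 1;
  }
  return counts;
}

// Illustrative data only.
const sample = [
  { policyVersion: '0.1.0-experimental' },
  { policyVersion: '0.1.0-experimental' },
  { policyVersion: '0.2.0' },
  {} // legacy entry without a recorded version
];

console.log(countByPolicyVersion(sample));
```

A version that is missing from the tally corresponds to the 0% case above.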
npm test
git commit -m "policy: adjust continuity-cost thresholds"
- 100% match: All recorded decisions already use the compared policy version
- >90% match: Most decisions use the compared policy version
- 75-90% match: Mixed-policy evidence; investigate rollout boundaries
- <75% match: Majority of decisions were recorded under different policy versions
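The bands above can be applied mechanically to the matchCount and totalDecisions values from an evaluation. This is a small sketch with a hypothetical helper name, `describeMatchRate`; the thresholds are copied from the list, with exactly 90% treated as part of the 75-90 band.

```javascript
// Map a match count to the interpretation bands listed above.
// Hypothetical helper; thresholds mirror the documented bands.
function describeMatchRate(matchCount, totalDecisions) {
  if (totalDecisions === 0) return 'no evidence';
  const rate = (matchCount / totalDecisions) * 100;
  if (rate === 100) return 'all decisions use the compared version';
  if (rate > 90) return 'most decisions use the compared version';
  if (rate >= 75) return 'mixed-policy evidence; investigate rollout boundaries';
  return 'majority recorded under different versions';
}

// 13 of 15 matches is about 86.7%, which falls in the 75-90 band.
console.log(describeMatchRate(13, 15));
```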
Average router confidence (0.0-1.0) in the decisions:
- >0.85: High confidence, good signal
- 0.70-0.85: Medium confidence, monitor regressions
- <0.70: Low confidence, consider escalation or review
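The sample evaluation output reports avgConfidence as a percentage string such as '88.2%', while the bands above use a 0.0-1.0 scale. The sketch below converts between the two and applies the bands. Both function names are hypothetical, and the string format is taken from the sample output rather than from a documented contract.

```javascript
// Convert a percentage string like '88.2%' to a 0.0-1.0 fraction.
function percentToFraction(percentString) {
  return parseFloat(percentString) / 100;
}

// Classify average confidence using the bands listed above.
function confidenceBand(avgConfidence) {
  if (avgConfidence > 0.85) return 'high';
  if (avgConfidence >= 0.70) return 'medium';
  return 'low';
}

console.log(confidenceBand(percentToFraction('88.2%'))); // high
```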
Shows which decision factors triggered target switches:
- no_switch: No switch (stayed on current)
- continuity_cost: Continuity cost evaluation triggered switch
- capability_gap: Hard constraint (missing capability) triggered switch
- user_override: User override triggered switch
- escalation: Escalation policy (low confidence, etc.) triggered switch
- availability: Availability constraint triggered switch
High switching frequency may indicate:
- More aggressive policy (good for certain modes)
- Over-switching (may hurt continuity)
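Switching frequency can be derived from the switchingReasons counts by treating every reason other than no_switch as a switch. This is a sketch under that assumption; `switchRate` is a hypothetical helper, and the input is the example object from the sample evaluation output.

```javascript
// Fraction of decisions that switched targets, given per-reason counts.
// Assumes every key other than `no_switch` represents a switch.
function switchRate(switchingReasons) {
  const total = Object.values(switchingReasons).reduce((sum, n) => sum + n, 0);
  if (total === 0) return 0;
  const stayed = switchingReasons.no_switch ?? 0;
  return (total - stayed) / total;
}

const rate = switchRate({ no_switch: 10, continuity_cost: 3, escalation: 2 });
console.log((rate * 100).toFixed(1) + '%'); // 33.3%
```

Whether a given rate counts as over-switching depends on the mode in use; the source gives no fixed threshold.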
Test multiple policy versions at once:
import { loadSessionEvidence, replayRoutingDecision } from './src/switchboard/workflow.js';
const evidence = loadSessionEvidence({ logPath: '...', sessionId: '...' });
const policies = ['0.1.0-experimental', '0.2.0-draft', '0.2.0-conservative'];
const results = {};
for (const policy of policies) {
const replayed = evidence.map(e => replayRoutingDecision({ evidence: e, policyVersion: policy }));
const matches = replayed.filter(r => r.matches).length;
results[policy] = {
matchCount: matches,
matchRate: ((matches / evidence.length) * 100).toFixed(1) + '%'
};
}
console.log(results);
The test suite includes pre-recorded evidence fixtures for deterministic policy evaluation:
import { replayRoutingDecision } from './src/switchboard/workflow.js';
import assert from 'assert/strict';
// Recorded evidence from a specific session
const fixtures = {
sessionId: 'test-session-123',
threadId: 'test-thread-1',
evidence: [
// ... evidence objects loaded from switchboard-turns.ndjson
]
};
// Test a policy against the loaded evidence
fixtures.evidence.forEach((e, idx) => {
const result = replayRoutingDecision({
evidence: e,
policyVersion: '0.2.0'
});
assert.equal(result.matches, true, `Decision ${idx} should match`);
});
Current replay system:
- Scope: Checks recorded metadata consistency, not re-computed routing outcomes
- Policy input: Compares against recorded policy versions only
- Outcome: Does not include outcome feedback (success/failure)
Future enhancements:
- Outcome-aware evaluation: Factor in whether decisions succeeded or failed
- True policy replay: Recompute routing decisions from evidence inputs and compare selected targets/outcomes
- A/B testing: Compare two policies head-to-head with statistical significance
- Regression detection: Automatic flagging of decisions that would have regressed
- Replay optimizations: Cache results for faster iteration
- Verify the session ID matches what's in your logs
- Check log file path is correct:
ls -la ~/.model-switchboard/switchboard-turns.ndjson
- Ensure the session has run at least one turn
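To confirm which session IDs a log actually contains, the NDJSON text can be tallied line by line. This is a sketch, not part of the replay helpers: `countTurnsBySession` is a hypothetical function, and it assumes each log line is a JSON object with a `sessionId` field, which the source does not spell out.

```javascript
// Tally log lines per session ID from NDJSON text.
// Assumes one JSON object per line with a `sessionId` field (assumed name).
function countTurnsBySession(ndjsonText) {
  const counts = {};
  for (const line of ndjsonText.split('\n')) {
    if (line.trim() === '') continue; // skip blank lines
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip lines that are not valid JSON
    }
    const id = entry?.sessionId ?? 'unknown';
    counts[id] = (counts[id] ?? 0) + 1;
  }
  return counts;
}

// Illustrative log content only.
const sampleLog = [
  '{"sessionId":"my-session-123","turn":1}',
  '{"sessionId":"my-session-123","turn":2}',
  'not json',
  '{"sessionId":"other-session","turn":1}'
].join('\n');

console.log(countTurnsBySession(sampleLog));
```

In practice the text would be the contents of switchboard-turns.ndjson, for example as read with fs.readFileSync(path, 'utf8').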
- Policy version in evidence may not match what you're comparing against
- Check that policy changes are deployed
- Verify targets registry matches what was available during original run
- Legacy log entries may not have full attribution data
- Replay on M4+ evidence (generated after this milestone) for accurate attribution
- Router Contracts — Normalized event shapes
- Attribution Store — Outcome tracking API
- Decision Log — Policy decisions and rationales