Replay Guide: Testing Router Policies

Milestone 4 establishes infrastructure for offline decision evaluation. This guide explains how to use the current replay helpers to check recorded decision metadata and policy-version consistency against historical evidence.

Overview

The replay system allows you to:

Load historical evidence from your Switchboard logs
Check recorded decisions against a target policy version
Compare policy-version alignment across sessions and turns
Check policy-version rollout consistency before deploying policy changes

Note: The current replay helpers do not recompute routing decisions from raw inputs. The matches field only indicates whether the policy version recorded in the evidence equals the policyVersion you pass in.
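To make the meaning of matches concrete, the check can be pictured as a plain string comparison. This is a minimal sketch, not the actual implementation of replayRoutingDecision. The helper name policyVersionMatches and the evidence field names policyVersion and selectedTargetId are assumptions made for illustration.

```javascript
// Hypothetical stand-in for the comparison behind `matches`.
// Assumes each evidence record stores its policy version in `policyVersion`.
function policyVersionMatches(evidence, policyVersion) {
  return evidence != null && evidence.policyVersion === policyVersion;
}

// Illustrative record; the field names are assumed, not taken from the log schema.
const recorded = {
  selectedTargetId: 'anthropic-coder',
  policyVersion: '0.1.0-experimental'
};

console.log(policyVersionMatches(recorded, '0.1.0-experimental')); // true
console.log(policyVersionMatches(recorded, '0.2.0'));              // false
```

Under this reading, a mismatch says nothing about whether a different target would have been selected, only that the record was logged under another version.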

Quick Start

1. Collect Session Evidence

node --input-type=module -e "
import { loadSessionEvidence } from './src/switchboard/workflow.js';

const evidence = loadSessionEvidence({
  logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
  sessionId: 'my-session-123'
});

console.log('Loaded ' + evidence.length + ' decisions from session');
"

2. Replay a Single Decision

node --input-type=module -e "
import { loadSessionEvidence, replayRoutingDecision } from './src/switchboard/workflow.js';

const evidence = loadSessionEvidence({
  logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
  sessionId: 'my-session-123'
});

const result = replayRoutingDecision({
  evidence: evidence[0],
  policyVersion: '0.1.0-experimental'
});

console.log(result);
// {
//   status: 'replayed',
//   originalSelectedTargetId: 'anthropic-coder',
//   matches: true,
//   confidence: 0.92,
//   ...
// }
"

3. Evaluate Policy on Full Session

node --input-type=module -e "
import { loadSessionEvidence, evaluatePolicyOnEvidence } from './src/switchboard/workflow.js';

const evidence = loadSessionEvidence({
  logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
  sessionId: 'my-session-123'
});

const evaluation = evaluatePolicyOnEvidence({
  evidenceSet: evidence,
  policyVersion: '0.1.0-experimental'
});

console.log(evaluation);
// {
//   status: 'evaluated',
//   totalDecisions: 15,
//   matchCount: 13,
//   matchRate: '86.7%',
//   avgConfidence: '88.2%',
//   switchingReasons: { no_switch: 10, continuity_cost: 3, escalation: 2 },
//   ...
// }
"

Workflow: Validate Policy Version Alignment

Scenario

You want to validate policy rollout consistency:

Change: You introduced policy version 0.2.0 and want to check which historical decisions were logged under older versions.

Steps

1. Evaluate stored evidence against the current policy version

node --input-type=module -e "
import { loadSessionEvidence, evaluatePolicyOnEvidence } from './src/switchboard/workflow.js';
const evidence = loadSessionEvidence({ 
  logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
  sessionId: 'my-session-123'
});
const baseline = evaluatePolicyOnEvidence({
  evidenceSet: evidence,
  policyVersion: '0.1.0-experimental'
});
console.log(JSON.stringify(baseline, null, 2));
"

2. Compare evidence against target policy version

node --input-type=module -e "
import { loadSessionEvidence, replayRoutingDecision } from './src/switchboard/workflow.js';
const evidence = loadSessionEvidence({ 
  logPath: process.env.HOME + '/.model-switchboard/switchboard-turns.ndjson',
  sessionId: 'my-session-123'
});
const versionResults = evidence.map(e => replayRoutingDecision({ evidence: e, policyVersion: '0.2.0' }));
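// Compare explicitly instead of negating, so interactive shells do not history-expand the expression
const mismatches = versionResults.filter(r => r.matches !== true);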
console.log(\`Version mismatches: \${mismatches.length} / \${evidence.length}\`);
"

3. Compare results

If match rate is high: Most decisions were already logged under the target policy version
If many mismatches: Evidence contains decisions logged under other policy versions
If match rate is 0%: The requested policy version is absent from this session's evidence
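When mismatches appear, it can help to see which versions the evidence was actually logged under. The sketch below tallies records per recorded version. It assumes each evidence record exposes a policyVersion field, which is a hypothetical name; adjust it to whatever your log entries use.

```javascript
// Count evidence records per recorded policy version.
// Assumes a hypothetical `policyVersion` field on each record.
function countByPolicyVersion(evidenceSet) {
  const counts = {};
  for (const record of evidenceSet) {
    const version = record.policyVersion ?? 'unknown';
    counts[version] = (counts[version] ?? 0) + 1;
  }
  return counts;
}

// Made-up sample data for illustration only.
const sample = [
  { policyVersion: '0.1.0-experimental' },
  { policyVersion: '0.1.0-experimental' },
  { policyVersion: '0.2.0' },
  {}
];

console.log(countByPolicyVersion(sample));
// { '0.1.0-experimental': 2, '0.2.0': 1, unknown: 1 }
```

A target version with a count of zero corresponds to the 0% case above.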

4. Run tests and validate

npm test
git commit -m "policy: adjust continuity-cost thresholds"

Output Interpretation

Match Rate

100% match: All recorded decisions already use the compared policy version
>90% match: Most decisions use the compared policy version
75-90% match: Mixed-policy evidence; investigate rollout boundaries
<75% match: Majority of decisions were recorded under different policy versions
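The bands above can be applied mechanically to the matchRate string shown in the Quick Start output. This is a sketch with a hypothetical helper name, describeMatchRate; treating exactly 90% as part of the 75-90 band is a choice made here, since the list does not say which band owns the boundary.

```javascript
// Map a match rate (number or string like '86.7%') to the bands listed above.
function describeMatchRate(matchRate) {
  const pct = typeof matchRate === 'string' ? parseFloat(matchRate) : matchRate;
  if (!Number.isFinite(pct)) return 'unknown';
  if (pct >= 100) return 'all decisions use the compared policy version';
  if (pct > 90) return 'most decisions use the compared policy version';
  if (pct >= 75) return 'mixed-policy evidence; investigate rollout boundaries';
  return 'majority recorded under different policy versions';
}

console.log(describeMatchRate('86.7%'));
// mixed-policy evidence; investigate rollout boundaries
```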

Confidence

Average router confidence in the recorded decisions, on a 0.0-1.0 scale (the evaluation output above reports it as a percentage string such as '88.2%'):

>0.85: High confidence, good signal
0.70-0.85: Medium confidence, monitor regressions
<0.70: Low confidence, consider escalation or review
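Because avgConfidence appears as a percentage string in the example output while the thresholds are fractions, a small normalization step avoids comparing the wrong scales. The helper name confidenceBand is hypothetical, and the sketch assumes that a string input is always a percentage.

```javascript
// Normalize a confidence value to the 0.0-1.0 scale and return its band.
// Assumes strings are percentages ('88.2%') and numbers are already fractions.
function confidenceBand(value) {
  const fraction = typeof value === 'string' ? parseFloat(value) / 100 : value;
  if (!Number.isFinite(fraction)) return 'unknown';
  if (fraction > 0.85) return 'high';
  if (fraction >= 0.70) return 'medium';
  return 'low';
}

console.log(confidenceBand('88.2%')); // high
console.log(confidenceBand(0.75));    // medium
console.log(confidenceBand(0.5));     // low
```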

Switching Reasons Distribution

Shows how often each recorded reason occurred, including turns where no switch happened:

no_switch: No switch (stayed on current)
continuity_cost: Continuity cost evaluation triggered switch
capability_gap: Hard constraint (missing capability) triggered switch
user_override: User override triggered switch
escalation: Escalation policy (low confidence, etc.) triggered switch
availability: Availability constraint triggered switch

High switching frequency may indicate:

More aggressive policy (good for certain modes)
Over-switching (may hurt continuity)
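Switching frequency can be derived from the switchingReasons object as the share of decisions whose reason is anything other than no_switch. The sketch below uses the counts from the Quick Start example; the helper name switchFrequency is hypothetical.

```javascript
// Fraction of decisions that switched targets, given per-reason counts.
function switchFrequency(switchingReasons) {
  const total = Object.values(switchingReasons).reduce((sum, n) => sum + n, 0);
  if (total === 0) return 0; // no decisions, so nothing switched
  const stayed = switchingReasons.no_switch ?? 0;
  return (total - stayed) / total;
}

const reasons = { no_switch: 10, continuity_cost: 3, escalation: 2 };
console.log((switchFrequency(reasons) * 100).toFixed(1) + '%'); // 33.3%
```

Whether a given frequency counts as high depends on the routing mode, so compare it across sessions rather than against a fixed cutoff.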

Advanced: Batch Policy Evaluation

Test multiple policy versions at once:

import { loadSessionEvidence, replayRoutingDecision } from './src/switchboard/workflow.js';

const evidence = loadSessionEvidence({ logPath: '...', sessionId: '...' });

const policies = ['0.1.0-experimental', '0.2.0-draft', '0.2.0-conservative'];
const results = {};

for (const policy of policies) {
  const replayed = evidence.map(e => replayRoutingDecision({ evidence: e, policyVersion: policy }));
  const matches = replayed.filter(r => r.matches).length;
  results[policy] = {
    matchCount: matches,
    matchRate: ((matches / evidence.length) * 100).toFixed(1) + '%'
  };
}

console.log(results);

Fixtures for Testing

The test suite includes pre-recorded evidence fixtures for deterministic policy evaluation:

import { replayRoutingDecision } from './src/switchboard/workflow.js';
import assert from 'assert/strict';

// Recorded evidence from a specific session
const fixtures = {
  sessionId: 'test-session-123',
  threadId: 'test-thread-1',
  evidence: [
    // ... evidence objects loaded from switchboard-turns.ndjson
  ]
};

// Test a policy against the loaded evidence
fixtures.evidence.forEach((e, idx) => {
  const result = replayRoutingDecision({ 
    evidence: e,
    policyVersion: '0.2.0'
  });
  assert.equal(result.matches, true, `Decision ${idx} should match`);
});

Limitations and Future Work

Current replay system:

Scope: Checks recorded metadata consistency, not re-computed routing outcomes
Policy input: Compares against recorded policy versions only
Outcome: Does not include outcome feedback (success/failure)

Future enhancements:

Outcome-aware evaluation: Factor in whether decisions succeeded or failed
True policy replay: Recompute routing decisions from evidence inputs and compare selected targets/outcomes
A/B testing: Compare two policies head-to-head with statistical significance
Regression detection: Automatic flagging of decisions that would have regressed
Replay optimizations: Cache results for faster iteration

Troubleshooting

No evidence found for session ID

Verify the session ID matches what's in your logs
Check log file path is correct: ls -la ~/.model-switchboard/switchboard-turns.ndjson
Ensure the session has run at least one turn
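To verify which session IDs are present, you can tally them straight from the log text. This sketch assumes each NDJSON line is a JSON object with a sessionId field, which is a guess at the log schema, and the helper name countSessionIds is hypothetical. It works on a string so it can be tried without a real log.

```javascript
// Summarize which session IDs appear in NDJSON log text.
// Assumes each line is a JSON object with a `sessionId` field (hypothetical).
function countSessionIds(ndjsonText) {
  const counts = new Map();
  for (const line of ndjsonText.split('\n')) {
    if (line.trim() === '') continue;
    let record;
    try {
      record = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    const id = record?.sessionId ?? '(missing sessionId)';
    counts.set(id, (counts.get(id) ?? 0) + 1);
  }
  return counts;
}

// Made-up log lines for illustration only.
const sampleLog = [
  '{"sessionId":"my-session-123","turn":1}',
  '{"sessionId":"my-session-123","turn":2}',
  'not json',
  '{"sessionId":"other-session","turn":1}'
].join('\n');

console.log(Object.fromEntries(countSessionIds(sampleLog)));
// { 'my-session-123': 2, 'other-session': 1 }
```

For a real log, pass in the file contents, for example the result of fs.readFileSync(logPath, 'utf8').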

All decisions marked as mismatch

Policy version in evidence may not match what you're comparing against
Check that policy changes are deployed
Verify targets registry matches what was available during original run

Unexpected switching reasons

Legacy log entries may not have full attribution data
Replay on M4+ evidence (generated after this milestone) for accurate attribution
