Commit 599f201

Merge pull request #33 from ContextLab/004-persona-user-testing
Persona user testing: audit questions, fix UX bugs, incremental GP
2 parents 247b18a + 9ab8faa commit 599f201

103 files changed

Lines changed: 1844883 additions & 827 deletions

.claude/skills/audit-questions/SKILL.md

Lines changed: 423 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 212 additions & 0 deletions
@@ -0,0 +1,212 @@
# Simulate Persona

Simulate a persona-based user test against the live Knowledge Mapper application.

## Usage

```
/simulate-persona <PERSONA_ID>
```

Example: `/simulate-persona P01` runs the Alex the Tech Reporter simulation.

## Arguments

- `$ARGUMENTS`: Persona ID (P01–P21) or category name (reporter, expert, learner, power-user, pedant, edge-case)

## Pipeline Overview

The simulation runs a 4-phase pipeline (5 phases if issues are found):

1. **Phase 1: Playwright Automation** - Mechanical browser interaction
2. **Phase 2: AI Cognitive Evaluation** - Task agent reads checkpoints + screenshots
3. **Phase 3: Pedant Web Verification** - (Pedant only) Opus agent verifies corrections
4. **Phase 4: Report Assembly** - Compile JSON + Markdown reports
5. **Phase 5: Issue Triage & Fix** - Create GitHub issues, implement fixes, submit PRs

## Execution Steps

### Step 0: Setup

1. Read the persona definition from `tests/visual/personas/definitions.js` and find the persona matching `$ARGUMENTS`
2. Clean any stale working files: delete `tests/visual/.working/personas/{personaId}-*`
3. Verify the dev server is running at `http://localhost:5173/mapper/`
4. Create TodoWrite entries for progress tracking

### Step 1: Playwright Automation (Phase 1)

Run the Playwright test for this persona:

```bash
npx playwright test persona-agents.spec.js -g "Persona: {personaName}"
```

For pedant personas:
```bash
npx playwright test persona-pedant.spec.js -g "Pedant: {personaName}"
```

This produces:
- `tests/visual/.working/personas/{personaId}-checkpoint-{N}.json` for each checkpoint
- `tests/visual/screenshots/personas/{personaId}-checkpoint-{N}.png` for each screenshot

Each checkpoint JSON contains:
```json
{
  "personaId": "P01",
  "checkpointNumber": 1,
  "questionsAnswered": 5,
  "questionsInBatch": [
    {
      "questionId": "abc123",
      "questionText": "...",
      "options": { "A": "...", "B": "...", "C": "...", "D": "..." },
      "correctAnswer": "B",
      "selectedAnswer": "B",
      "wasCorrect": true,
      "difficulty": 2,
      "domainId": "physics",
      "sourceArticle": "..."
    }
  ],
  "screenshotPath": "tests/visual/screenshots/personas/P01-checkpoint-1.png",
  "consoleErrors": [],
  "domainMappedPct": 12,
  "timestamp": 1709352000000
}
```
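
Before handing a checkpoint to an evaluation agent, it may help to sanity-check the file. The following is a minimal sketch, not part of the committed skill: `validateCheckpoint` is a hypothetical helper, the field names are taken from the example JSON above, and the specific checks (two-digit persona IDs, checkpoint numbers starting at 1, a 0 to 100 range for `domainMappedPct`) are assumptions.

```javascript
// Hypothetical helper: returns a list of problems found in one checkpoint object.
// Field names follow the example checkpoint JSON; the checks themselves are assumptions.
function validateCheckpoint(cp) {
  const problems = [];
  if (typeof cp.personaId !== "string" || !/^P\d{2}$/.test(cp.personaId)) {
    problems.push("personaId should look like P01");
  }
  if (!Number.isInteger(cp.checkpointNumber) || cp.checkpointNumber < 1) {
    problems.push("checkpointNumber should be a positive integer");
  }
  if (!Number.isFinite(cp.domainMappedPct) || cp.domainMappedPct < 0 || cp.domainMappedPct > 100) {
    problems.push("domainMappedPct should be a number between 0 and 100");
  }
  if (!Array.isArray(cp.questionsInBatch)) {
    problems.push("questionsInBatch should be an array");
    return problems;
  }
  for (const q of cp.questionsInBatch) {
    const optionKeys = Object.keys(q.options ?? {});
    if (!optionKeys.includes(q.correctAnswer)) {
      problems.push(`${q.questionId}: correctAnswer is not one of the options`);
    }
    if (!optionKeys.includes(q.selectedAnswer)) {
      problems.push(`${q.questionId}: selectedAnswer is not one of the options`);
    }
    // wasCorrect should agree with the two answer fields.
    if (q.wasCorrect !== (q.selectedAnswer === q.correctAnswer)) {
      problems.push(`${q.questionId}: wasCorrect disagrees with the recorded answers`);
    }
  }
  return problems;
}

// Trimmed-down copy of the example checkpoint, for illustration only.
const sampleCheckpoint = {
  personaId: "P01",
  checkpointNumber: 1,
  questionsInBatch: [
    {
      questionId: "abc123",
      options: { A: "...", B: "...", C: "...", D: "..." },
      correctAnswer: "B",
      selectedAnswer: "B",
      wasCorrect: true,
    },
  ],
  domainMappedPct: 12,
};

console.log(validateCheckpoint(sampleCheckpoint));
```

An empty list means the checkpoint passed every check in this sketch; anything else is worth inspecting before spending agent time on the evaluation.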

### Step 2: AI Cognitive Evaluation (Phase 2)

For EACH checkpoint, spawn a Task agent:

**Regular personas (Sonnet 4.6):**
```
Task agent (model: sonnet, subagent_type: general-purpose):
"You are role-playing as {persona.name}. {persona.personality}

Read the checkpoint data at: {checkpointPath}
Read the screenshot at: {screenshotPath}

BEFORE looking at the screenshot, state what you expect the map to look like.
THEN read the screenshot and compare reality to your expectation.

For each question in this batch, evaluate:
- Is the marked answer correct?
- Are the distractors plausible?
- Does the question test meaningful understanding?
- Rate content validity, distractor quality, difficulty, educational value, clarity (1-5 each)

Write your evaluation as JSON to: {evalOutputPath}
Use the AgentEvaluation schema from the data model."
```

**Pedant personas (Opus 4.6):**
```
Task agent (model: opus, subagent_type: general-purpose):
Same as above but with additional instructions:
"If you disagree with any marked answer, use the WebSearch tool to verify.
Search for authoritative sources. Cite the URL.
If web evidence supports your correction: verdict = CORRECTION_VERIFIED
If web evidence confirms original: verdict = ORIGINAL_CONFIRMED
If inconclusive: verdict = INCONCLUSIVE
NEVER hallucinate a correction without web evidence."
```

Each evaluation produces:
- `tests/visual/.working/personas/{personaId}-eval-{N}.json`

#### Category-Specific Evaluation Guidance

**Reporter agents (P01-P03)** should focus on:
- Visual impact: would this screenshot look good in a tech article?
- Question quality for non-expert audience: nothing too obscure
- Polish: no loading spinners, no visual artifacts, smooth gradients
- First impression criteria from expected-outcomes/reporters.json

**Expert agents (P04-P07)** should focus on:
- Answer correctness: use real domain knowledge to verify marked answers
- Difficulty calibration: do questions test conceptual understanding vs trivia?
- Map accuracy: does the green/yellow/red distribution match their expertise profile?
- Distractor quality: all four options should be plausible at first glance

**Learner agents (P08-P11)** should focus on:
- Emotional arc: curiosity → mixed success → insight → continued engagement
- Question diversity: no more than 5 consecutive questions on the same sub-topic
- "Aha moments": identify at least 1 moment where the map reveals something surprising
- Map readability for non-experts: clear color differentiation, intuitive layout
- Self-assessment: "Would I show this to a friend?" and "Did I learn something about myself?"

**Power user agents (P12-P14)** should focus on:
- Estimator stability: no Cholesky errors, NaN, or Infinity values
- Domain-mapped % smooth progression: no jumps >15 percentage points
- Domain switching cleanliness (P13): no state leakage between domains
- Rapid input handling (P14): no dropped answers or visual glitches
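
Two of the power-user checks (no NaN or Infinity, no jump above 15 percentage points) are mechanical enough to pre-compute from the checkpoint files. A rough sketch follows; `checkEstimatorStability` and the shape of the returned issue objects are hypothetical, and only `checkpointNumber` and `domainMappedPct` come from the checkpoint format.

```javascript
// Hypothetical helper: scans checkpoints in order and reports non-finite
// domainMappedPct values and jumps larger than maxJump percentage points.
function checkEstimatorStability(checkpoints, maxJump = 15) {
  const issues = [];
  const ordered = [...checkpoints].sort((a, b) => a.checkpointNumber - b.checkpointNumber);
  let previous = null; // last finite percentage seen
  for (const cp of ordered) {
    const pct = cp.domainMappedPct;
    // NaN and Infinity cannot be stored in JSON; after a write/parse round trip
    // they come back as null, which Number.isFinite also rejects.
    if (!Number.isFinite(pct)) {
      issues.push({ checkpoint: cp.checkpointNumber, kind: "non-finite" });
      continue;
    }
    if (previous !== null && Math.abs(pct - previous) > maxJump) {
      issues.push({ checkpoint: cp.checkpointNumber, kind: "jump", delta: pct - previous });
    }
    previous = pct;
  }
  return issues;
}

const stabilityIssues = checkEstimatorStability([
  { checkpointNumber: 1, domainMappedPct: 12 },
  { checkpointNumber: 2, domainMappedPct: 20 },
  { checkpointNumber: 3, domainMappedPct: 40 }, // +20 points: flagged as a jump
  { checkpointNumber: 4, domainMappedPct: null }, // flagged as non-finite
  { checkpointNumber: 5, domainMappedPct: 45 }, // compared against checkpoint 3
]);
console.log(stabilityIssues);
```

Comparing against the last finite value, rather than the immediately preceding checkpoint, is a design choice in this sketch so that a single bad reading does not hide a jump on either side of it.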

### Step 3: Pedant Web Verification (Phase 3, pedant only)

For any question where the pedant agent flagged `isCorrectAsMarked: false`:

1. Read the eval JSON to find flagged questions
2. If the agent already searched (webVerification.searched = true), the verification is done
3. If not, spawn an additional Opus Task agent with WebSearch tool to verify
4. Write all verified corrections to: `tests/visual/.working/personas/{personaId}-corrections.json`
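
The split between "already verified" and "still needs a search" in the steps above could be sketched as follows. This is illustrative only: the `questionEvaluations` array name is an assumption about the eval JSON layout (the AgentEvaluation schema is not shown here), while `isCorrectAsMarked` and `webVerification.searched` are the fields named in the steps.

```javascript
// Hypothetical helper: partitions flagged questions by whether the pedant
// agent has already run a web search for them.
// ASSUMPTION: the eval JSON exposes its per-question results as `questionEvaluations`.
function partitionFlagged(evaluation) {
  const flagged = (evaluation.questionEvaluations ?? []).filter(
    (q) => q.isCorrectAsMarked === false
  );
  return {
    alreadyVerified: flagged.filter((q) => q.webVerification?.searched === true),
    needsVerification: flagged.filter((q) => q.webVerification?.searched !== true),
  };
}

const partitioned = partitionFlagged({
  questionEvaluations: [
    { questionId: "q1", isCorrectAsMarked: true },
    { questionId: "q2", isCorrectAsMarked: false, webVerification: { searched: true } },
    { questionId: "q3", isCorrectAsMarked: false },
  ],
});
console.log(partitioned.needsVerification.map((q) => q.questionId));
```

Only the questions in `needsVerification` would require spawning the additional Opus agent from step 3.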

### Step 4: Report Assembly (Phase 4)

1. Read all checkpoint JSONs and evaluation JSONs from `tests/visual/.working/personas/`
2. Compile the PersonaReport:
   - Concatenate all belief narratives into experience summary
   - Collect all question evaluations into question audit
   - Collect all issues, sort by severity
   - Determine result: PASS / FAIL / AMBIGUOUS per spec criteria
3. Write outputs:
   - `tests/visual/reports/{personaId}-report.json` (machine-readable)
   - `tests/visual/reports/{personaId}-report.md` (human-readable)
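
The "collect all issues, sort by severity" part of the compile step might look like the sketch below. The four severity labels are the ones this document mentions (blocker, major, minor, cosmetic); the `issues` array on each evaluation and the `severity` field name are assumptions, and `collectIssues` is a hypothetical name.

```javascript
// Severity labels mentioned in this document, most severe first.
const SEVERITY_ORDER = ["blocker", "major", "minor", "cosmetic"];

// Hypothetical helper: gathers issues from every evaluation and sorts them by severity.
// ASSUMPTION: each evaluation carries an `issues` array whose items have a `severity` field.
function collectIssues(evaluations) {
  const rank = (severity) => {
    const index = SEVERITY_ORDER.indexOf(severity);
    return index === -1 ? SEVERITY_ORDER.length : index; // unknown labels sort last
  };
  return evaluations
    .flatMap((evaluation) => evaluation.issues ?? [])
    .sort((a, b) => rank(a.severity) - rank(b.severity));
}

const sortedIssues = collectIssues([
  { issues: [{ severity: "minor", title: "Legend overlaps map" }] },
  { issues: [{ severity: "blocker", title: "Estimator returned NaN" }, { severity: "cosmetic", title: "Uneven padding" }] },
  {}, // an evaluation with no issues
]);
console.log(sortedIssues.map((issue) => issue.severity));
```

Because `Array.prototype.sort` is stable, issues of equal severity keep the checkpoint order in which they were collected.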

### Step 5: Issue Triage & Fix (Phase 5, if issues found)

For each blocker or major issue discovered:

1. Create a GitHub issue on the feature branch describing the problem
2. Spawn a Task agent to investigate and implement a fix
3. Verify the fix by re-running the affected checkpoint
4. Submit the fix as a commit on the `004-persona-user-testing` branch

## Resume from Checkpoint

If context runs out mid-simulation:

1. Check `tests/visual/.working/personas/` for existing files
2. Find the highest checkpoint number with a corresponding eval file
3. Resume from the next unevaluated checkpoint
4. The Playwright test only needs to re-run if checkpoint data files are missing
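
The resume rule above can be sketched as a pure function over the file names found in the working directory. `nextCheckpointToEvaluate` is a hypothetical helper; it assumes the `{id}-checkpoint-{N}.json` and `{id}-eval-{N}.json` naming used in this document and a persona ID without regex metacharacters.

```javascript
// Hypothetical helper: given the file names in the working directory, returns the
// checkpoint number to resume from, or null if there is nothing left to evaluate.
function nextCheckpointToEvaluate(fileNames, personaId) {
  const numbersFor = (kind) => {
    const pattern = new RegExp(`^${personaId}-${kind}-(\\d+)\\.json$`);
    return fileNames
      .map((name) => pattern.exec(name))
      .filter((match) => match !== null)
      .map((match) => Number(match[1]));
  };
  const checkpoints = numbersFor("checkpoint").sort((a, b) => a - b);
  const evaluated = new Set(numbersFor("eval"));
  // Highest checkpoint that already has a matching eval file.
  const done = checkpoints.filter((n) => evaluated.has(n));
  const highestDone = done.length > 0 ? done[done.length - 1] : -Infinity;
  // Resume from the first checkpoint after it.
  const next = checkpoints.find((n) => n > highestDone);
  return next === undefined ? null : next;
}

const resumeFrom = nextCheckpointToEvaluate(
  [
    "P01-checkpoint-1.json",
    "P01-checkpoint-2.json",
    "P01-checkpoint-3.json",
    "P01-eval-1.json",
    "P01-eval-2.json",
    "P02-checkpoint-1.json", // other personas are ignored
  ],
  "P01"
);
console.log(resumeFrom);
```

A `null` result means either every checkpoint is evaluated or no checkpoint files exist; in the second case the Playwright phase would need to run again.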

## Working File Conventions

All intermediate files live in `tests/visual/.working/personas/`; the phase 4 reports are written to `tests/visual/reports/` (see Step 4):

| Pattern | Phase | Description |
|---------|-------|-------------|
| `{id}-checkpoint-{N}.json` | 1 | Playwright automation output |
| `{id}-eval-{N}.json` | 2 | AI agent evaluation |
| `{id}-corrections.json` | 3 | Pedant verified corrections |
| `{id}-report.json` | 4 | Final compiled report |
| `{id}-report.md` | 4 | Human-readable report |

## Pass/Fail Criteria

- **PASS**: All checkpoints met expectations. No blocker/major issues. Positive experience summary. ≤10% low-quality questions.
- **FAIL**: Any blocker issue (crash, estimator collapse, wrong map). Negative experience summary. >25% problematic questions.
- **AMBIGUOUS**: Only minor/cosmetic issues but mixed feelings. Small but consistent expectation-reality gaps. Requires human review.
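
One possible reading of these criteria as code is sketched below. It is an interpretation, not the spec: it assumes any single FAIL condition is enough to fail, that PASS requires all of its conditions at once, and that everything in between (including runs with major but no blocker issues) falls to AMBIGUOUS. The input field names are hypothetical.

```javascript
// Hypothetical classifier for a compiled persona report.
// ASSUMPTIONS: FAIL conditions are OR-ed, PASS conditions are AND-ed,
// and sentiment is one of "positive", "negative", or "mixed".
function classifyResult({ allCheckpointsMet, blockerCount, majorCount, sentiment, problematicFraction }) {
  if (blockerCount > 0 || sentiment === "negative" || problematicFraction > 0.25) {
    return "FAIL";
  }
  if (allCheckpointsMet && majorCount === 0 && sentiment === "positive" && problematicFraction <= 0.10) {
    return "PASS";
  }
  return "AMBIGUOUS"; // requires human review
}

console.log(
  classifyResult({
    allCheckpointsMet: true,
    blockerCount: 0,
    majorCount: 0,
    sentiment: "mixed",
    problematicFraction: 0.15,
  })
);
```

The 10% to 25% band has no explicit verdict in the criteria, which is why this sketch routes it to human review.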

## Persona Categories Quick Reference

| Category | IDs | Model | Checkpoint Interval | Special |
|----------|-----|-------|---------------------|---------|
| Reporter | P01-P03 | Sonnet | 4-5 | First impressions |
| Expert | P04-P07 | Sonnet | 5 | Domain expertise verification |
| Learner | P08-P11 | Sonnet | 5 | Emotional arc, aha moments |
| Power User | P12-P14 | Sonnet | 10-20 | Stress test, stability |
| Edge Case | P15-P18 | Sonnet | 8-10 | Feature-specific testing |
| Pedant | P19-P21 | Opus | 1 (every Q) | Web-verified corrections |
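
For orchestration code, the table can be expressed as a small lookup from persona ID to category and model. This is a sketch only: `categoryFor` is a hypothetical helper, the category slugs reuse the names accepted by `$ARGUMENTS`, and the lowercase model strings mirror the `model:` values in the Task agent templates.

```javascript
// ID ranges and models copied from the quick-reference table.
const PERSONA_CATEGORIES = [
  { name: "reporter", first: 1, last: 3, model: "sonnet" },
  { name: "expert", first: 4, last: 7, model: "sonnet" },
  { name: "learner", first: 8, last: 11, model: "sonnet" },
  { name: "power-user", first: 12, last: 14, model: "sonnet" },
  { name: "edge-case", first: 15, last: 18, model: "sonnet" },
  { name: "pedant", first: 19, last: 21, model: "opus" },
];

// Hypothetical helper: maps an ID such as "P20" to its category entry, or null.
function categoryFor(personaId) {
  const match = /^P(\d{2})$/.exec(personaId);
  if (match === null) return null;
  const number = Number(match[1]);
  return PERSONA_CATEGORIES.find((c) => number >= c.first && number <= c.last) ?? null;
}

console.log(categoryFor("P20"));
```

Keeping the ranges in one table-like constant makes it easy to compare against the Markdown table when personas are added.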

.gitignore

Lines changed: 13 additions & 0 deletions
@@ -233,7 +233,15 @@ embeddings/article_coords_flat.pkl
 embeddings/question_coords.pkl
 embeddings/question_coords_flat.pkl
 embeddings/transcript_coords.pkl
+embeddings/transcript_coords_flat.pkl
+embeddings/window_coords_flat.pkl
+embeddings/umap_article_coords.pkl
+embeddings/umap_question_coords.pkl
+embeddings/umap_transcript_coords.pkl
+embeddings/umap_window_coords.pkl
 embeddings/article_registry.pkl
+embeddings/domain_bounding_boxes.json
+embeddings/video_audit_results.json
 *.credentials
 .credentials/
 
@@ -325,3 +333,8 @@ scripts/poc_*
 
 # Python virtual environments
 .venv/
+
+# Persona testing framework (generated output)
+tests/visual/.working/
+tests/visual/reports/
+tests/visual/screenshots/personas/

data/domains/algorithms.json

Lines changed: 1040 additions & 1 deletion
Large diffs are not rendered by default.

data/domains/all.json

Lines changed: 78441 additions & 1 deletion
Large diffs are not rendered by default.

data/domains/archaeology.json

Lines changed: 1041 additions & 1 deletion
Large diffs are not rendered by default.

data/domains/art-history.json

Lines changed: 130756 additions & 1 deletion
Large diffs are not rendered by default.

data/domains/artificial-intelligence-ml.json

Lines changed: 1068 additions & 1 deletion
Large diffs are not rendered by default.

data/domains/asian-history.json

Lines changed: 1137 additions & 1 deletion
Large diffs are not rendered by default.

data/domains/astrophysics.json

Lines changed: 125702 additions & 1 deletion
Large diffs are not rendered by default.
