Human vs AI Agreement for Cognitive Error Classification
A research tool for classifying cognitive error types in MCQ distractors and measuring inter-rater agreement using Cohen's Kappa (κ).
Every wrong answer a student gives reveals something different about how they think—but most educational software only sees "wrong."
ConfusionMapper classifies MCQ distractors into four cognitive error types from the Confusion Fingerprint Taxonomy:
| Code | Error Type |
|---|---|
| RF | Recall Failure |
| PK | Partial Knowledge |
| CF | Confabulation |
| INT | Interference |
A human researcher and ChatGPT independently classify each distractor. The system then computes Cohen's Kappa (κ), a chance-corrected measure of inter-rater agreement widely used in psychology, education, and cognitive science.
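As a rough illustration of the statistic, Cohen's Kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. The sketch below is not taken from the ConfusionMapper source; the function name `cohen_kappa` and the sample labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(human, ai):
    """Cohen's kappa for two equal-length sequences of labels (hypothetical helper)."""
    if len(human) != len(ai) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    # Observed agreement: share of distractors given the same label by both raters.
    p_o = sum(h == a for h, a in zip(human, ai)) / n
    # Chance agreement: product of the two raters' marginal frequencies, summed over labels.
    h_counts, a_counts = Counter(human), Counter(ai)
    p_e = sum(h_counts[label] * a_counts[label] for label in h_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Illustrative labels only, not study data.
human = ["RF", "PK", "CF", "INT", "RF", "PK", "CF", "INT"]
ai = ["RF", "PK", "CF", "INT", "RF", "PK", "INT", "CF"]
print(round(cohen_kappa(human, ai), 3))  # 0.667
```

Here the raters agree on 6 of 8 items (p_o = 0.75) and chance agreement is 0.25, giving κ ≈ 0.667.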
The results are visualized through:
- Human vs AI Agreement Gauge
- Cohen's Kappa Reliability Score
- 4×4 Confusion Matrix
- Per-Type Agreement Statistics
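A minimal sketch of how the 4×4 confusion matrix and per-type statistics listed above could be tabulated. It assumes rows hold the human label and columns the AI label, and it defines per-type agreement as the share of human-labelled items of a type that the AI labelled the same way. Both choices, and the function names, are assumptions rather than the tool's documented behaviour.

```python
LABELS = ["RF", "PK", "CF", "INT"]


def confusion_matrix(human, ai, labels=LABELS):
    """Count label pairs; rows are human labels, columns are AI labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for h, a in zip(human, ai):
        matrix[index[h]][index[a]] += 1
    return matrix


def per_type_agreement(matrix, labels=LABELS):
    """Share of each human-assigned type that the AI labelled identically."""
    stats = {}
    for i, label in enumerate(labels):
        row_total = sum(matrix[i])
        # None marks a type the human rater never used.
        stats[label] = matrix[i][i] / row_total if row_total else None
    return stats


# Illustrative labels only, not study data.
human = ["RF", "PK", "CF", "INT", "RF", "PK", "CF", "INT"]
ai = ["RF", "PK", "CF", "INT", "RF", "PK", "INT", "CF"]
matrix = confusion_matrix(human, ai)
for label, row in zip(LABELS, matrix):
    print(label, row)
print(per_type_agreement(matrix))
```

Off-diagonal cells show which pairs of error types the two raters confuse; in this sample the disagreements fall entirely between CF and INT.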
Human vs AI Agreement: κ = 0.74 (Substantial Agreement)
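The "Substantial Agreement" label matches the commonly cited Landis and Koch (1977) interpretation bands, under which 0.61 to 0.80 counts as substantial. The sketch below assumes the tool follows those bands; the function name `interpret_kappa` is hypothetical.

```python
def interpret_kappa(kappa):
    """Map a kappa value to a Landis and Koch style label (assumed banding)."""
    if kappa > 1:
        raise ValueError("kappa cannot exceed 1")
    if kappa < 0:
        return "Poor Agreement"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost Perfect"),
    ]
    for upper, name in bands:
        if kappa <= upper:
            return f"{name} Agreement"


print(interpret_kappa(0.74))  # Substantial Agreement
```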
A study protocol may require a minimum reliability threshold before research data collection can begin.
ConfusionMapper was developed as the pre-data-collection quality gate for a pre-registered randomized controlled trial involving 108 students at IIT Kanpur's Cognitive Science Department.
The study protocol required κ ≥ 0.70 before student data collection could proceed.
The software was built to operationalize, validate, and document that requirement.
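In code, such a quality gate reduces to a single comparison against the protocol threshold. The following is a sketch only; `KAPPA_THRESHOLD`, `gate_report`, and the report fields are hypothetical names, not the tool's actual interface.

```python
KAPPA_THRESHOLD = 0.70  # minimum kappa stated in the study protocol


def gate_report(kappa, threshold=KAPPA_THRESHOLD):
    """Return a small record of whether the reliability gate is met."""
    passed = kappa >= threshold
    return {
        "kappa": kappa,
        "threshold": threshold,
        "passed": passed,
        "message": (
            "Reliability threshold met; data collection may proceed."
            if passed
            else "Reliability threshold not met; revisit classifications."
        ),
    }


print(gate_report(0.74)["passed"])  # True
```

Returning a record rather than a bare boolean keeps the threshold alongside the result, which makes it easier to document that the requirement was checked.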
Classifies distractors into Recall Failure (RF), Partial Knowledge (PK), Confabulation (CF), and Interference (INT).
- Independent classification by researcher and GPT
- Automatic agreement computation
- Cohen's Kappa calculation
- Reliability interpretation
- Kappa speedometer gauge
- Agreement summary
- Confusion matrix
- Category-level breakdowns
Includes a built-in NCERT question bank.
Generate new MCQs on any topic using the OpenAI API.
Export complete sessions as JSON files containing:
- Questions
- Distractors
- Human labels
- AI labels
- Agreement statistics
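One possible shape for an exported session is sketched below. The actual file schema is not documented here, so every field name (`items`, `human_label`, `ai_label`, `agreement`, and so on) and both helper functions are assumptions made for illustration.

```python
import json


def build_session(items, kappa):
    """Bundle classified distractors with summary agreement statistics."""
    n_agree = sum(1 for item in items if item["human_label"] == item["ai_label"])
    return {
        "items": items,
        "agreement": {
            "n_distractors": len(items),
            "n_agree": n_agree,
            "kappa": kappa,
        },
    }


def session_to_json(session):
    """Serialize a session; ensure_ascii=False keeps characters such as κ readable."""
    return json.dumps(session, ensure_ascii=False, indent=2)


def export_session(session, path):
    """Write the session to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(session_to_json(session))


# Placeholder content only, not real study data.
items = [
    {"question": "Q1", "distractor": "Option B", "human_label": "RF", "ai_label": "RF"},
    {"question": "Q1", "distractor": "Option C", "human_label": "CF", "ai_label": "INT"},
]
print(session_to_json(build_session(items, kappa=0.0)))
```

Keeping each distractor's human and AI labels side by side in one record means the agreement statistics can be recomputed later from the exported file alone.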
Built with Python, Tkinter, the OpenAI API, and JSON, with Cohen's Kappa for statistical analysis.
Most educational systems treat wrong answers as simple mistakes.
ConfusionMapper treats them as cognitive signals.
By identifying why an answer is wrong—and validating those classifications through human-AI agreement—the tool helps researchers study learning, misconceptions, and knowledge representation with greater rigor.
- Expanded confusion taxonomy
- Multi-rater reliability analysis
- Research analytics dashboard
- Classroom-scale deployment
- Integration with Cognivia
- Educational intervention recommendations
Built using skills developed through Stanford Code in Place and applied to ongoing cognitive science research at IIT Kanpur.
MIT License
Detect. Measure. Understand.