Commit 5eb4460

Merge pull request #25 from ContextLab/003-ux-bugfix-cleanup
UX bugfix sweep: 13 fixes, GP algorithm repair, LaTeX rendering cleanup
2 parents 87a8cc4 + 4b01a98 commit 5eb4460

79 files changed

Lines changed: 2844 additions & 1754 deletions

.gitignore

Lines changed: 13 additions & 0 deletions
@@ -310,5 +310,18 @@ data/videos/.working/audio_cache/
 data/videos/transcripts_raw/
 data/videos/transcripts/
 
+# Session notes (working artifacts)
+notes/
+
+# Test data
+test-import.json
+
+# POC / scratch scripts
+scripts/poc_*
+
+# Speckit tooling
+.specify/
+.claude/commands/
+
 # Python virtual environments
 .venv/

AGENTS.md

Lines changed: 4 additions & 4 deletions
@@ -72,7 +72,7 @@ mapper/
 |--------|-------|
 | Wikipedia articles | ~250,000 (48,259 with coordinates across domains) |
 | Knowledge domains | 50 (flat hierarchy in index.json) |
-| Quiz questions | 2,450 (50 per domain, GPT-5-nano generated) |
+| Quiz questions | 2,450 (50 per domain, Claude Opus 4.6 generated) |
 | Khan Academy videos | 5,044 in catalog |
 | Video transcript windows | 77,408 (512-word sliding windows, 50-word stride) |
 | Transcripts on disk | 5,400+ (.txt files from Whisper) |
@@ -97,7 +97,7 @@ mapper/
 | Loading modal | `src/ui/progress.js` | Centered modal with spinner |
 | **Pipeline** | | |
 | Embed articles | `scripts/generate_embeddings_local_full.py` | 250K articles → 768-dim |
-| Generate questions | `scripts/generate_domain_questions.py` | GPT-5-nano, 50/domain |
+| Generate questions | `scripts/generate_domain_questions.py` | Claude Opus 4.6, 50/domain |
 | Embed questions | `scripts/embed_questions_v2.py` | Same model as articles |
 | Embed transcripts (full) | `scripts/embed_transcripts.py` | One embedding per video |
 | Embed transcripts (windows) | `scripts/embed_video_windows.py` | Sliding windows per video |
@@ -122,7 +122,7 @@ mapper/
 wikipedia.pkl (250K articles, gitignored)
 ↓ generate_embeddings_local_full.py
 embeddings/wikipedia_embeddings.pkl (250K × 768)
-↓ generate_domain_questions.py (GPT-5-nano)
+↓ generate_domain_questions.py (Claude Opus 4.6)
 data/domains/.working/*-questions-batch*.json (50 per domain)
 ↓ embed_questions_v2.py
 embeddings/question_embeddings_2500.pkl (2500 × 768)
@@ -170,7 +170,7 @@ User answers question → estimator updates knowledge map
 - **macOS env vars**: Scripts set `TOKENIZERS_PARALLELISM=false`, `OMP_NUM_THREADS=1`, `MKL_NUM_THREADS=1`
 - **Python venv**: Use `.venv/bin/python3` (not system python) for numpy 2.x compatibility
 - **Embedding model**: `google/embeddinggemma-300m` everywhere (768-dim, SentenceTransformer)
-- **LLM model**: `gpt-5-nano` via OpenAI Batch API for question generation
+- **LLM model**: `Claude Opus 4.6` via Anthropic API for question generation
 - **localStorage**: Browser-side persistence, versioned schema. No server-side storage.
 - **Domain bundles**: Background pre-loaded at boot for instant switching (no loading modal per switch)
 - **Domain viewport**: Read from `registry.getDomain(id).region` (index.json), not from bundle
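
The statistics table in the AGENTS.md diff above describes the transcript windows as 512-word sliding windows with a 50-word stride. A minimal sketch of that windowing follows. It assumes whitespace tokenization, a single short window for transcripts under 512 words, and that trailing words which do not fill a whole window are dropped. None of these choices is confirmed by the diff, and `sliding_windows` is a hypothetical helper, not the code in `scripts/embed_video_windows.py`.

```python
def sliding_windows(text, size=512, stride=50):
    """Split text into overlapping word windows (hypothetical helper).

    Assumptions, not taken from the repository: words are split on
    whitespace, short texts yield one short window, and leftover words
    after the last full window are dropped.
    """
    words = text.split()
    if not words:
        return []
    if len(words) <= size:
        return [" ".join(words)]
    windows = []
    # Each window starts `stride` words after the previous one.
    for start in range(0, len(words) - size + 1, stride):
        windows.append(" ".join(words[start:start + size]))
    return windows
```

With these defaults, consecutive windows share 462 words, which is consistent with a window count (77,408) far larger than the video count (5,044).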

README.md

Lines changed: 14 additions & 12 deletions
@@ -2,7 +2,7 @@
 
 An interactive visualization that maps your conceptual knowledge across 250,000 Wikipedia articles and 5,000+ Khan Academy videos. Answer questions to watch a real-time heatmap of your strengths and gaps emerge, then get personalized video recommendations to fill knowledge gaps.
 
-**[Try the live demo](https://contextlab.github.io/mapper/)** | **[Read the paper](https://psyarxiv.com/dh3q2)**
+**[Try the live demo](https://contextlab.github.io/mapper/)** | **[Read the paper](https://osf.io/preprints/psyarxiv/dh3q2)**
 
 ## How It Works
 
@@ -17,12 +17,14 @@ Under the hood, text embedding models place every article, question, and video t
 ## Features
 
 - **50 knowledge domains** including Physics, Biology, Mathematics, Computer Science, Philosophy, and more
-- **2,450 adaptive quiz questions** generated via GPT-5-nano from Wikipedia source articles
-- **5,000+ Khan Academy videos** with knowledge-gap-based recommendations
-- **Real-time heatmap** powered by radial basis function interpolation
-- **Video trajectories** -- hover a video dot to see its topic path across the map
+- **2,500 adaptive quiz questions** generated via Claude Opus 4.6 from Wikipedia source articles
+- **5,400+ Khan Academy videos** with knowledge-gap-based recommendations
+- **Real-time heatmap** powered by Gaussian Process interpolation with Matern 3/2 kernel
+- **Video discovery panel** -- left sidebar with toggleable video visibility, scrollable list, and map trajectory highlighting
+- **Video trajectories** -- hover a video dot to see its topic path across the map; click to play
 - **Knowledge insights** -- see your strongest/weakest concepts and learning suggestions
 - **Social sharing** -- export your knowledge map as an image with grid lines and colorbar
+- **Keyboard shortcuts** -- press A/B/C/D to answer, with modifier-key awareness to avoid conflicts
 - **Fully client-side** -- no data leaves your browser; progress saved to localStorage
 
 ## Quick Start
@@ -53,7 +55,7 @@ mapper/
 │ ├── domain/ # Domain data loading and registry
 │ ├── learning/ # Adaptive quiz engine + video recommender
 │ ├── state/ # Application state and persistence
-│ ├── ui/ # UI components (controls, quiz, insights, share, video modal)
+│ ├── ui/ # UI components (controls, quiz, insights, share, video panel/modal)
 │ ├── utils/ # Math, accessibility, feature detection
 │ └── viz/ # Canvas rendering (heatmap, minimap, particles)
 ├── data/ # Pre-computed data bundles
@@ -69,7 +71,7 @@ mapper/
 The `scripts/` directory contains the Python pipeline that generates the data powering the frontend:
 
 1. **Embed articles** using `google/embeddinggemma-300m` (768-dim vectors)
-2. **Generate questions** via GPT-5-nano (50 per domain, 2,450 total)
+2. **Generate questions** via Claude Opus 4.6 (50 per domain, 2,450 total)
 3. **Embed questions** using the same model (for coordinate consistency)
 4. **Transcribe videos** via Whisper on GPU cluster (5,400+ Khan Academy transcripts)
 5. **Embed transcripts** -- both full-document and sliding-window (512 words, 50-word stride)
@@ -81,16 +83,16 @@ The `scripts/` directory contains the Python pipeline that generates the data po
 ## Testing
 
 ```bash
-npx vitest run # 75 unit tests (estimator, sampler, recommender)
-npx playwright test # 8 E2E test specs (quiz flow, video recs, sharing)
+npx vitest run # 82 unit tests (estimator, sampler, recommender, stability)
+npx playwright test # 9 E2E test specs (quiz flow, video recs, sharing, edge cases)
 ```
 
 ## Citation
 
 ```bibtex
-@article{manning2025mapper,
-title={Text embedding models yield high-resolution insights into conceptual knowledge},
-author={Manning, Jeremy R},
+@article{fitzpatrick2025mapper,
+title={Text embedding models yield detailed conceptual knowledge maps derived from short multiple-choice quizzes},
+author={Fitzpatrick, Paxton C. and Heusser, Andrew C. and Manning, Jeremy R.},
 year={2025},
 url={https://psyarxiv.com/dh3q2}
 }
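
The README diff above changes the heatmap description from radial basis function interpolation to Gaussian Process interpolation with a Matern 3/2 kernel. As a rough illustration of what that kernel computes, here is the standard Matern 3/2 covariance, k(r) = σ² (1 + √3 r/ℓ) exp(−√3 r/ℓ). The function name, the default `length_scale`, and the default `variance` are assumptions chosen for illustration; the diff does not show the values or code the project's estimator uses.

```python
import math


def matern32(distance, length_scale=1.0, variance=1.0):
    """Matern 3/2 covariance between two points `distance` apart.

    `length_scale` and `variance` are illustrative defaults, not the
    project's settings.
    """
    if distance < 0 or length_scale <= 0:
        raise ValueError("distance must be >= 0 and length_scale > 0")
    scaled = math.sqrt(3.0) * distance / length_scale
    return variance * (1.0 + scaled) * math.exp(-scaled)
```

The covariance equals `variance` at zero distance and decays smoothly as points move apart, so an answered question influences nearby map locations more than distant ones.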

data/domains/.working/computer-science-questions-batch1.json

Lines changed: 50 additions & 50 deletions
@@ -1,128 +1,128 @@
 [
 {
-"question_text": "An operating system manages computer hardware and software resources. What is generally considered the first operating system, created in 1956 for the IBM 704?",
-"correct_answer": "GM-NAA I/O, developed by General Motors and North American Aviation",
+"question_text": "What is an operating system in the context of computing?",
+"correct_answer": "System software that manages computer hardware and software resources and provides common services for programs",
 "distractors": [
-"BESYS, developed by Bell Labs for the IBM 7090",
-"IBSYS, developed by IBM for its 7090 and 7094 mainframes",
-"CTSS, developed at MIT for the IBM 7094"
+"A physical circuit board that coordinates electrical signals between the CPU and memory chips",
+"A programming language used to write applications and compile them into machine code",
+"A network protocol that routes data packets between computers connected to the internet"
 ],
 "difficulty": 1,
 "source_article": "Operating system",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["operating system"]
 },
 {
-"question_text": "The Internet originated from ARPANET. On October 29, 1969, the first ARPANET message was intended to be 'login,' but the system crashed. What letters were actually transmitted?",
-"correct_answer": "Only 'L' and 'O' were sent before the system crashed",
+"question_text": "What is the Internet?",
+"correct_answer": "The global system of interconnected computer networks that uses the TCP/IP protocol suite to communicate between networks and devices",
 "distractors": [
-"Only the letter 'L' was sent before the connection timed out",
-"The full word 'log' was transmitted before a buffer overflow occurred",
-"Only 'L', 'O', and 'G' were sent before a routing error stopped the transfer"
+"A single centralized server maintained by an international organization that hosts all publicly accessible websites worldwide",
+"A local wireless connection standard that links personal devices within a single home or office building",
+"A satellite broadcasting system that transmits television and radio signals to receivers located around the world"
 ],
 "difficulty": 1,
 "source_article": "Internet",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["internet"]
 },
 {
-"question_text": "A database is an organized collection of data managed by a DBMS. What does DBMS stand for, and what is its primary role?",
-"correct_answer": "Database Management System; it manages storage, retrieval, and administration of data",
+"question_text": "In computing, what is a database?",
+"correct_answer": "An organized collection of data stored electronically and managed through a database management system (DBMS)",
 "distractors": [
-"Data Block Management Service; it handles physical disk allocation and defragmentation",
-"Digital Backup and Migration System; it automates data replication across servers",
-"Database Monitoring and Security system; it provides real-time intrusion detection for stored data"
+"A physical filing cabinet system used to archive printed documents and paper records in offices",
+"A type of spreadsheet application designed exclusively for calculating and displaying financial records",
+"A backup power supply unit that preserves electronically stored data during unexpected electrical outages"
 ],
 "difficulty": 1,
 "source_article": "Database",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["database"]
 },
 {
-"question_text": "In a computer network, which topology connects every device directly to every other device, providing maximum redundancy but at the highest cabling cost?",
-"correct_answer": "Mesh topology (specifically full mesh topology)",
+"question_text": "What best describes a computer network?",
+"correct_answer": "A group of interconnected computers that communicate and share resources using communication protocols",
 "distractors": [
-"Star topology, where all devices connect through a central hub",
-"Ring topology, where each device connects to exactly two neighbors",
-"Bus topology, where all devices share a single backbone cable"
+"A single computer with multiple monitors attached for displaying different applications on each screen",
+"A software program that connects a computer's internal components such as the CPU and memory",
+"A collection of websites organized into categories and indexed for retrieval by a search engine"
 ],
 "difficulty": 1,
 "source_article": "Computer network",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["computer network"]
 },
 {
-"question_text": "Software is broadly divided into two main categories. Which two categories are they?",
-"correct_answer": "System software (e.g., operating systems) and application software (e.g., word processors)",
+"question_text": "What does the term 'software' refer to in computing?",
+"correct_answer": "Computer programs and associated data that provide instructions telling a computer what to do",
 "distractors": [
-"Compiled software (e.g., C programs) and interpreted software (e.g., Python scripts)",
-"Open-source software (e.g., Linux) and proprietary software (e.g., Windows)",
-"Firmware (e.g., BIOS) and middleware (e.g., database connectors)"
+"The physical electronic components inside a computer such as the processor and memory chips",
+"The electrical wiring and cable connections that link peripheral devices to the motherboard",
+"The metal and plastic casing that protects the internal circuitry of a computer from damage"
 ],
 "difficulty": 1,
 "source_article": "Software",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["software"]
 },
 {
-"question_text": "A computer virus replicates by inserting its code into other programs. What was the name of the first known computer virus, created by Bob Thomas at BBN Technologies in 1971?",
-"correct_answer": "Creeper, which spread across ARPANET displaying 'I'm the creeper, catch me if you can!'",
+"question_text": "What distinguishes a computer virus from other types of malicious software?",
+"correct_answer": "It replicates by modifying other computer programs and inserting its own code into those programs",
 "distractors": [
-"Brain, which infected IBM PC boot sectors and was created in Pakistan",
-"Elk Cloner, which spread via Apple II floppy disks and displayed a poem",
-"Morris Worm, which exploited Unix sendmail vulnerabilities to replicate across the Internet"
+"It encrypts all files on a hard drive and demands a cryptocurrency ransom payment to restore access",
+"It monitors keyboard input to secretly record passwords and send them to a remote attacker",
+"It spreads independently across networks without needing to attach itself to any existing programs"
 ],
 "difficulty": 1,
 "source_article": "Computer virus",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["computer virus"]
 },
 {
-"question_text": "According to NIST Special Publication 800-145, cloud computing has five essential characteristics. Which of the following is one of those five characteristics?",
-"correct_answer": "On-demand self-service, allowing users to provision computing resources automatically without human interaction",
+"question_text": "What is cloud computing?",
+"correct_answer": "The on-demand delivery of computing resources such as servers, storage, and applications over the internet",
 "distractors": [
-"Guaranteed uptime of 99.999%, ensuring continuous availability under all conditions",
-"Mandatory data encryption at rest, requiring all stored data to be encrypted by default",
-"Automatic geographic redundancy, replicating all data across at least three continents"
+"A weather prediction system that uses networked satellites to model atmospheric conditions and forecast storms",
+"A method of storing files exclusively on portable USB flash drives for secure offline access anywhere",
+"A local area network setup where all desktop computers share a single centralized physical hard drive"
 ],
 "difficulty": 1,
 "source_article": "Cloud computing",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["cloud computing"]
 },
 {
-"question_text": "Computer hardware refers to the physical components of a computer. Which component is often called the 'brain' of the computer because it executes instructions and performs calculations?",
-"correct_answer": "The central processing unit (CPU)",
+"question_text": "What does the term 'computer hardware' refer to?",
+"correct_answer": "The physical components of a computer, such as the CPU, RAM, motherboard, and storage devices",
 "distractors": [
-"The random-access memory (RAM), which stores running programs",
-"The motherboard, which connects all components together",
-"The graphics processing unit (GPU), which renders visual output"
+"The programs and applications installed on a computer that enable users to perform specific tasks",
+"The set of rules and protocols that govern how data is transmitted between devices on a network",
+"The graphical user interface elements like windows, icons, and menus displayed on a monitor screen"
 ],
 "difficulty": 1,
 "source_article": "Computer hardware",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["computer hardware"]
 },
 {
-"question_text": "A file system governs how data is organized and accessed on storage media. Which file system, introduced with Windows NT in 1993, replaced FAT32 and supports a maximum file size of 16 exabytes?",
-"correct_answer": "NTFS (New Technology File System)",
+"question_text": "In computing, what is a file system?",
+"correct_answer": "A method used by an operating system to organize, store, and retrieve files on a storage device",
 "distractors": [
-"ext4 (Fourth Extended File System), the default for most Linux distributions",
-"HFS+ (Hierarchical File System Plus), used by macOS before APFS",
-"ZFS (Zettabyte File System), originally developed by Sun Microsystems"
+"An antivirus program that scans documents for malicious code before allowing them to be opened",
+"A cloud-based service that automatically backs up all user files to a remote internet server",
+"A hardware component inside a hard drive that physically reads and writes data to the disk platters"
 ],
 "difficulty": 1,
 "source_article": "File system",
 "domain_ids": ["computer-science"],
 "concepts_tested": ["file system"]
 },
 {
-"question_text": "According to the NIST Digital Identity Guidelines, what are the two parties involved in password-based authentication called?",
-"correct_answer": "The claimant (who holds the password) and the verifier (who checks the identity)",
+"question_text": "In computer security, what is a password?",
+"correct_answer": "A secret string of characters used to authenticate a user's identity and grant access to a system",
 "distractors": [
-"The principal (who requests access) and the guardian (who grants permissions)",
-"The authenticator (who provides credentials) and the arbiter (who validates them)",
-"The supplicant (who submits the secret) and the gatekeeper (who controls entry)"
+"A hardware security token that generates a unique radio frequency signal to physically unlock a device",
+"A biometric fingerprint scan stored on an encrypted chip embedded inside the computer's motherboard",
+"An encrypted file containing the user's complete browsing history and all saved website preferences"
 ],
 "difficulty": 1,
 "source_article": "Password",
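
Every question object visible in this batch file carries the same seven keys. A small validator along the following lines could catch malformed entries before they reach the embedding step. This is only a sketch: the field types and the two distractor rules (exactly three, none equal to the correct answer) are assumptions inferred from the entries shown above, and `validate_question` is a hypothetical helper rather than part of the repository.

```python
# Field names are taken from the diff; the expected types are inferred.
REQUIRED_FIELDS = {
    "question_text": str,
    "correct_answer": str,
    "distractors": list,
    "difficulty": int,
    "source_article": str,
    "domain_ids": list,
    "concepts_tested": list,
}


def validate_question(question):
    """Return a list of problems found in one question dict (empty if none)."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in question:
            problems.append(f"missing field: {field}")
        elif not isinstance(question[field], expected_type):
            problems.append(f"wrong type for field: {field}")
    distractors = question.get("distractors")
    if isinstance(distractors, list):
        # Assumed rule: three distractors, as in every entry shown above.
        if len(distractors) != 3:
            problems.append("distractors: expected exactly 3 entries")
        if question.get("correct_answer") in distractors:
            problems.append("distractors: correct answer is repeated")
    return problems
```

A batch file could then be checked by loading it with `json.load` and calling `validate_question` on each element of the resulting list.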

data/domains/.working/economics-questions-batch3.json

Lines changed: 1 addition & 1 deletion
@@ -13,7 +13,7 @@
 "concepts_tested": ["classical economics"]
 },
 {
-"question_text": "In Kahneman and Tversky's prospect theory (1979), what key feature of the value function explains why the pain of losing $100 is felt more intensely than the pleasure of gaining $100?",
+"question_text": "In Kahneman and Tversky's prospect theory (1979), what key feature of the value function explains why the pain of losing \\$100 is felt more intensely than the pleasure of gaining \\$100?",
 "correct_answer": "Loss aversion, meaning the value function is steeper for losses than for gains",
 "distractors": [
 "Diminishing sensitivity, meaning each additional dollar of gain matters less",

data/domains/.working/theory-of-computation-questions-batch2.json

Lines changed: 1 addition & 1 deletion
@@ -104,7 +104,7 @@
 "concepts_tested": ["NP-completeness"]
 },
 {
-"question_text": "The P versus NP problem, one of the seven Clay Millennium Prize Problems worth $1 million, asks whether every problem whose solution can be verified in polynomial time can also be what?",
+"question_text": "The P versus NP problem, one of the seven Clay Millennium Prize Problems worth \\$1 million, asks whether every problem whose solution can be verified in polynomial time can also be what?",
 "correct_answer": "Solved in polynomial time",
 "distractors": [
 "Verified in logarithmic time",
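
The two one-line changes above (economics and theory of computation) both replace a bare `$` before a number with `\\$` in the JSON source, which decodes to `\$` in the question string. This presumably stops a `$...$` math renderer from treating currency amounts as LaTeX delimiters, which fits the "LaTeX rendering cleanup" in the commit title, though the diff does not name the renderer. A minimal sketch of applying the same escape automatically follows. The "dollar sign followed by a digit" heuristic and the helper name are assumptions, and the heuristic would also escape genuine math that begins with a digit, such as `$2x$`.

```python
import re

# A dollar sign that is not already escaped and is followed by a digit.
_CURRENCY_DOLLAR = re.compile(r"(?<!\\)\$(?=\d)")


def escape_currency_dollars(text):
    """Prefix currency-style dollar signs with a backslash (hypothetical helper)."""
    return _CURRENCY_DOLLAR.sub(lambda match: "\\$", text)
```

Because of the negative lookbehind, running the helper on text that is already escaped leaves it unchanged. Serializing the result with `json.dumps` then doubles the backslash, giving the `\\$` form seen in the diff.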
