A curated, visual collection of memorable AI tests: spaghetti physics, hand rendering, language tricks, long-horizon games, citation checks, audio robustness, and research benchmarks.
This project is intentionally part museum, part field guide, and part internet culture archive. It is made for fun and curiosity while AI develops unusually fast. The exhibits are snapshots, not permanent verdicts on any model or company.
Repository: github.com/eudk/weird-ai-test-museum
- Headliners: memorable tests that escaped research circles and became internet culture.
- Image stress tests: hands, text rendering, spatial relationships, and compositional reasoning.
- Language tricks: counting, constraint following, ambiguity, and deceptively simple prompts.
- Game tests: long-horizon planning, memory, exploration, and the effect of agent harnesses.
- Video and audio: physics, identity preservation, synchronization, noisy-scene reasoning, and voice cloning.
- Real-world stress tests: cases such as the widely reported 18,000-water-cups drive-through request.
- Formal benchmarks: ARC-AGI, Humanity's Last Exam, SWE-Bench Pro, BrowseComp, OSWorld, Terminal-Bench, APEX-Agents, MCP-Atlas, MMMU-Pro, EVMbench, and others.
- Dated model snapshot: a fold-out comparison of selected published June 2026 results with tool and harness caveats.
The museum explains what each test is trying to reveal, why people found it memorable, and where to read more. Sources are linked directly from the exhibits.
It does not:
- declare one model universally best;
- treat a viral demo as a controlled scientific benchmark;
- assume an old score still describes the current frontier;
- present capability labels as permanent limits.
AI evaluation depends on the model version, date, tools, prompts, scaffolding, retries, scoring method, and dataset version. Numbers without that context age badly.
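As a rough illustration of why that context matters, the sketch below keeps a score attached to the settings it was measured under. The `ScoreSnapshot` class, its field names, and the example values are all hypothetical and are not part of this repository. They only show one way a number could travel with its date, tools, and harness.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScoreSnapshot:
    """Hypothetical record pairing a score with its evaluation context."""

    benchmark: str
    model_version: str
    measured_on: date
    score: float
    tools: tuple[str, ...] = ()
    harness: str = "none stated"
    retries: int = 1
    dataset_version: str = "unspecified"

    def caption(self) -> str:
        """Render the score together with the context needed to read it."""
        tools = ", ".join(self.tools) if self.tools else "no tools"
        return (
            f"{self.benchmark} ({self.dataset_version}): {self.score:.1f} "
            f"for {self.model_version} on {self.measured_on.isoformat()}, "
            f"{tools}, harness: {self.harness}, attempts: {self.retries}"
        )


# Placeholder values for illustration only; this is not a real result.
example = ScoreSnapshot(
    benchmark="ExampleBench",
    model_version="example-model-v1",
    measured_on=date(2026, 6, 1),
    score=42.0,
    tools=("browser",),
    harness="custom agent loop",
    retries=3,
    dataset_version="v2",
)
print(example.caption())
```

Making the record immutable (`frozen=True`) is a design choice in this sketch: a snapshot describes one dated measurement, so a newer result would be a new record rather than an overwrite.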
This project was inspired in part by the excellent BenchLM.ai AI Benchmarks Directory. Its broad catalog helped shape the museum's formal benchmark shelf and encouraged the mix of benchmark categories represented here.
The individual exhibits also draw from primary papers, official benchmark sites, reputable reporting, public leaderboards, and the wonderfully strange informal tests that spread through AI culture.
Memorable informal tests and formal benchmarks serve different purposes:
| Type | Useful for |
|---|---|
| Informal regression tests | Making inconsistencies visible and easy to recognize and discuss |
| Real-world incidents | Revealing deployment, guardrail, and human-handoff problems |
| Formal benchmarks | Producing controlled, repeatable comparisons under stated conditions |
No build step or dependency installation is required.
In PowerShell, run `Start-Process .\index.html`. You can also open index.html directly in any modern browser.
AITestMuseum/
|-- assets/
| |-- examples.png
| `-- name.png
|-- index.html
|-- LICENSE
`-- README.md
The site is a single responsive HTML document with embedded CSS and JavaScript. Google Fonts are loaded from the web; the rest of the project is static.
When adding or revising an exhibit:
- Prefer a primary paper, official benchmark page, or reputable report.
- Date volatile scores and identify the exact evaluation setting.
- State when a test is informal, anecdotal, or dependent on a custom harness.
- Avoid unsupported precision and universal claims.
- Keep the tone curious, readable, and honest about uncertainty.
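The checklist above could be approximated by a small pre-submission check. The sketch below assumes an exhibit draft is described as a dictionary. The keys (`source_url`, `kind`, `score`, `score_date`, `evaluation_setting`), the `ALLOWED_KINDS` labels, and the `check_exhibit` function are hypothetical and do not exist in this repository. Tone and unsupported precision still need a human reader.

```python
# Hypothetical labels mirroring the three test types described in this README.
ALLOWED_KINDS = {"informal", "incident", "formal"}


def check_exhibit(exhibit: dict) -> list[str]:
    """Return a list of problems; an empty list means the draft passes."""
    problems = []
    if not exhibit.get("source_url"):
        problems.append("missing source link")
    if exhibit.get("kind") not in ALLOWED_KINDS:
        problems.append("kind must be informal, incident, or formal")
    # Scores are volatile, so a score needs a date and a stated setting.
    if exhibit.get("score") is not None:
        if not exhibit.get("score_date"):
            problems.append("score is undated")
        if not exhibit.get("evaluation_setting"):
            problems.append("score lacks an evaluation setting")
    return problems


# Placeholder draft for illustration only.
draft = {
    "title": "Example exhibit",
    "kind": "formal",
    "source_url": "https://example.com/paper",
    "score": 12.5,
}
for problem in check_exhibit(draft):
    print(problem)
```

Returning every problem at once, rather than stopping at the first, lets a contributor fix a draft in one pass.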
Independent educational project. Not affiliated with or endorsed by any person, company, model provider, or benchmark creator mentioned.
Linked material and third-party names belong to their respective owners.
The original code and project text are available under the MIT License. Third-party names, trademarks, linked material, and referenced media remain the property of their respective owners.
Made by eudk.

