The Weird AI Test Museum

Live Museum | License: MIT | Static HTML | Updated June 2026

[Image: The Weird AI Test Museum title and introduction]

A curated, visual collection of memorable AI tests: spaghetti physics, hand rendering, language tricks, long-horizon games, citation checks, audio robustness, and research benchmarks.

This project is intentionally part museum, part field guide, and part internet culture archive. It is made for fun and curiosity while AI develops unusually fast. The exhibits are snapshots, not permanent verdicts on any model or company.

Explore

Open the museum

Repository: github.com/eudk/weird-ai-test-museum

Inside the museum

[Image: Museum navigation and the Will Smith Eating Spaghetti exhibit]

  • Headliners: memorable tests that escaped research circles and became internet culture.
  • Image stress tests: hands, text rendering, spatial relationships, and compositional reasoning.
  • Language tricks: counting, constraint following, ambiguity, and deceptively simple prompts.
  • Game tests: long-horizon planning, memory, exploration, and the effect of agent harnesses.
  • Video and audio: physics, identity preservation, synchronization, noisy-scene reasoning, and voice cloning.
  • Real-world stress tests: cases such as the widely reported 18,000-water-cups drive-through request.
  • Formal benchmarks: ARC-AGI, Humanity's Last Exam, SWE-Bench Pro, BrowseComp, OSWorld, Terminal-Bench, APEX-Agents, MCP-Atlas, MMMU-Pro, EVMbench, and others.
  • Dated model snapshot: a fold-out comparison of selected published June 2026 results with tool and harness caveats.

What this project is

The museum explains what each test is trying to reveal, why people found it memorable, and where to read more. Sources are linked directly from the exhibits.

It does not:

  • declare one model universally best;
  • treat a viral demo as a controlled scientific benchmark;
  • assume an old score still describes the current frontier;
  • present capability labels as permanent limits.

AI evaluation depends on the model version, date, tools, prompts, scaffolding, retries, scoring method, and dataset version. Numbers without that context age badly.
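One way to keep that context attached to a number is to store them together. The following is a minimal sketch, not something the museum itself ships: the `ScoreSnapshot` class, its field names, and the example values are all hypothetical and only mirror the factors listed above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ScoreSnapshot:
    """A single published result plus the context needed to read it.

    Hypothetical schema for illustration; the field names are assumptions.
    """
    benchmark: str
    model_version: str
    score: float
    recorded_on: date
    dataset_version: str
    scoring_method: str
    prompt: str
    tools: tuple[str, ...] = ()
    scaffolding: str = "none"
    retries: int = 0

    def age_in_days(self, today: date) -> int:
        """Number of days between the recording date and `today`."""
        return (today - self.recorded_on).days

    def label(self) -> str:
        """Render the score together with its context on one line."""
        tools = ", ".join(self.tools) if self.tools else "no tools"
        return (
            f"{self.benchmark}: {self.score} "
            f"({self.model_version}, {self.recorded_on.isoformat()}, "
            f"{tools}, dataset {self.dataset_version})"
        )


# Placeholder values only; these are not real results.
example = ScoreSnapshot(
    benchmark="example-benchmark",
    model_version="example-model-v1",
    score=0.5,
    recorded_on=date(2026, 6, 1),
    dataset_version="v1",
    scoring_method="exact match",
    prompt="zero-shot",
)
print(example.label())
```

Freezing the dataclass keeps a snapshot from being edited after the fact, which fits the idea that an exhibit records a dated result and not a running verdict.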

Inspiration

This project was inspired in part by the excellent BenchLM.ai AI Benchmarks Directory. Its broad catalog helped shape the museum's formal benchmark shelf and encouraged the mix of benchmark categories represented here.

The individual exhibits also draw from primary papers, official benchmark sites, reputable reporting, public leaderboards, and the wonderfully strange informal tests that spread through AI culture.

Memorable informal tests and formal benchmarks serve different purposes:

| Type | Useful for |
|---|---|
| Informal regression tests | Making visible inconsistencies easy to recognize and discuss |
| Real-world incidents | Revealing deployment, guardrail, and human-handoff problems |
| Formal benchmarks | Producing controlled, repeatable comparisons under stated conditions |

Run locally

No build step or dependency installation is required. On Windows, open the page from PowerShell in the project folder:

Start-Process .\index.html

You can also open index.html directly in any modern browser.
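On systems without PowerShell, the same thing can be done with the Python standard library. This is a small sketch that assumes it is run from the project folder containing index.html; the helper names `index_url` and `open_museum` are made up for this example and are not part of the repository.

```python
import webbrowser
from pathlib import Path


def index_url(repo_root: str = ".") -> str:
    """Build a file:// URL pointing at index.html inside repo_root."""
    return (Path(repo_root).resolve() / "index.html").as_uri()


def open_museum(repo_root: str = ".") -> bool:
    """Ask the system default browser to open the museum page."""
    return webbrowser.open(index_url(repo_root))
```

Calling `open_museum()` from the project folder asks the default browser to load the page. Because the site is a single static HTML document, a file URL is enough and no local web server is needed.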

Project structure

AITestMuseum/
|-- assets/
|   |-- examples.png
|   `-- name.png
|-- index.html
|-- LICENSE
`-- README.md

The site is a single responsive HTML document with embedded CSS and JavaScript. Google Fonts are loaded from the web; the rest of the project is static.

Updating exhibits

When adding or revising an exhibit:

  1. Prefer a primary paper, official benchmark page, or reputable report.
  2. Date volatile scores and identify the exact evaluation setting.
  3. State when a test is informal, anecdotal, or dependent on a custom harness.
  4. Avoid unsupported precision and universal claims.
  5. Keep the tone curious, readable, and honest about uncertainty.

Disclaimer

Independent educational project. Not affiliated with or endorsed by any person, company, model provider, or benchmark creator mentioned.

Linked material and third-party names belong to their respective owners.

License

The original code and project text are available under the MIT License. Third-party names, trademarks, linked material, and referenced media remain the property of their respective owners.


Made by eudk.
