Let AI run 100 experiments while you sleep. Autonomous experiment loops for any codebase — not just ML.
Inspired by @karpathy's autoresearch, rebuilt as reusable Claude Code slash commands that work on any project: ML training, API performance, bundle size, startup time, test coverage — anything with a measurable metric.
You write a program.md — a strategy document that tells the agent what to optimize, what files to touch, and what to try. Then you run /autoresearch 8h and go to sleep. The agent:
- Reads your strategy (program.md)
- Modifies code (only the files you allow)
- Runs the experiment
- Evaluates the metric
- Keeps improvements, reverts failures
- Logs everything to results.tsv
- Repeats until time runs out
After an overnight run you wake up to roughly 50-100 experiments (the exact count depends on how long each experiment takes), a clear results log, and your code on the best-performing version.
You don't program Python. You program program.md. That's the whole point.
git clone https://github.com/adsol-digital/autoresearch.git
cd autoresearch
chmod +x install.sh
./install.sh
This copies two slash commands to ~/.claude/commands/, where they are available globally in every project.
To install manually instead:
mkdir -p ~/.claude/commands
cp commands/autoresearch.md ~/.claude/commands/
cp commands/autoresearch-init.md ~/.claude/commands/
Then go to the project you want to optimize:
cd your-project/
Then in Claude Code:
/autoresearch-init
This scans your codebase and generates a program.md tailored to your project. It asks you:
- What metric to optimize
- What command runs the experiment
- Which files the agent can edit
- What constraints to follow
/autoresearch 2h
That's it. The agent enters an autonomous loop for 2 hours.
| Command | Duration |
|---|---|
| /autoresearch 30m | 30 minutes |
| /autoresearch 2h | 2 hours |
| /autoresearch 2h30m | 2 hours 30 minutes |
| /autoresearch overnight | 8 hours |
| /autoresearch all-day | 12 hours |
| /autoresearch | Unlimited (until you interrupt) |
Combine with a run tag:
/autoresearch mar28 2h
This runs for 2 hours on git branch autoresearch/mar28.
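The duration and run-tag arguments above can be modeled with a small parser. This is only an illustrative sketch: the real command is a Markdown prompt interpreted by Claude Code, and the names `parse_duration`, `parse_args`, and `PRESETS` are hypothetical, not part of the project.

```python
import re
from typing import Optional, Tuple

# Named presets taken from the duration table; values are minutes.
PRESETS = {"overnight": 8 * 60, "all-day": 12 * 60}
DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?$")


def parse_duration(token: str) -> Optional[int]:
    """Return minutes for '30m', '2h', '2h30m' or a preset; None otherwise."""
    if token in PRESETS:
        return PRESETS[token]
    match = DURATION_RE.match(token)
    # The pattern also matches the empty string, so require at least one part.
    if not match or not (match.group(1) or match.group(2)):
        return None
    hours = int(match.group(1) or 0)
    minutes = int(match.group(2) or 0)
    return hours * 60 + minutes


def parse_args(args: str) -> Tuple[Optional[str], Optional[int]]:
    """Split the arguments into (run_tag, minutes); minutes=None means unlimited."""
    tag, minutes = None, None
    for token in args.split():
        parsed = parse_duration(token)
        if parsed is not None:
            minutes = parsed
        else:
            tag = token  # anything that is not a duration is treated as the run tag
    return tag, minutes


tag, minutes = parse_args("mar28 2h")
branch = f"autoresearch/{tag}" if tag else None
print(tag, minutes, branch)
```

Under these assumptions, `mar28 2h` yields the tag `mar28`, a budget of 120 minutes, and the branch name `autoresearch/mar28`.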
Each experiment follows this cycle:
Plan -> Implement -> Run -> Evaluate -> Record -> Log -> Adapt -> Loop
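The cycle can be sketched as a time-boxed loop that keeps a change only when the metric improves. This is a simplified model, assuming a lower-is-better metric; `propose` and `run_experiment` are hypothetical callbacks standing in for the agent's planning and the project's run command, and git commits and reverts are left out.

```python
import time
from typing import Callable, List, Optional, Tuple

Row = Tuple[int, str, Optional[float], str]


def run_loop(
    propose: Callable[[int], str],                     # hypothetical: describe the next change
    run_experiment: Callable[[str], Optional[float]],  # hypothetical: metric, or None on crash
    baseline: float,
    budget_seconds: float,
    max_experiments: int = 1000,
) -> Tuple[float, List[Row]]:
    """Run experiments until the time budget is spent; lower metric is better."""
    best = baseline
    log: List[Row] = []
    deadline = time.monotonic() + budget_seconds
    for n in range(1, max_experiments + 1):
        if time.monotonic() >= deadline:
            break  # time is up: stop before starting another experiment
        change = propose(n)
        metric = run_experiment(change)
        if metric is None:
            status = "crash"
        elif metric < best:
            status = "keep"
            best = metric  # later experiments must beat this new best
        else:
            status = "discard"  # in the real loop the change would be reverted
        log.append((n, change, metric, status))
    return best, log


# Simulated run with made-up metric values.
fake_results = iter([1.09, None, 1.10, 1.05])
best, log = run_loop(
    propose=lambda n: f"change {n}",
    run_experiment=lambda change: next(fake_results),
    baseline=1.099,
    budget_seconds=60,
    max_experiments=4,
)
print(best, [row[3] for row in log])
```

Comparing against the best result so far, rather than the original baseline, is what makes each kept change build on the previous ones.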
The agent prints a summary after every experiment:
--- Experiment 14 ---
Change: Reduced n_layer from 12 to 8, increased n_embd to 1024
Hypothesis: Wider-shallower model trains faster in 5min budget
Result: val_bpb = 1.087 (-0.012 vs baseline) -> keep
Best so far: 1.087 (experiment 14)
Total experiments: 14 | Kept: 6 | Discarded: 7 | Crashed: 1
Elapsed: 1h 10m | Remaining: 50m
---
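The "Result" line in that summary combines two steps: pulling the metric out of the run output and comparing it with earlier results. A minimal sketch follows, assuming the run prints the metric as `name = value` or `name: value`; the actual output format depends on your project, and `extract_metric` and `result_line` are hypothetical helper names.

```python
import re
from typing import Optional


def extract_metric(output: str, name: str) -> Optional[float]:
    """Return the last 'name = value' or 'name: value' found in the output."""
    pattern = rf"{re.escape(name)}\s*[:=]\s*(-?\d+(?:\.\d+)?)"
    matches = re.findall(pattern, output)
    return float(matches[-1]) if matches else None


def result_line(name: str, metric: float, baseline: float, best_so_far: float) -> str:
    """Format the result the way the experiment summary shows it (lower is better)."""
    delta = metric - baseline
    verdict = "keep" if metric < best_so_far else "discard"
    return f"Result: {name} = {metric:.3f} ({delta:+.3f} vs baseline) -> {verdict}"


output = "step 500\nval_bpb = 1.091\nstep 1000\nval_bpb = 1.087\n"
metric = extract_metric(output, "val_bpb")
# best_so_far=1.095 is an invented value for illustration.
print(result_line("val_bpb", metric, baseline=1.099, best_so_far=1.095))
```

Taking the last match rather than the first is a design choice here: training-style runs often print the metric several times, and the final value is usually the one that counts.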
When time's up, you get a full session report:
========================================
AUTORESEARCH SESSION COMPLETE
========================================
Duration: 2h 0m
Total experiments: 24
Kept: 8 | Discarded: 14 | Crashed: 2
Best result: val_bpb = 1.052 (experiment 19)
vs baseline: -0.047 (4.3% improvement)
Top 3 improvements:
1. Experiment 19: Cosine LR with warm restarts (-0.015)
2. Experiment 14: Wider-shallower architecture (-0.012)
3. Experiment 7: Increased batch size to 32K (-0.009)
Key findings:
- Width matters more than depth at this compute budget
- LR schedule changes had the biggest single impact
- Optimizer changes (betas) had minimal effect
Branch: autoresearch/mar28
Results: results.tsv (24 rows)
========================================
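The headline numbers in such a report are simple arithmetic over the logged rows. The sketch below assumes a lower-is-better metric and a baseline of 1.099, which is implied by the report (1.052 + 0.047) rather than stated in it; the rows and the `summarize` helper are made up for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def summarize(baseline: float, rows: List[Tuple[Optional[float], str]]) -> Dict[str, object]:
    """Tally statuses and compute the best kept metric relative to the baseline."""
    counts = Counter(status for _, status in rows)
    kept = [metric for metric, status in rows if status == "keep" and metric is not None]
    best = min(kept, default=baseline)   # no kept rows means no improvement
    delta = best - baseline              # negative when the metric went down
    percent = -delta / baseline * 100    # improvement as a share of the baseline
    return {"counts": counts, "best": best, "delta": delta, "percent": percent}


rows = [(1.087, "keep"), (None, "crash"), (1.101, "discard"), (1.052, "keep")]
summary = summarize(1.099, rows)
print(f"Best result: {summary['best']:.3f}")
print(f"vs baseline: {summary['delta']:+.3f} ({summary['percent']:.1f}% improvement)")
```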
This is the core of AutoResearch. It's a Markdown file that tells the agent everything it needs to know. Here's the structure:
# AutoResearch: [Project Name]
## Objective
What metric to optimize, and the run command.
## Editable Files
Which files the agent can modify (and what's in them).
## Locked Files
What NOT to touch.
## Evaluation
How to extract the metric from output.
What counts as an improvement.
## Constraints
Hard rules (no new deps, no API changes, etc).
## Strategy
What to try first. What to avoid.
## Results Format
Column definitions for results.tsv.
The key insight: you iterate on program.md over time. After a session, review what worked, refine the strategy, and run again. The strategy document IS your code.
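Before starting a long run, it can be worth checking that program.md has every section from the structure above. The following is a hypothetical helper, not something the project ships; it assumes the section names appear as `##` headings spelled exactly as in the template.

```python
import re
from typing import List

# Section names copied from the program.md structure.
REQUIRED_SECTIONS = [
    "Objective",
    "Editable Files",
    "Locked Files",
    "Evaluation",
    "Constraints",
    "Strategy",
    "Results Format",
]


def missing_sections(markdown_text: str) -> List[str]:
    """Return the required '## ' headings that do not appear in the text."""
    headings = re.findall(r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE)
    found = set(headings)
    return [name for name in REQUIRED_SECTIONS if name not in found]


draft = """# AutoResearch: My API
## Objective
Minimize avg_latency_ms.
## Strategy
Try caching first.
"""
print(missing_sections(draft))
```

For the draft above, the helper reports the five sections that still need to be written.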
The examples/ directory contains ready-to-use program.md templates:
| Example | Metric | Use Case |
|---|---|---|
| ml-training/ | val_bpb | GPT training optimization (Karpathy-style) |
| web-performance/ | bundle_size_kb | Next.js bundle size reduction |
| api-latency/ | avg_latency_ms | API response time optimization |
| flutter-startup/ | startup_ms | Flutter cold-start time |
Copy any example to your project and customize:
cp examples/api-latency/program.md ~/my-api-project/program.md
┌──────────────────────────────────────────────────┐
│ YOUR PROJECT │
│ │
│ program.md <- You write this (strategy) │
│ results.tsv <- Agent writes this (log) │
│ train.py / app.js <- Agent edits this (code) │
│ │
└──────────────┬───────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────┐
│ CLAUDE CODE AGENT │
│ │
│ /autoresearch 2h │
│ │
│ ┌─────────────────────────────────────────────┐ │
│ │ 1. Read program.md │ │
│ │ 2. Plan experiment (hypothesis) │ │
│ │ 3. Edit code (single-variable change) │ │
│ │ 4. Git commit │ │
│ │ 5. Run command │ │
│ │ 6. Extract metric │ │
│ │ 7. Keep / Revert │ │
│ │ 8. Log to results.tsv │ │
│ │ 9. Check time remaining │ │
│ │ 10. GOTO 1 │ │
│ └─────────────────────────────────────────────┘ │
│ │
│ Every 5: mini-review | Every 10: research log │
│ Time's up: session report + stop │
│ │
└──────────────────────────────────────────────────┘
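Step 8 in the diagram, logging to results.tsv, amounts to appending one tab-separated row per experiment. The sketch below uses Python's standard csv module. The column names are an assumption, since the real columns are whatever your program.md defines under Results Format, and the commit hashes are invented.

```python
import csv
from pathlib import Path
from typing import Dict, Union

# Assumed columns; the real ones come from the Results Format section of program.md.
COLUMNS = ["experiment", "commit", "metric", "status", "description"]


def append_result(path: Union[str, Path], row: Dict[str, object]) -> None:
    """Append one experiment to a TSV log, writing the header on first use."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n"
        )
        if is_new:
            writer.writeheader()
        writer.writerow(row)


if __name__ == "__main__":
    append_result("results.tsv", {
        "experiment": 14,
        "commit": "abc1234",  # invented hash for illustration
        "metric": 1.087,
        "status": "keep",
        "description": "Reduced n_layer from 12 to 8, increased n_embd to 1024",
    })
```

Appending rather than rewriting the file keeps earlier rows intact if a later experiment crashes partway through.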
- One file at a time — Single-variable changes so you know what helped
- Always revert failures — Never build on broken code
- Commit everything — Every experiment is a git commit (your lab notebook)
- Simplicity wins — A tiny improvement + 20 lines of hack? Not worth it
- No new dependencies — Work with what you have
- Time-boxed — Set a budget, get a report, review, iterate
- Claude Code (CLI, desktop app, or IDE extension)
- A project with a measurable metric and a run command
- Git (for experiment tracking)
autoresearch/
├── commands/
│ ├── autoresearch.md # Main experiment loop skill
│ └── autoresearch-init.md # Strategy generator skill
├── examples/
│ ├── ml-training/
│ │ └── program.md # GPT training optimization
│ ├── web-performance/
│ │ └── program.md # Next.js bundle size
│ ├── api-latency/
│ │ └── program.md # API response time
│ └── flutter-startup/
│ └── program.md # Flutter cold-start
├── install.sh # One-line installer
├── LICENSE # MIT
└── README.md # You are here
Inspired by Andrej Karpathy's autoresearch — the original "AI agents running research overnight" concept. This project generalizes the idea beyond ML training into reusable Claude Code slash commands for any codebase.
MIT