AutoResearch for Claude Code

Let AI run 100 experiments while you sleep. Autonomous experiment loops for any codebase — not just ML.

Inspired by @karpathy's autoresearch, rebuilt as reusable Claude Code slash commands that work on any project: ML training, API performance, bundle size, startup time, test coverage — anything with a measurable metric.

The Idea

You write a program.md — a strategy document that tells the agent what to optimize, what files to touch, and what to try. Then you run /autoresearch 8h and go to sleep. The agent:

Reads your strategy (program.md)
Modifies code (only the files you allow)
Runs the experiment
Evaluates the metric
Keeps improvements, reverts failures
Logs everything to results.tsv
Repeats until time runs out

You wake up to 50-100 experiments, a clear results log, and your code on the best-performing version.

You don't program Python. You program program.md. That's the whole point.

Quick Start

Install (30 seconds)

git clone https://github.com/adsol-digital/autoresearch.git
cd autoresearch
chmod +x install.sh
./install.sh

This copies two slash commands to ~/.claude/commands/ — available globally in every project.

Or install manually

mkdir -p ~/.claude/commands
cp commands/autoresearch.md ~/.claude/commands/
cp commands/autoresearch-init.md ~/.claude/commands/

Usage

Step 1: Generate your strategy

cd your-project/

Then in Claude Code:

/autoresearch-init

This scans your codebase and generates a program.md tailored to your project. It asks you:

What metric to optimize
What command runs the experiment
Which files the agent can edit
What constraints to follow

Step 2: Run experiments

/autoresearch 2h

That's it. The agent enters an autonomous loop for 2 hours.

Duration Options

| Command | Duration |
| --- | --- |
| `/autoresearch 30m` | 30 minutes |
| `/autoresearch 2h` | 2 hours |
| `/autoresearch 2h30m` | 2 hours 30 minutes |
| `/autoresearch overnight` | 8 hours |
| `/autoresearch all-day` | 12 hours |
| `/autoresearch` | Unlimited (until you interrupt) |

Combine with a run tag:

/autoresearch mar28 2h

This runs for 2 hours on git branch autoresearch/mar28.
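
To make the argument forms above concrete, here is a minimal sketch of how they could be interpreted. The slash command itself is a Markdown prompt, so this is not the project's implementation; `parse_duration` and `parse_args` are hypothetical names, and treating any non-duration token as the run tag is an assumption.

```python
import re

# Named presets taken from the Duration Options table.
PRESETS = {"overnight": 8 * 3600, "all-day": 12 * 3600}

# Matches forms such as "30m", "2h", and "2h30m".
_DURATION_RE = re.compile(r"^(?:(\d+)h)?(?:(\d+)m)?$")


def parse_duration(token):
    """Return the duration in seconds, or None if token is not a duration."""
    if token in PRESETS:
        return PRESETS[token]
    match = _DURATION_RE.match(token)
    if not token or not match:
        return None
    hours, minutes = match.groups()
    return int(hours or 0) * 3600 + int(minutes or 0) * 60


def parse_args(args):
    """Split the arguments into (run_tag, seconds).

    seconds is None when no duration is given (run until interrupted).
    Assumption: any token that is not a duration is the run tag.
    """
    run_tag, seconds = None, None
    for token in args.split():
        parsed = parse_duration(token)
        if parsed is not None:
            seconds = parsed
        else:
            run_tag = token
    return run_tag, seconds
```

Under these assumptions, `parse_args("mar28 2h")` yields the tag `mar28` with a 7200-second budget, and an empty argument string yields no tag and no time limit.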

What Happens During a Run

Each experiment follows this cycle:

Plan -> Implement -> Run -> Evaluate -> Record -> Log -> Adapt -> Loop

The agent prints a summary after every experiment:

--- Experiment 14 ---
Change: Reduced n_layer from 12 to 8, increased n_embd to 1024
Hypothesis: Wider-shallower model trains faster in 5min budget
Result: val_bpb = 1.087 (-0.012 vs baseline) -> keep
Best so far: 1.087 (experiment 14)
Total experiments: 14 | Kept: 6 | Discarded: 7 | Crashed: 1
Elapsed: 1h 10m | Remaining: 50m
---

When time's up, you get a full session report:

========================================
  AUTORESEARCH SESSION COMPLETE
========================================
Duration: 2h 0m
Total experiments: 24
  Kept: 8 | Discarded: 14 | Crashed: 2

Best result: val_bpb = 1.052 (experiment 19)
  vs baseline: -0.047 (4.3% improvement)

Top 3 improvements:
  1. Experiment 19: Cosine LR with warm restarts (-0.015)
  2. Experiment 14: Wider-shallower architecture (-0.012)
  3. Experiment 7: Increased batch size to 32K (-0.009)

Key findings:
  - Width matters more than depth at this compute budget
  - LR schedule changes had the biggest single impact
  - Optimizer changes (betas) had minimal effect

Branch: autoresearch/mar28
Results: results.tsv (24 rows)
========================================
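
The tallies and the improvement figure in a report like this can be derived from results.tsv. The sketch below assumes a hypothetical column layout (`experiment`, `metric`, `status`, `description`), status values of `keep`, `discard`, and `crash`, and a metric where lower is better; the real columns are whatever your program.md defines. The `summarize` function is illustrative, not part of the project.

```python
import csv
import io


def summarize(tsv_text, baseline):
    """Tally outcomes and find the best kept result (lower metric is better)."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    counts = {"keep": 0, "discard": 0, "crash": 0}
    for row in rows:
        counts[row["status"]] += 1

    # Only kept experiments are candidates for the best result;
    # crashed rows may have an empty metric field.
    kept = [row for row in rows if row["status"] == "keep"]
    summary = {"total": len(rows), "counts": counts, "best": None}
    if kept:
        best = min(kept, key=lambda row: float(row["metric"]))
        value = float(best["metric"])
        summary["best"] = {
            "experiment": int(best["experiment"]),
            "metric": value,
            "delta": round(value - baseline, 3),
            "improvement_pct": round(100 * (baseline - value) / baseline, 1),
        }
    return summary
```

With the baseline of 1.099 implied by the report above (1.052 + 0.047), a best result of 1.052 gives a delta of -0.047 and an improvement of 4.3%, matching the reported figures.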

The program.md File

This is the core of AutoResearch. It's a Markdown file that tells the agent everything it needs to know. Here's the structure:

# AutoResearch: [Project Name]

## Objective
What metric to optimize, and the run command.

## Editable Files
Which files the agent can modify (and what's in them).

## Locked Files
What NOT to touch.

## Evaluation
How to extract the metric from output.
What counts as an improvement.

## Constraints
Hard rules (no new deps, no API changes, etc).

## Strategy
What to try first. What to avoid.

## Results Format
Column definitions for results.tsv.

The key insight: you iterate on program.md over time. After a session, review what worked, refine the strategy, and run again. The strategy document IS your code.
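
The Evaluation section is the part most worth stating precisely, because the agent has to turn raw command output into a single number and a keep-or-revert decision. As an illustration only, the sketch below assumes the run command prints lines such as `val_bpb = 1.087` and that the last occurrence is the final value; `extract_metric`, `is_improvement`, and the `min_delta` threshold are hypothetical names, not something the project defines.

```python
import re


def extract_metric(output, name):
    """Return the last 'name = value' or 'name: value' number in output."""
    pattern = re.compile(
        rf"{re.escape(name)}\s*[:=]\s*([-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    )
    matches = pattern.findall(output)
    return float(matches[-1]) if matches else None


def is_improvement(candidate, best, lower_is_better=True, min_delta=0.0):
    """Decide whether a new metric value beats the current best."""
    if candidate is None:  # crashed run or metric missing from the output
        return False
    if lower_is_better:
        return candidate < best - min_delta
    return candidate > best + min_delta
```

Writing the direction (lower or higher is better) and any minimum improvement into program.md removes the ambiguity this code has to resolve with parameters.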

Examples

The examples/ directory contains ready-to-use program.md templates:

| Example | Metric | Use Case |
| --- | --- | --- |
| `ml-training/` | `val_bpb` | GPT training optimization (Karpathy-style) |
| `web-performance/` | `bundle_size_kb` | Next.js bundle size reduction |
| `api-latency/` | `avg_latency_ms` | API response time optimization |
| `flutter-startup/` | `startup_ms` | Flutter cold-start time |

Copy any example to your project and customize:

cp examples/api-latency/program.md ~/my-api-project/program.md

How It Works Under the Hood

┌──────────────────────────────────────────────────┐
│                  YOUR PROJECT                     │
│                                                   │
│  program.md          <- You write this (strategy) │
│  results.tsv         <- Agent writes this (log)   │
│  train.py / app.js   <- Agent edits this (code)   │
│                                                   │
└──────────────┬───────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────┐
│              CLAUDE CODE AGENT                    │
│                                                   │
│  /autoresearch 2h                                 │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │  1. Read program.md                         │  │
│  │  2. Plan experiment (hypothesis)            │  │
│  │  3. Edit code (single-variable change)      │  │
│  │  4. Git commit                              │  │
│  │  5. Run command                             │  │
│  │  6. Extract metric                          │  │
│  │  7. Keep / Revert                           │  │
│  │  8. Log to results.tsv                      │  │
│  │  9. Check time remaining                    │  │
│  │  10. GOTO 1                                 │  │
│  └─────────────────────────────────────────────┘  │
│                                                   │
│  Every 5: mini-review  |  Every 10: research log  │
│  Time's up: session report + stop                 │
│                                                   │
└──────────────────────────────────────────────────┘
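
The control flow in the diagram can be sketched in a few lines. In the real tool the agent follows a Markdown prompt rather than running a script, so this is only a model of the keep-or-revert logic: the callables (`propose`, `apply_change`, `run`, `revert`) are hypothetical stand-ins for the agent's planning, editing and committing, run command, and git revert, and the metric is assumed to be lower-is-better.

```python
import time


def run_loop(propose, apply_change, run, revert, baseline, budget_s,
             clock=time.monotonic):
    """Time-boxed keep-or-revert loop (all callables are stand-ins).

    propose()        -> description of the next single-variable change
    apply_change(d)  -> edits and commits the code for that change
    run()            -> metric value (lower is better), or None on a crash
    revert()         -> undoes the last change
    """
    best, log = baseline, []
    deadline = clock() + budget_s
    while clock() < deadline:  # check time remaining before each experiment
        description = propose()
        apply_change(description)
        metric = run()
        if metric is None:
            status = "crash"
        elif metric < best:
            status, best = "keep", metric
        else:
            status = "discard"
        if status != "keep":
            revert()  # never build on a failed or crashed experiment
        log.append((description, metric, status))
    return best, log
```

Passing the clock as a parameter is a design choice for this sketch: it lets the loop be exercised with a fake clock instead of waiting for real time to pass.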

Key Design Principles

One file at a time — Single-variable changes so you know what helped
Always revert failures — Never build on broken code
Commit everything — Every experiment is a git commit (your lab notebook)
Simplicity wins — A tiny improvement + 20 lines of hack? Not worth it
No new dependencies — Work with what you have
Time-boxed — Set a budget, get a report, review, iterate

Requirements

Claude Code (CLI, desktop app, or IDE extension)
A project with a measurable metric and a run command
Git (for experiment tracking)

Project Structure

autoresearch/
├── commands/
│   ├── autoresearch.md        # Main experiment loop command
│   └── autoresearch-init.md   # Strategy generator command
├── examples/
│   ├── ml-training/
│   │   └── program.md         # GPT training optimization
│   ├── web-performance/
│   │   └── program.md         # Next.js bundle size
│   ├── api-latency/
│   │   └── program.md         # API response time
│   └── flutter-startup/
│       └── program.md         # Flutter cold-start
├── install.sh                 # One-line installer
├── LICENSE                    # MIT
└── README.md                  # You are here

Credits

Inspired by Andrej Karpathy's autoresearch — the original "AI agents running research overnight" concept. This project generalizes the idea beyond ML training into reusable Claude Code slash commands for any codebase.

License

MIT
