techygarry/autoresearch

AutoResearch for Claude Code

Let AI run 100 experiments while you sleep. Autonomous experiment loops for any codebase — not just ML.

Inspired by @karpathy's autoresearch, rebuilt as reusable Claude Code slash commands that work on any project: ML training, API performance, bundle size, startup time, test coverage — anything with a measurable metric.


The Idea

You write a program.md — a strategy document that tells the agent what to optimize, what files to touch, and what to try. Then you run /autoresearch 8h and go to sleep. The agent:

  1. Reads your strategy (program.md)
  2. Modifies code (only the files you allow)
  3. Runs the experiment
  4. Evaluates the metric
  5. Keeps improvements, reverts failures
  6. Logs everything to results.tsv
  7. Repeats until time runs out

You wake up to 50-100 experiments, a clear results log, and your code on the best-performing version.

You don't program Python. You program program.md. That's the whole point.
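For a sense of what the log in step 6 could hold, here is a minimal sketch of appending rows to results.tsv and picking the best kept one. The column names and the `append_result` / `best_row` helpers are hypothetical; the real columns are whatever your program.md defines under Results Format, and the agent writes the file for you.

```python
import csv
from pathlib import Path

# Hypothetical column set; the real one comes from program.md.
COLUMNS = ["experiment", "change", "metric", "status"]


def append_result(path, row):
    """Append one experiment row to a TSV log, writing the header on first use."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


def best_row(path, lower_is_better=True):
    """Return the kept row with the best metric, or None if nothing was kept."""
    with Path(path).open(newline="") as f:
        kept = [r for r in csv.DictReader(f, delimiter="\t") if r["status"] == "keep"]
    if not kept:
        return None
    pick = min if lower_is_better else max
    return pick(kept, key=lambda r: float(r["metric"]))
```

A plain TSV keeps the log easy to diff in git and easy to open in a spreadsheet the next morning.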


Quick Start

Install (30 seconds)

git clone https://github.com/adsol-digital/autoresearch.git
cd autoresearch
chmod +x install.sh
./install.sh

This copies two slash commands to ~/.claude/commands/ — available globally in every project.

Or install manually

mkdir -p ~/.claude/commands
cp commands/autoresearch.md ~/.claude/commands/
cp commands/autoresearch-init.md ~/.claude/commands/

Usage

Step 1: Generate your strategy

cd your-project/

Then in Claude Code:

/autoresearch-init

This scans your codebase and generates a program.md tailored to your project. It asks you:

  • What metric to optimize
  • What command runs the experiment
  • Which files the agent can edit
  • What constraints to follow

Step 2: Run experiments

/autoresearch 2h

That's it. The agent enters an autonomous loop for 2 hours.


Duration Options

Command                  Duration
/autoresearch 30m        30 minutes
/autoresearch 2h         2 hours
/autoresearch 2h30m      2 hours 30 minutes
/autoresearch overnight  8 hours
/autoresearch all-day    12 hours
/autoresearch            Unlimited (until you interrupt)

Combine with a run tag:

/autoresearch mar28 2h

This runs for 2 hours on git branch autoresearch/mar28.
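The argument forms above can be read as a small grammar. The sketch below is one way to interpret them, not how the slash command does it: the command is a prompt that Claude Code follows, so there may be no parser at all. `parse_duration` and `parse_args` are hypothetical names.

```python
import re
from datetime import timedelta

# Named presets taken from the duration table.
PRESETS = {"overnight": timedelta(hours=8), "all-day": timedelta(hours=12)}
DURATION_RE = re.compile(r"(?:(\d+)h)?(?:(\d+)m)?")


def parse_duration(token):
    """Return a timedelta for '30m', '2h', '2h30m' or a preset; None otherwise."""
    if token in PRESETS:
        return PRESETS[token]
    m = DURATION_RE.fullmatch(token)
    if not m or not (m.group(1) or m.group(2)):
        return None
    return timedelta(hours=int(m.group(1) or 0), minutes=int(m.group(2) or 0))


def parse_args(argv):
    """Split arguments into (run_tag, duration); a duration of None means unlimited."""
    tag, duration = None, None
    for token in argv:
        parsed = parse_duration(token)
        if parsed is not None:
            duration = parsed
        else:
            tag = token  # anything that is not a duration is treated as the run tag
    return tag, duration
```

With this reading, `parse_args(["mar28", "2h"])` yields the tag `mar28` and a two-hour budget, and an empty argument list yields no tag and no time limit.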


What Happens During a Run

Each experiment follows this cycle:

Plan -> Implement -> Run -> Evaluate -> Record -> Log -> Adapt -> Loop

The agent prints a summary after every experiment:

--- Experiment 14 ---
Change: Reduced n_layer from 12 to 8, increased n_embd to 1024
Hypothesis: Wider-shallower model trains faster in 5min budget
Result: val_bpb = 1.087 (-0.012 vs baseline) -> keep
Best so far: 1.087 (experiment 14)
Total experiments: 14 | Kept: 6 | Discarded: 7 | Crashed: 1
Elapsed: 1h 10m | Remaining: 50m
---
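The Result line in that summary implies two small steps: pull a number out of the run output, then compare it with the baseline. A rough sketch, assuming the run prints the metric as `name = value` or `name: value` (your program.md defines the actual extraction rule) and using hypothetical helper names:

```python
import re


def extract_metric(output, name):
    """Return the last 'name = <number>' or 'name: <number>' value in the output, or None."""
    pattern = re.compile(rf"{re.escape(name)}\s*[=:]\s*(-?\d+(?:\.\d+)?)")
    matches = pattern.findall(output)
    return float(matches[-1]) if matches else None


def format_result(name, value, baseline):
    """Render the metric with its signed difference from the baseline."""
    return f"{name} = {value:.3f} ({value - baseline:+.3f} vs baseline)"
```

Taking the last match rather than the first is a deliberate choice here: training scripts often print the metric repeatedly, and the final value is usually the one that counts.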

When time's up, you get a full session report:

========================================
  AUTORESEARCH SESSION COMPLETE
========================================
Duration: 2h 0m
Total experiments: 24
  Kept: 8 | Discarded: 14 | Crashed: 2

Best result: val_bpb = 1.052 (experiment 19)
  vs baseline: -0.047 (4.3% improvement)

Top 3 improvements:
  1. Experiment 19: Cosine LR with warm restarts (-0.015)
  2. Experiment 14: Wider-shallower architecture (-0.012)
  3. Experiment 7: Increased batch size to 32K (-0.009)

Key findings:
  - Width matters more than depth at this compute budget
  - LR schedule changes had the biggest single impact
  - Optimizer changes (betas) had minimal effect

Branch: autoresearch/mar28
Results: results.tsv (24 rows)
========================================
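The percentage in the report follows from the two numbers next to it. The baseline itself is not printed, so the sketch below infers it from the best result and the reported difference; that inference is an assumption about how the report is computed.

```python
def relative_improvement(baseline, best):
    """Percentage reduction of a lower-is-better metric relative to the baseline."""
    return (baseline - best) / baseline * 100


best = 1.052
delta = -0.047            # the "vs baseline" figure from the report
baseline = best - delta   # 1.099, inferred rather than printed in the report
pct = relative_improvement(baseline, best)
print(f"{pct:.1f}% improvement")  # prints "4.3% improvement"
```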

The program.md File

This is the core of AutoResearch. It's a Markdown file that tells the agent everything it needs to know. Here's the structure:

# AutoResearch: [Project Name]

## Objective
What metric to optimize, and the run command.

## Editable Files
Which files the agent can modify (and what's in them).

## Locked Files
What NOT to touch.

## Evaluation
How to extract the metric from output.
What counts as an improvement.

## Constraints
Hard rules (no new deps, no API changes, etc).

## Strategy
What to try first. What to avoid.

## Results Format
Column definitions for results.tsv.

The key insight: you iterate on program.md over time. After a session, review what worked, refine the strategy, and run again. The strategy document IS your code.
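Since every run depends on this file, a quick pre-flight check can catch a missing section before you start an overnight session. The section names below come from the template; the `missing_sections` helper is hypothetical and not part of the slash commands.

```python
import re

# Section names taken from the program.md template.
REQUIRED_SECTIONS = [
    "Objective", "Editable Files", "Locked Files", "Evaluation",
    "Constraints", "Strategy", "Results Format",
]


def missing_sections(markdown_text):
    """Return the required '## ' headings that do not appear in the text."""
    found = {
        m.group(1).strip()
        for m in re.finditer(r"^##\s+(.+)$", markdown_text, re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in found]
```

Usage is one line, for example `missing_sections(open("program.md").read())`, and an empty list means all seven sections are present.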


Examples

The examples/ directory contains ready-to-use program.md templates:

Example           Metric          Use Case
ml-training/      val_bpb         GPT training optimization (Karpathy-style)
web-performance/  bundle_size_kb  Next.js bundle size reduction
api-latency/      avg_latency_ms  API response time optimization
flutter-startup/  startup_ms      Flutter cold-start time
Copy any example to your project and customize:

cp examples/api-latency/program.md ~/my-api-project/program.md

How It Works Under the Hood

┌──────────────────────────────────────────────────┐
│                  YOUR PROJECT                     │
│                                                   │
│  program.md          <- You write this (strategy) │
│  results.tsv         <- Agent writes this (log)   │
│  train.py / app.js   <- Agent edits this (code)   │
│                                                   │
└──────────────┬───────────────────────────────────┘
               │
               ▼
┌──────────────────────────────────────────────────┐
│              CLAUDE CODE AGENT                    │
│                                                   │
│  /autoresearch 2h                                 │
│                                                   │
│  ┌─────────────────────────────────────────────┐  │
│  │  1. Read program.md                         │  │
│  │  2. Plan experiment (hypothesis)            │  │
│  │  3. Edit code (single-variable change)      │  │
│  │  4. Git commit                              │  │
│  │  5. Run command                             │  │
│  │  6. Extract metric                          │  │
│  │  7. Keep / Revert                           │  │
│  │  8. Log to results.tsv                      │  │
│  │  9. Check time remaining                    │  │
│  │  10. GOTO 1                                 │  │
│  └─────────────────────────────────────────────┘  │
│                                                   │
│  Every 5: mini-review  |  Every 10: research log  │
│  Time's up: session report + stop                 │
│                                                   │
└──────────────────────────────────────────────────┘
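The loop in the diagram can be sketched as plain control flow. This is an illustration under stated assumptions, not the agent's implementation: in AutoResearch the loop is carried out by Claude Code following the slash command, and `propose`, `run`, and `revert` are hypothetical hooks standing in for "edit and commit", "run command and extract metric", and "undo the last change".

```python
import time


def run_loop(propose, run, revert, budget_seconds, baseline,
             lower_is_better=True, clock=time.monotonic):
    """Keep-or-revert loop; returns the best metric and a log of every attempt."""
    best, log = baseline, []
    deadline = clock() + budget_seconds
    while clock() < deadline:
        change = propose(log)      # plan and apply one change
        try:
            value = run()          # run the experiment, return the metric or None
        except Exception:
            value = None           # treat any failure as a crash
        if value is None:
            status = "crash"
        elif (value < best) if lower_is_better else (value > best):
            status, best = "keep", value
        else:
            status = "discard"
        if status != "keep":
            revert()               # next attempt starts from the best known state
        log.append((change, value, status))
    return best, log
```

Passing the clock in as a parameter makes the time budget testable without waiting, and reverting on both crashes and regressions keeps the working tree on the best version seen so far.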

Key Design Principles

  1. One file at a time — Single-variable changes so you know what helped
  2. Always revert failures — Never build on broken code
  3. Commit everything — Every experiment is a git commit (your lab notebook)
  4. Simplicity wins — A tiny improvement + 20 lines of hack? Not worth it
  5. No new dependencies — Work with what you have
  6. Time-boxed — Set a budget, get a report, review, iterate

Requirements

  • Claude Code (CLI, desktop app, or IDE extension)
  • A project with a measurable metric and a run command
  • Git (for experiment tracking)

Project Structure

autoresearch/
├── commands/
│   ├── autoresearch.md        # Main experiment loop slash command
│   └── autoresearch-init.md   # Strategy generator slash command
├── examples/
│   ├── ml-training/
│   │   └── program.md         # GPT training optimization
│   ├── web-performance/
│   │   └── program.md         # Next.js bundle size
│   ├── api-latency/
│   │   └── program.md         # API response time
│   └── flutter-startup/
│       └── program.md         # Flutter cold-start
├── install.sh                 # One-line installer
├── LICENSE                    # MIT
└── README.md                  # You are here

Credits

Inspired by Andrej Karpathy's autoresearch — the original "AI agents running research overnight" concept. This project generalizes the idea beyond ML training into reusable Claude Code slash commands for any codebase.


License

MIT
