cmosse/claude-autoresearch

claude-autoresearch

An LLM agent as the optimizer in a code-search loop.

TL;DR — An AI agent rewrites one file, a fast scorer grades each version on data it was never allowed to practice on, and the best survives — thousands of iterations a night. The scorer is built to resist overfitting, so the search chases edges that generalise. And because the loop is driven by a Claude Code session, not metered API calls, a search that long costs a flat subscription fee instead of a per-token bill. New here? Read the 2-minute ELI5.

[Diagram: the search loop. Rewrite, build, score on unseen data, compare, keep, repeat.]

The agent edits one source file — a candidate — rebuilds it, runs a fast deterministic evaluator, reads a single fitness score, and keeps or discards the change. Then it does it again. And again. Overnight, thousands of times.

It is FunSearch / AlphaEvolve-class program search, with three deliberate choices: the mutation operator is a general-purpose LLM editing real compilable code; the fitness function is built to resist overfitting so the search chases edges that generalise instead of artifacts that memorise; and the loop is driven by a Claude Code session rather than the metered API. The last point — plus a Rust evaluator fast enough to iterate — is what makes a thousands-of-candidates search practical at all (see why we built our own harness).

This repo is the harness. It ships with two worked examples — a trading-strategy search and a bin-packing-heuristic search — that share the same ~250-line domain-free core.


The loop

Each iteration runs the same steps. The agent reads the candidate and recent results, then rewrites the candidate file. The harness builds it and scores it on training, held-out, and walk-forward data. The agent compares final_score to its best so far, then either keeps the change (a git commit) or discards it (a git reset). Then it does it again; see the diagram above.

The agent never sees the evaluator's internals — only the candidate file it edits and the final_score it reads back. Every kept step is a git commit, so the search history is the git history and any step is revertible.
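The keep/discard rule above can be sketched in a few lines. This is a minimal illustration, not the harness's own code: the `Ratchet` and `Decision` names are hypothetical, and the strict "greater than" comparison is an assumption about how ties are handled.

```rust
/// What happens to a candidate after it is scored (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Decision {
    Keep,    // would correspond to a git commit
    Discard, // would correspond to a git reset
}

/// Tracks the best final_score seen so far (hypothetical type).
struct Ratchet {
    best: f64,
}

impl Ratchet {
    fn new(baseline: f64) -> Self {
        Ratchet { best: baseline }
    }

    /// Keep a candidate only if it strictly improves on the best score.
    /// Strict improvement is an assumption; ties are discarded here.
    fn consider(&mut self, final_score: f64) -> Decision {
        if final_score > self.best {
            self.best = final_score;
            Decision::Keep
        } else {
            Decision::Discard
        }
    }
}

fn main() {
    // Illustrative scores only; they are not results from the project.
    let mut ratchet = Ratchet::new(0.8651);
    for score in [0.8640, 0.8702, 0.8699, 0.8710] {
        let decision = ratchet.consider(score);
        println!("{score:.4} -> {decision:?} (best so far {:.4})", ratchet.best);
    }
}
```

Because the best score only ever moves up, a discarded candidate leaves no trace, and the sequence of kept candidates is exactly the sequence of commits.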

Why min(train, holdout, walk-forward)

A search that maximises an in-sample score will find ways to cheat it. The harness makes cheating not pay:

  • Holdout — a slice of data the candidate's author never iterates against.
  • Walk-forward — the training data re-cut into overlapping sub-windows; the worst window's score is the one that counts.
  • final_score is the minimum across all of them.

A candidate that overfits scores high on train and collapses elsewhere — and the minimum drags its reported score down to the collapse. The agent, maximising final_score, is pushed toward edges that hold up everywhere. This is the core idea, and it lives in core/ in about 250 lines.
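As a rough sketch of the scoring rule described above: the function names, the window length, and the step size below are illustrative assumptions, not the API in core/. The first example reuses the numbers from the Quick start output; the second is an invented overfit candidate.

```rust
/// Overlapping walk-forward windows over `len` data points, as (start, end)
/// index pairs. `window` and `step` are illustrative parameters; choosing
/// `step < window` makes consecutive windows overlap.
fn walk_forward_windows(len: usize, window: usize, step: usize) -> Vec<(usize, usize)> {
    let mut windows = Vec::new();
    let mut start = 0;
    while window > 0 && step > 0 && start + window <= len {
        windows.push((start, start + window));
        start += step;
    }
    windows
}

/// final_score = min(train, holdout, worst walk-forward window).
/// With no walk-forward scores this falls back to min(train, holdout).
fn final_score(train: f64, holdout: f64, walk_forward: &[f64]) -> f64 {
    let wf_min = walk_forward.iter().copied().fold(f64::INFINITY, f64::min);
    train.min(holdout).min(wf_min)
}

fn main() {
    // Four overlapping windows over ten points.
    println!("{:?}", walk_forward_windows(10, 4, 2));

    // A candidate that holds up everywhere keeps most of its score.
    println!("robust:  {:.4}", final_score(0.8724, 0.8651, &[0.8653, 0.8700]));

    // An overfit candidate is reported at its collapse, not its peak.
    println!("overfit: {:.4}", final_score(0.95, 0.40, &[0.90, 0.85]));
}
```

Taking the minimum rather than the mean is the design choice that matters: a high training score cannot buy back a weak holdout or a weak window.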

Why we built our own harness

A search is only as good as it is long — thousands of candidates, not dozens. Two costs decide whether a search that long is practical at all, and getting both right is the reason this is a purpose-built harness rather than a generic framework.

1. A flat-fee loop, not a metered one

Every iteration is an LLM turn, and a search is thousands of them.

Priced through the metered Anthropic API, that is a per-token bill that grows with the length of the search. Concretely — roughly four weeks of autoresearch in the project this harness grew out of logged ~10,300 model messages; priced at Anthropic's published Opus 4.7 API rates with prompt caching already applied, that comes to ≈ $5,900.

That figure is not a hand-wave. It comes from scripts/session_cost.py, which reads the Claude Code session JSONL files on disk, sums per-message input_tokens, output_tokens, cache_creation_input_tokens and cache_read_input_tokens across every assistant turn, and prices each class at its published rate per MTok: $15 for input, $75 for output, $18.75 or $30 for cache creation (depending on the cache lifetime), and $1.50 for cache reads. Point it at any session and it reproduces that session's figure, so the methodology is auditable and you can re-run it on your own runs.

Run as Claude Code sessions, the identical work falls under a flat subscription on the order of $100–200/month (approximate, tier-dependent), within its usage limits. For a search lasting about a month, that is roughly a 30× difference in cost at the top of that range, and closer to 60× at the bottom.

There is also no orchestration to build: Claude Code already edits files, runs builds, parses output, commits to git, and loops. FunSearch and AlphaEvolve had to engineer all of that machinery. Here autoresearch-core is ~250 lines because the orchestration is a tool you already have.
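The pricing arithmetic described above is simple enough to sketch. This is not scripts/session_cost.py; it is a Rust illustration of the same sum, with hypothetical type names. The token counts in `main` are invented for the example and are not the project's logged usage. A single cache-write rate of $18.75 is assumed for brevity.

```rust
/// Token counts summed across assistant turns (the four usage fields).
struct Usage {
    input_tokens: u64,
    output_tokens: u64,
    cache_creation_input_tokens: u64,
    cache_read_input_tokens: u64,
}

/// Prices in USD per million tokens (hypothetical struct).
struct Rates {
    input: f64,
    output: f64,
    cache_write: f64,
    cache_read: f64,
}

/// Metered cost: each token class multiplied by its per-MTok rate.
fn cost_usd(u: &Usage, r: &Rates) -> f64 {
    const MTOK: f64 = 1_000_000.0;
    (u.input_tokens as f64 * r.input
        + u.output_tokens as f64 * r.output
        + u.cache_creation_input_tokens as f64 * r.cache_write
        + u.cache_read_input_tokens as f64 * r.cache_read)
        / MTOK
}

/// How many times more the metered bill is than a flat fee.
fn cost_ratio(metered_usd: f64, flat_usd: f64) -> f64 {
    metered_usd / flat_usd
}

fn main() {
    let rates = Rates { input: 15.0, output: 75.0, cache_write: 18.75, cache_read: 1.50 };

    // Invented token counts, for illustration only.
    let usage = Usage {
        input_tokens: 1_000_000,
        output_tokens: 2_000_000,
        cache_creation_input_tokens: 4_000_000,
        cache_read_input_tokens: 100_000_000,
    };
    println!("metered cost: ${:.2}", cost_usd(&usage, &rates));

    // The figures quoted in the text: about $5,900 metered vs $100-200 flat.
    println!("ratio at $200/month: {:.1}x", cost_ratio(5900.0, 200.0));
    println!("ratio at $100/month: {:.1}x", cost_ratio(5900.0, 100.0));
}
```

Even at a tenth of the input price, cache reads can dominate the bill in a long session, because the same context is re-read on every turn.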

2. A Rust evaluator, fast enough to iterate

A search only makes as many attempts a night as the evaluator allows. The trading harness is built on a Rust backtest engine, ported from a Python reference implementation — and the project benchmarked the port:

| Workload | Python | Rust | Speedup |
| --- | --- | --- | --- |
| One backtest (1M ticks, 24 h) | ~30 s | 161 ms | ~190× |
| Grid sweep (1920 configs) | ~16 h | 112 s | ~500× |

At ~30 s per backtest an overnight run gets through a few hundred candidates; at ~160 ms it gets through thousands. Same loop, same idea — but only the fast evaluator finishes a useful search by morning. (An on-disk tick cache adds a further ~118× on data loading, so repeated runs pay the parse cost once.)
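The benchmark figures can be checked with back-of-envelope arithmetic. The sketch below assumes an 8-hour night, which the text does not specify, and counts raw backtests only. Each candidate is scored on several data cuts (train, holdout, and every walk-forward window) and the agent's own turn takes time, so candidates per night are a fraction of these ceilings.

```rust
/// Ratio of the slow runtime to the fast one, both in seconds.
fn speedup(slow_secs: f64, fast_secs: f64) -> f64 {
    slow_secs / fast_secs
}

/// Upper bound on backtests that fit in one night, ignoring agent time.
fn backtests_per_night(night_hours: f64, secs_per_backtest: f64) -> u64 {
    (night_hours * 3600.0 / secs_per_backtest).floor() as u64
}

fn main() {
    // Single backtest: ~30 s in Python vs 161 ms in Rust.
    println!("single backtest: {:.0}x", speedup(30.0, 0.161));

    // Grid sweep: ~16 h in Python vs 112 s in Rust.
    println!("grid sweep:      {:.0}x", speedup(16.0 * 3600.0, 112.0));

    // Assumed 8-hour night (an assumption, not a figure from the project).
    let night_hours = 8.0;
    println!("python ceiling:  {}", backtests_per_night(night_hours, 30.0));
    println!("rust ceiling:    {}", backtests_per_night(night_hours, 0.161));
}
```

At 30 s per backtest the evaluator is the bottleneck; at 161 ms it effectively stops being one, and the agent's turn time sets the pace.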

Together: cheap enough and fast enough that "leave it running overnight" is a real workflow, not a budget line. program.md is plain instructions, so the harness still works wired to the raw API — but the path this repo is built for is a Claude Code session. See docs/agent-loop.md.

Quick start

The bin-packing example needs no data — instances come from a seeded PRNG — so it runs on a bare clone:

cargo run -p binpacking-search --release
train_score:      0.8724
holdout_score:    0.8651
wf_min_score:     0.8653
final_score:      0.8651          <- the number the agent maximises
train.vs_baseline: -0.0065        <- the starter heuristic; the agent's job is to beat it

The two examples

| Example | Candidate the agent edits | Dataset | Score |
| --- | --- | --- | --- |
| binpacking-search | An online bin-packing heuristic | Seeded PRNG, no download | Mean packing efficiency |
| trading-search | A tick-level trading strategy | Bundled market-data sample | Gated Probabilistic Sharpe Ratio |

They share autoresearch-core and differ only in the domain plug-in. That is the point: the harness is not a backtester with extra steps — trading is just one Experiment implementation.

Run the agent loop

The loop is driven by an LLM coding agent (built for Claude Code). Point the agent at an example's program.md and let it iterate. See docs/agent-loop.md.

Add your own domain

Implement the Experiment trait (two methods), write a program.md, drop in a starter candidate. The harness handles train/holdout/walk-forward scoring, the keep/discard ratchet, and the agent protocol. See docs/adding-an-experiment.md.
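To give a feel for what a domain plug-in might look like, here is a toy sketch. The real Experiment trait is documented in docs/adding-an-experiment.md. The method names, signatures, and the `score_with_holdout` helper below are all assumptions made up for illustration, and walk-forward scoring is left out for brevity.

```rust
/// Hypothetical stand-in for the harness's Experiment trait. The real trait
/// has two methods; these names and signatures are assumptions.
trait Experiment {
    /// Number of data points available to slice.
    fn len(&self) -> usize;
    /// Score the candidate on data[start..end]; higher is better.
    fn score(&self, start: usize, end: usize) -> f64;
}

/// Toy domain: the "score" is just the mean of a slice of numbers.
struct MeanOfSlice {
    data: Vec<f64>,
}

impl Experiment for MeanOfSlice {
    fn len(&self) -> usize {
        self.data.len()
    }

    /// Assumes a non-empty range; an empty range would yield NaN.
    fn score(&self, start: usize, end: usize) -> f64 {
        let slice = &self.data[start..end];
        slice.iter().sum::<f64>() / slice.len() as f64
    }
}

/// Domain-free scoring: train on [0, train_end), hold out the rest,
/// and report the worse of the two.
fn score_with_holdout<E: Experiment>(exp: &E, train_end: usize) -> f64 {
    let train = exp.score(0, train_end);
    let holdout = exp.score(train_end, exp.len());
    train.min(holdout)
}

fn main() {
    let exp = MeanOfSlice { data: vec![1.0, 2.0, 3.0, 4.0] };
    println!("final_score: {:.2}", score_with_holdout(&exp, 2));
}
```

The scoring function knows nothing about the domain beyond the trait, which is what lets the bin-packing and trading examples share one core.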

Repository layout

core/                  autoresearch-core — domain-free harness (~250 LOC)
examples/binpacking/   non-trading example: bin-packing heuristic search
examples/trading/      trading example: strategy search
  engine/              trading-engine — tick-level backtest engine
docs/                  architecture, the agent loop, adding a domain

License

MIT — see LICENSE.


This project is not affiliated with, endorsed by, or sponsored by Anthropic. "Claude" and "Claude Code" are products of Anthropic; this is an independent piece of software that targets that runtime.
