multi-agent-analyst

A multi-agent crew that answers business questions no single source can answer. Ask "how many May orders qualified for free shipping?" and the answer depends on two sources at once: the shipping policy (which states the $75 threshold) and the orders database (which holds the transactions). One agent cannot do both jobs well. A crew can.

$ python main.py "How many orders this May qualified for free shipping under our policy?"

Based on the evidence, 57 orders this May qualified for free shipping [sql-1].
These orders met both key criteria from the shipping policy: they were placed
by U.S.-based customers and had a merchandise subtotal of $75.00 or more after
discounts [shipping-policy]. International orders were excluded, as they are
ineligible for free shipping regardless of order value [shipping-policy].

That answer is a real captured run, and the detail worth noticing is the U.S.-only filter: nobody asked for it. The SQL worker read "international orders do not qualify" in the retrieved policy passage and added c.country = 'United States' to its own query. That is what evidence-driven querying buys you.

The architecture

Four roles, one model, different jobs:

role	job	trust model
Planner	decompose the question, pick workers	its plan is validated in code: unknown workers dropped, empty plan = honest failure
SQL worker	write and run queries against the live schema	validator + read-only connection + LIMIT injection + timeout + self-correction loop
Doc worker	retrieve policy passages	not an LLM: pure-Python BM25, deterministic and explainable
Verifier	judge whether the evidence supports an answer	objections are fed back and the SQL is redone once; still short = the crew refuses, stating what is missing
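
The planner row's trust model can be sketched in a few lines. This is a minimal illustration, not the project's actual code: the worker names, the subtask shape (a dict with "worker" and "task") and the names validate_plan and EmptyPlanError are all assumptions.

```python
KNOWN_WORKERS = {"sql", "doc"}  # assumed worker names, for illustration only


class EmptyPlanError(Exception):
    """Raised when no usable subtask survives validation."""


def validate_plan(raw_plan):
    """Keep only well-formed subtasks addressed to known workers."""
    valid = []
    for step in raw_plan:
        if not isinstance(step, dict):
            continue  # the model returned something that is not a subtask
        worker = step.get("worker")
        task = step.get("task")
        known = isinstance(worker, str) and worker in KNOWN_WORKERS
        if known and isinstance(task, str) and task.strip():
            valid.append({"worker": worker, "task": task.strip()})
    if not valid:
        # An empty plan is reported as a failure instead of being papered over.
        raise EmptyPlanError("planner produced no usable subtasks")
    return valid
```

A caller would catch EmptyPlanError and report that the question could not be planned, rather than dispatching workers on a guess.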

Three design choices worth defending:

The model's routing is a suggestion, not a guarantee. Everything the planner returns is checked in code before a worker runs. Multi-agent systems fail at the seams, so the seams are where the validation lives.
Retrieval is not a model call. BM25 in ~60 lines of stdlib Python ranks the policy passages. It is fast, free, deterministic, and its ranking can be explained. The model interprets evidence; it does not fetch it.
Evidence flows forward. Doc subtasks run first and their passages are handed to the SQL worker as context, because "count orders above the threshold" is only writable when the model can see the threshold.
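
To make the second point concrete, here is a compact BM25 ranker in stdlib Python. It is a sketch of the general technique, not the project's implementation: the tokenizer, the k1 and b defaults, the non-negative IDF variant and the function names are assumptions, and the sample passages only paraphrase rules mentioned in this README.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_rank(query, passages, k1=1.5, b=0.75):
    """Return (score, passage_index) pairs, best match first."""
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(d) for d in docs) / n_docs or 1.0
    # Document frequency: in how many passages each term appears.
    df = Counter()
    for d in docs:
        df.update(set(d))
    query_terms = set(tokenize(query))
    ranked = []
    for i, d in enumerate(docs):
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        ranked.append((score, i))
    # Ties break on passage order, so the ranking is fully deterministic.
    ranked.sort(key=lambda pair: (-pair[0], pair[1]))
    return ranked


# Illustrative passages only, not the real corpus.
passages = [
    "Orders of $75 or more ship free within the United States.",
    "Customers reach VIP status at $2,000 in lifetime spend.",
    "Orders that have shipped cannot be cancelled.",
]
best_score, best_index = bm25_rank("which orders ship free", passages)[0]
```

Because every score is a sum of per-term contributions, the ranking can be explained by listing which query terms matched and how much each one added.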

Self-correction at two levels

The SQL worker corrects itself against database errors: a failed query comes back with the error message and gets rewritten, up to a bounded number of tries. The crew corrects itself against verifier objections: when the verifier rejects a round of evidence, its specific complaints are threaded into the SQL worker's context and the queries are redone once.
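
The first level can be sketched with the standard sqlite3 module. This is an illustration under assumptions: rewrite is a hypothetical callable standing in for the model call that repairs a query, and the function name and the default of three tries are not taken from the project.

```python
import sqlite3


def run_with_retries(conn, sql, rewrite, max_tries=3):
    """Run sql; on a database error, ask rewrite(sql, error) for a new query.

    Returns (rows, attempts). attempts records every query that was tried,
    paired with its error message (None for the one that succeeded).
    Raises RuntimeError once max_tries queries have failed.
    """
    attempts = []
    for attempt in range(1, max_tries + 1):
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            attempts.append((sql, str(exc)))
            if attempt < max_tries:
                # Feed the failed query and the error text back for a rewrite.
                sql = rewrite(sql, str(exc))
        else:
            attempts.append((sql, None))
            return rows, attempts
    raise RuntimeError(f"gave up after {max_tries} tries: {attempts[-1][1]}")
```

Keeping the failed attempts alongside the successful one is what lets a trace show every SQL attempt, including the ones that did not run.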

This is not theoretical. In the first live run of this project, the planner paraphrased the free-shipping subtask without the dollar amount, and the SQL worker, unable to see the policy, guessed a $50 threshold. The verifier caught it: it cross-checked the query against the retrieved policy passage, listed the exact mismatches (threshold, discounts, international orders), and blocked the answer. That failure became the feedback loop and the regression test the project now ships with.

Honesty is a feature

The verifier sits between the workers and the final answer. If the SQL worker exhausted its retries, or the policy passage that defines a threshold was not found, the crew does not synthesize confident prose over the gap. It answers:

I cannot answer this reliably: the free-shipping threshold from the policy
is missing from the evidence.

Every fact in a successful answer carries a citation marker ([sql-1], [refund-policy]) pointing at the evidence that produced it, and --verbose prints the full trace: the plan, every SQL attempt including the failed ones, the retrieved passages, and the verifier's verdict.
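
The gate itself, together with the single redo round on verifier objections, might look like the sketch below. The callables gather, verify and synthesize are hypothetical stand-ins for the workers, the verifier and the answer writer, and their signatures are assumptions made for this example.

```python
def answer_or_refuse(question, gather, verify, synthesize, max_rounds=2):
    """Gather evidence, verify it, and either answer or refuse.

    gather(question, objections) returns evidence, verify(question, evidence)
    returns (ok, objections), synthesize(question, evidence) returns prose.
    With max_rounds=2 a rejected first round is redone exactly once.
    """
    objections = []
    for _ in range(max_rounds):
        # Objections from the previous round are passed back to the workers.
        evidence = gather(question, objections)
        ok, objections = verify(question, evidence)
        if ok:
            return synthesize(question, evidence)
    reason = "; ".join(objections) or "the evidence is insufficient"
    return "I cannot answer this reliably: " + reason
```

The refusal path reuses the verifier's own objections, so the message states what is missing instead of a generic error.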

The sample company

data/generate.py builds a deterministic e-commerce database (customers, products, orders, order items) and data/docs/ holds five policy documents (shipping, refunds, VIP program, support SLAs, vendor terms). They interlock on purpose: the docs state the business rules, the database holds the transactions those rules apply to. Questions that need both:

question	docs provide	database provides
"How many May orders shipped free?"	the $75 threshold	the order totals
"Which customers are VIP?"	the $2,000 lifetime rule	per-customer spend
"How many orders could still be cancelled?"	"shipped cannot be cancelled"	status counts

Setup

python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python data/generate.py            # builds data/company.db
export ANTHROPIC_API_KEY=sk-ant-...

Run

python main.py "Which policy decides if an order can be cancelled, and how many are still cancellable?"
python main.py --verbose "How many customers qualify as VIP under our program rules?"
python main.py                      # interactive mode

Tests

The full orchestration runs in tests with no API key and no network: the model is replaced by a scripted double, so plan validation, worker dispatch, SQL self-correction, the verifier gate and honest refusal are all asserted deterministically.

python -m tests.test_workers
python -m tests.test_crew
# or: pytest
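
For readers unfamiliar with the pattern, a scripted double can be as small as the sketch below. The class name, the complete method and its (role, prompt) signature are assumptions for illustration; the real interface lives in app/llm.py and may differ.

```python
class ScriptedModel:
    """Test double: returns canned replies in order and records each call."""

    def __init__(self, replies):
        self._replies = list(replies)
        self.calls = []

    def complete(self, role, prompt):
        """Record the call and hand back the next scripted reply."""
        self.calls.append((role, prompt))
        if not self._replies:
            # Running out of script means the code made a call the test
            # did not expect, which is itself a useful failure.
            raise AssertionError(f"unexpected model call for role {role!r}")
        return self._replies.pop(0)
```

A test would script one reply per expected model call, run the crew against the double, and then assert on both the final answer and the recorded calls.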

Project layout

path	job
`app/crew.py`	the orchestrator: plan, dispatch, verify, answer or refuse
`app/workers.py`	the SQL worker (self-correcting) and the doc worker (BM25)
`app/llm.py`	all four model roles behind one interface (+ scripted test double)
`app/sqlsafe.py`	read-only SQL validation and LIMIT injection
`app/db.py`	read-only connection, query timeout, live schema description
`data/generate.py`	deterministic company database
`data/docs/`	the policy corpus the doc worker searches
`tests/`	workers and full-crew orchestration tests, all key-free
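
As a rough idea of what the validation and LIMIT injection in `app/sqlsafe.py` involve, here is a deliberately conservative sketch. It is not the project's code: the function name make_safe, the keyword list and the max_rows default are assumptions, and the regex checks are naive (they would, for example, reject a harmless query whose string literal contains a semicolon or the word "update").

```python
import re

# Keywords that indicate a write or a schema change (assumed list).
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|replace|attach|pragma)\b",
    re.IGNORECASE,
)


def make_safe(sql, max_rows=200):
    """Reject anything that is not a single read query and cap its row count."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        raise ValueError("only one statement is allowed")
    if not re.match(r"(?i)(select|with)\b", stmt):
        raise ValueError("only SELECT queries are allowed")
    if FORBIDDEN.search(stmt):
        raise ValueError("write or schema keyword found")
    # Append a LIMIT only when the query does not already carry one.
    if not re.search(r"(?i)\blimit\s+\d+", stmt):
        stmt = f"{stmt} LIMIT {max_rows}"
    return stmt
```

Erring on the side of rejection is acceptable here because a rejected query goes back through the self-correction loop, and the read-only connection remains as a second line of defence.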

Design notes

Workers gather evidence; they do not answer. The separation keeps each piece testable alone and makes the final answer auditable.
Same defence-in-depth as a standalone SQL agent. The database connection is opened with mode=ro, so SQLite itself rejects a write even if one slips past the validator.
Swap the company, keep the crew. Point the config at another SQLite file and another folder of .md policies; the schema is read live and the corpus is indexed at startup.
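
The read-only guarantee in the second note relies on SQLite's URI filenames, which Python's sqlite3 module supports through uri=True. The sketch below shows the mechanism; the helper names open_read_only and write_is_rejected are made up for this example and are not taken from `app/db.py`.

```python
import sqlite3
from pathlib import Path


def open_read_only(path):
    """Open an existing SQLite file read-only via a URI filename."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def write_is_rejected(conn):
    """Return True if the connection refuses a write attempt."""
    try:
        conn.execute("CREATE TABLE scratch (x INTEGER)")
    except sqlite3.OperationalError:
        return True  # SQLite refused the write on this connection
    return False
```

With a connection opened this way, the check holds at the database layer regardless of what the SQL validator let through.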
