A multi-agent crew that answers business questions no single source can answer. Ask "how many May orders qualified for free shipping?" and the answer needs two different places at once: the shipping policy (which states the $75 threshold) and the orders database (which holds the transactions). One agent cannot do both jobs well. A crew can.
```
$ python main.py "How many orders this May qualified for free shipping under our policy?"
Based on the evidence, 57 orders this May qualified for free shipping [sql-1].
These orders met both key criteria from the shipping policy: they were placed
by U.S.-based customers and had a merchandise subtotal of $75.00 or more after
discounts [shipping-policy]. International orders were excluded, as they are
ineligible for free shipping regardless of order value [shipping-policy].
```
That answer is a real captured run, and the detail worth noticing is the
U.S.-only filter: nobody asked for it. The SQL worker read "international
orders do not qualify" in the retrieved policy passage and added
`c.country = 'United States'` to its own query. That is what evidence-driven
querying buys you.
Four roles, one model, different jobs:
| role | job | trust model |
|---|---|---|
| Planner | decompose the question, pick workers | its plan is validated in code: unknown workers dropped, empty plan = honest failure |
| SQL worker | write and run queries against the live schema | validator + read-only connection + LIMIT injection + timeout + self-correction loop |
| Doc worker | retrieve policy passages | not an LLM: pure-Python BM25, deterministic and explainable |
| Verifier | judge whether the evidence supports an answer | objections are fed back and the SQL is redone once; still short = the crew refuses, stating what is missing |
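The planner row above can be illustrated with a small sketch. It assumes the planner returns a list of dicts with `worker` and `instruction` keys; that shape, the worker names in `KNOWN_WORKERS`, and the names `Subtask`, `EmptyPlanError` and `validate_plan` are all hypothetical, not taken from the project code.

```python
from dataclasses import dataclass

# Hypothetical worker names; the project's real registry may differ.
KNOWN_WORKERS = {"sql", "docs"}


@dataclass(frozen=True)
class Subtask:
    worker: str
    instruction: str


class EmptyPlanError(Exception):
    """Raised when no usable subtask survives validation."""


def validate_plan(raw_plan):
    """Drop malformed or unknown-worker subtasks; fail loudly if none remain."""
    kept = []
    for item in raw_plan:
        if not isinstance(item, dict):
            continue  # the model returned something that is not a subtask
        worker = item.get("worker")
        instruction = item.get("instruction")
        if not isinstance(worker, str) or worker not in KNOWN_WORKERS:
            continue  # unknown worker: dropped, never dispatched
        if not isinstance(instruction, str) or not instruction.strip():
            continue  # nothing for the worker to do
        kept.append(Subtask(worker, instruction.strip()))
    if not kept:
        # An empty plan is reported as a failure, not papered over.
        raise EmptyPlanError("planner produced no usable subtasks")
    # Stable sort: doc subtasks first, so their passages can reach the SQL worker.
    kept.sort(key=lambda s: s.worker != "docs")
    return kept
```

In this sketch the caller would catch `EmptyPlanError` and report that the question could not be planned, rather than dispatching workers on a guess.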
Three design choices worth defending:
- The model's routing is a suggestion, not a guarantee. Everything the planner returns is checked in code before a worker runs. Multi-agent systems fail at the seams, so the seams are where the validation lives.
- Retrieval is not a model call. BM25 in ~60 lines of stdlib Python ranks the policy passages. It is fast, free, deterministic, and its ranking can be explained. The model interprets evidence; it does not fetch it.
- Evidence flows forward. Doc subtasks run first and their passages are handed to the SQL worker as context, because "count orders above the threshold" is only writable when the model can see the threshold.
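To make the retrieval choice concrete, here is a compact BM25 sketch using only the standard library. It is not the project's implementation: the class name, the tokenizer, the `k1` and `b` defaults, and the non-negative IDF variant (`log(1 + (N - n + 0.5) / (n + 0.5))`) are assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumerics (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / max(len(docs), 1)
        # Document frequency: in how many passages each term appears.
        df = Counter()
        for toks in self.doc_tokens:
            df.update(set(toks))
        n = len(docs)
        self.idf = {t: math.log(1 + (n - c + 0.5) / (c + 0.5)) for t, c in df.items()}

    def score(self, query, i):
        """BM25 score of passage i for the query; each term's share is inspectable."""
        freqs, length = self.doc_freqs[i], len(self.doc_tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = freqs.get(term, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def rank(self, query, top_k=3):
        """Return (index, score) pairs, best first; ties break on index."""
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        scored = [(s, i) for s, i in scored if s > 0]
        scored.sort(key=lambda p: (-p[0], p[1]))
        return [(i, s) for s, i in scored[:top_k]]
```

Because there is no randomness and no model call, the same query against the same corpus always yields the same ranking, which is what keeps the doc worker testable without an API key.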
The SQL worker corrects itself against database errors: a failed query comes back with the error message and gets rewritten, up to a bounded number of tries. The crew corrects itself against verifier objections: when the verifier rejects a round of evidence, its specific complaints are threaded into the SQL worker's context and the queries are redone once.
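The inner loop described above might look roughly like the following sketch. The function name, the `write_sql(task, feedback)` callable standing in for the model, and the default of three attempts are assumptions; the source only says the number of tries is bounded.

```python
import sqlite3


def run_with_self_correction(conn, write_sql, task, max_attempts=3):
    """Ask for SQL, run it, and feed any database error back into the next try.

    write_sql(task, feedback) -> str stands in for the model call.
    Returns (rows, attempts); rows is None if every attempt failed.
    """
    attempts = []      # full trace, failed queries included
    feedback = None    # nothing to correct on the first try
    for _ in range(max_attempts):
        sql = write_sql(task, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            # The error message becomes context for the rewrite.
            feedback = f"Query failed: {sql!r} -> {exc}"
            attempts.append((sql, str(exc)))
            continue
        attempts.append((sql, None))
        return rows, attempts
    return None, attempts
```

Returning `None` on exhaustion, instead of raising or returning an empty list, lets the caller distinguish "no rows matched" from "no query ever ran", which matters when the verifier later decides whether to refuse.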
This is not theoretical. In the first live run of this project, the planner paraphrased the free-shipping subtask without the dollar amount, and the SQL worker, unable to see the policy, guessed a $50 threshold. The verifier caught it: it cross-checked the query against the retrieved policy passage, listed the exact mismatches (threshold, discounts, international orders), and blocked the answer. That failure became the feedback loop and the regression test the project now ships with.
The verifier sits between the workers and the final answer. If the SQL worker exhausted its retries, or the policy passage that defines a threshold was not found, the crew does not synthesize confident prose over the gap. It answers:
```
I cannot answer this reliably: the free-shipping threshold from the policy
is missing from the evidence.
```
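A rough sketch of that gate, including the single redo driven by verifier objections, is shown here. The `Verdict` type and the `gather`, `verify` and `synthesize` callables are hypothetical stand-ins for the crew's workers and model roles, not the project's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    supported: bool
    objections: list = field(default_factory=list)


def answer_or_refuse(gather, verify, synthesize, max_redos=1):
    """Gather evidence, verify it, redo once with objections, else refuse.

    gather(objections) -> evidence   (workers, given the verifier's complaints)
    verify(evidence)   -> Verdict
    synthesize(evidence) -> str      (only called on supported evidence)
    """
    objections = []
    for _ in range(max_redos + 1):
        evidence = gather(objections)
        verdict = verify(evidence)
        if verdict.supported:
            return synthesize(evidence)
        # Thread the specific complaints into the next round of evidence.
        objections = verdict.objections
    missing = "; ".join(objections) or "the evidence is insufficient"
    return f"I cannot answer this reliably: {missing}."
```

The point of the shape is that `synthesize` is unreachable unless the verifier has approved the evidence, so a refusal is the default outcome, not an exceptional one.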
Every fact in a successful answer carries a citation marker (`[sql-1]`,
`[refund-policy]`) pointing at the evidence that produced it, and `--verbose`
prints the full trace: the plan, every SQL attempt including the failed ones,
the retrieved passages, and the verifier's verdict.
data/generate.py builds a deterministic e-commerce database (customers,
products, orders, order items) and data/docs/ holds five policy documents
(shipping, refunds, VIP program, support SLAs, vendor terms). They interlock
on purpose: the docs state the business rules, the database holds the
transactions those rules apply to. Questions that need both:
| question | docs provide | database provides |
|---|---|---|
| "How many May orders shipped free?" | the $75 threshold | the order totals |
| "Which customers are VIP?" | the $2,000 lifetime rule | per-customer spend |
| "How many orders could still be cancelled?" | "shipped cannot be cancelled" | status counts |
```
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python data/generate.py    # builds data/company.db
export ANTHROPIC_API_KEY=sk-ant-...

python main.py "Which policy decides if an order can be cancelled, and how many are still cancellable?"
python main.py --verbose "How many customers qualify as VIP under our program rules?"
python main.py             # interactive mode
```

The full orchestration runs in tests with no API key and no network: the model is replaced by a scripted double, so plan validation, worker dispatch, SQL self-correction, the verifier gate and honest refusal are all asserted deterministically.
```
python -m tests.test_workers
python -m tests.test_crew
# or: pytest
```

| path | job |
|---|---|
| `app/crew.py` | the orchestrator: plan, dispatch, verify, answer or refuse |
| `app/workers.py` | the SQL worker (self-correcting) and the doc worker (BM25) |
| `app/llm.py` | all four model roles behind one interface (+ scripted test double) |
| `app/sqlsafe.py` | read-only SQL validation and LIMIT injection |
| `app/db.py` | read-only connection, query timeout, live schema description |
| `data/generate.py` | deterministic company database |
| `data/docs/` | the policy corpus the doc worker searches |
| `tests/` | workers and full-crew orchestration tests, all key-free |

- Workers gather evidence; they do not answer. The separation keeps each piece testable alone and makes the final answer auditable.
- Same defence-in-depth as a standalone SQL agent. The database connection is opened with `mode=ro`, so a write is physically rejected even if it slipped past the validator.
- Swap the company, keep the crew. Point the config at another SQLite file and another folder of .md policies; the schema is read live and the corpus is indexed at startup.
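
The read-only connection and LIMIT injection mentioned above can be sketched with the standard `sqlite3` module. The function names and the 200-row default are assumptions for illustration, and the LIMIT check is a naive regex, not a SQL parser; the project's `app/sqlsafe.py` and `app/db.py` may work differently.

```python
import re
import sqlite3
from pathlib import Path


def open_read_only(path):
    """Open a SQLite file via a URI with mode=ro; writes fail inside the engine."""
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def ensure_limit(sql, max_rows=200):
    """Append a LIMIT if the query has none (naive check, illustration only)."""
    stripped = sql.strip().rstrip(";").strip()
    if re.search(r"\blimit\s+\d+", stripped, re.IGNORECASE):
        return stripped  # the query already bounds its own result
    return f"{stripped} LIMIT {max_rows}"
```

With a connection opened this way, a statement such as `DELETE FROM orders` raises `sqlite3.OperationalError` even if a validator had let it through, which is the second layer the bullet above describes.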
