vinimabreu/multi-agent-analyst

multi-agent-analyst

A multi-agent crew that answers business questions no single source can answer. Ask "how many May orders qualified for free shipping?" and the answer depends on two sources at once: the shipping policy (which states the $75 threshold) and the orders database (which holds the transactions). One agent cannot do both jobs well. A crew can.

$ python main.py "How many orders this May qualified for free shipping under our policy?"

Based on the evidence, 57 orders this May qualified for free shipping [sql-1].
These orders met both key criteria from the shipping policy: they were placed
by U.S.-based customers and had a merchandise subtotal of $75.00 or more after
discounts [shipping-policy]. International orders were excluded, as they are
ineligible for free shipping regardless of order value [shipping-policy].

That answer is a real captured run, and the detail worth noticing is the U.S.-only filter: nobody asked for it. The SQL worker read "international orders do not qualify" in the retrieved policy passage and added c.country = 'United States' to its own query. That is what evidence-driven querying buys you.

The architecture

Four roles, one model, different jobs:

| role | job | trust model |
| --- | --- | --- |
| Planner | decompose the question, pick workers | its plan is validated in code: unknown workers dropped, empty plan = honest failure |
| SQL worker | write and run queries against the live schema | validator + read-only connection + LIMIT injection + timeout + self-correction loop |
| Doc worker | retrieve policy passages | not an LLM: pure-Python BM25, deterministic and explainable |
| Verifier | judge whether the evidence supports an answer | objections are fed back and the SQL is redone once; still short = the crew refuses, stating what is missing |
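
The planner's trust model (unknown workers dropped, an empty plan treated as an honest failure) might look roughly like the sketch below. The worker names, the `Subtask` shape, the dict-based plan format, and `validate_plan` itself are assumptions made for illustration; the project's own validation code may be organised differently.

```python
from dataclasses import dataclass

# Hypothetical worker names; the real registry lives in the project code.
KNOWN_WORKERS = {"sql", "docs"}


@dataclass(frozen=True)
class Subtask:
    worker: str
    instruction: str


class EmptyPlanError(ValueError):
    """Raised when no usable subtask survives validation."""


def validate_plan(raw_plan):
    """Keep only well-formed subtasks addressed to a known worker."""
    subtasks = []
    for item in raw_plan:
        if not isinstance(item, dict):
            continue  # malformed entry: ignored
        worker = item.get("worker")
        instruction = item.get("instruction")
        if not isinstance(worker, str) or worker not in KNOWN_WORKERS:
            continue  # unknown worker: dropped, never dispatched
        if not isinstance(instruction, str) or not instruction.strip():
            continue  # nothing to do: dropped
        subtasks.append(Subtask(worker, instruction.strip()))
    if not subtasks:
        # An empty plan is reported as a failure instead of being papered over.
        raise EmptyPlanError("planner produced no usable subtasks")
    return subtasks
```

The point of the sketch is that nothing the model returns reaches a worker unchecked: a hallucinated worker name simply disappears from the plan, and a plan with nothing left is surfaced as a failure.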

Three design choices worth defending:

  • The model's routing is a suggestion, not a guarantee. Everything the planner returns is checked in code before a worker runs. Multi-agent systems fail at the seams, so the seams are where the validation lives.
  • Retrieval is not a model call. BM25 in ~60 lines of stdlib Python ranks the policy passages. It is fast, free, deterministic, and its ranking can be explained. The model interprets evidence; it does not fetch it.
  • Evidence flows forward. Doc subtasks run first and their passages are handed to the SQL worker as context, because "count orders above the threshold" is only writable when the model can see the threshold.
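
To make the "retrieval is not a model call" point concrete, here is a minimal Okapi BM25 ranker using only the standard library. It is a sketch, not the project's doc worker: the class name, the tokenizer, the `k1` and `b` defaults, and the sample passages are all assumptions (only the $75 and $2,000 figures come from this README).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, passages, k1=1.5, b=0.75):
        self.passages = list(passages)
        self.k1, self.b = k1, b
        self.counts = [Counter(tokenize(p)) for p in self.passages]
        self.lengths = [sum(c.values()) for c in self.counts]
        self.avg_len = sum(self.lengths) / max(len(self.lengths), 1)
        doc_freq = Counter()
        for counts in self.counts:
            doc_freq.update(counts.keys())  # one vote per passage per term
        n = len(self.passages)
        self.idf = {
            term: math.log(1 + (n - df + 0.5) / (df + 0.5))
            for term, df in doc_freq.items()
        }

    def score(self, query_terms, index):
        counts = self.counts[index]
        ratio = self.lengths[index] / self.avg_len if self.avg_len else 0.0
        norm = self.k1 * (1 - self.b + self.b * ratio)  # length normalisation
        total = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query, top_k=3):
        terms = sorted(set(tokenize(query)))  # sorted for a stable summation order
        scored = [(self.score(terms, i), i) for i in range(len(self.passages))]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by position
        return [(self.passages[i], s) for s, i in scored[:top_k] if s > 0]


# Made-up passages standing in for data/docs/.
PASSAGES = [
    "Free shipping applies to orders with a merchandise subtotal of $75.00 or more.",
    "Refunds are returned to the original payment method.",
    "Customers reach VIP status at $2,000 in lifetime spend.",
]
top = BM25(PASSAGES).search("free shipping threshold")
```

Because every score is a sum of per-term contributions, a ranking can be explained by listing which query terms matched and how much each one added.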

Self-correction at two levels

The SQL worker corrects itself against database errors: a failed query comes back with the error message and gets rewritten, up to a bounded number of tries. The crew corrects itself against verifier objections: when the verifier rejects a round of evidence, its specific complaints are threaded into the SQL worker's context and the queries are redone once.
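
The first level, correcting against database errors, can be sketched as a bounded retry loop. The function name, the `write_sql(task, feedback)` callable standing in for the model, and the default of three attempts are assumptions for illustration; the README only says the number of tries is bounded.

```python
import sqlite3


def run_with_self_correction(conn, write_sql, task, max_attempts=3):
    """Ask `write_sql` for a query, feeding each database error into the retry."""
    attempts = []  # full trace, including the failed queries
    feedback = None
    for _ in range(max_attempts):
        sql = write_sql(task, feedback)
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            feedback = f"{type(exc).__name__}: {exc}"
            attempts.append({"sql": sql, "error": feedback})
            continue  # hand the error text back and try again
        attempts.append({"sql": sql, "error": None})
        return rows, attempts
    return None, attempts  # retries exhausted: the caller must not guess
```

Returning the attempt list alongside the rows is what would let a verbose trace show the failed queries as well as the one that finally ran.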

This is not theoretical. In the first live run of this project, the planner paraphrased the free-shipping subtask without the dollar amount, and the SQL worker, unable to see the policy, guessed a $50 threshold. The verifier caught it: it cross-checked the query against the retrieved policy passage, listed the exact mismatches (threshold, discounts, international orders), and blocked the answer. That failure became the feedback loop and the regression test the project now ships with.
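
The second level, the crew-wide loop, amounts to "verify, redo once with the objections, then answer or refuse". The sketch below shows that control flow under stated assumptions: `Verdict`, `answer_or_refuse`, and the three injected callables are hypothetical names, not the project's interface.

```python
from dataclasses import dataclass, field


@dataclass
class Verdict:
    supported: bool
    objections: list = field(default_factory=list)


def answer_or_refuse(question, gather_sql, verify, synthesize):
    """One evidence round, plus at most one redo driven by verifier objections."""
    evidence = gather_sql(question, objections=[])
    verdict = verify(question, evidence)
    if not verdict.supported:
        # Thread the specific complaints into the SQL worker's context, once.
        evidence = gather_sql(question, objections=verdict.objections)
        verdict = verify(question, evidence)
    if not verdict.supported:
        missing = "; ".join(verdict.objections) or "the evidence is insufficient"
        return f"I cannot answer this reliably: {missing}"
    return synthesize(question, evidence)
```

With scripted callables, the $50-versus-$75 incident described above becomes a two-line regression test: the first round guesses 50, the verifier objects, and the redo uses 75.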

Honesty is a feature

The verifier sits between the workers and the final answer. If the SQL worker exhausted its retries, or the policy passage that defines a threshold was not found, the crew does not synthesize confident prose over the gap. It answers:

I cannot answer this reliably: the free-shipping threshold from the policy
is missing from the evidence.

Every fact in a successful answer carries a citation marker ([sql-1], [refund-policy]) pointing at the evidence that produced it, and --verbose prints the full trace: the plan, every SQL attempt including the failed ones, the retrieved passages, and the verifier's verdict.
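
One way to audit the citation markers is to check that each one refers to evidence that was actually collected. This is an illustrative sketch only: the README does not say the project performs this check, and the regular expression and function name are assumptions based on the two marker examples above.

```python
import re

# Matches markers shaped like [sql-1] or [refund-policy].
CITATION = re.compile(r"\[([a-z0-9]+(?:-[a-z0-9]+)*)\]")


def dangling_citations(answer, evidence_ids):
    """Return citation markers in `answer` that match no collected evidence."""
    cited = CITATION.findall(answer)
    return sorted(set(cited) - set(evidence_ids))
```

An empty result means every marker in the answer resolves to a piece of evidence in the trace.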

The sample company

data/generate.py builds a deterministic e-commerce database (customers, products, orders, order items) and data/docs/ holds five policy documents (shipping, refunds, VIP program, support SLAs, vendor terms). They interlock on purpose: the docs state the business rules, the database holds the transactions those rules apply to. Questions that need both:

| question | docs provide | database provides |
| --- | --- | --- |
| "How many May orders shipped free?" | the $75 threshold | the order totals |
| "Which customers are VIP?" | the $2,000 lifetime rule | per-customer spend |
| "How many orders could still be cancelled?" | "shipped cannot be cancelled" | status counts |

Setup

python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
python data/generate.py            # builds data/company.db
export ANTHROPIC_API_KEY=sk-ant-...

Run

python main.py "Which policy decides if an order can be cancelled, and how many are still cancellable?"
python main.py --verbose "How many customers qualify as VIP under our program rules?"
python main.py                      # interactive mode

Tests

The full orchestration runs in tests with no API key and no network: the model is replaced by a scripted double, so plan validation, worker dispatch, SQL self-correction, the verifier gate and honest refusal are all asserted deterministically.
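
A scripted double of this kind can be very small. The sketch below is an assumption about its general shape, not the class in app/llm.py: `ScriptedModel`, its `complete(role, prompt)` method, and the `plan_workers` stand-in for code under test are hypothetical.

```python
class ScriptedModel:
    """Test double: returns canned replies in order and records every prompt."""

    def __init__(self, replies):
        self._replies = list(replies)
        self.prompts = []

    def complete(self, role, prompt):
        self.prompts.append((role, prompt))
        if not self._replies:
            raise AssertionError(f"no scripted reply left for role {role!r}")
        return self._replies.pop(0)


def plan_workers(model, question):
    """Stand-in for code under test: asks the planner for a comma-separated plan."""
    reply = model.complete("planner", question)
    return [name.strip() for name in reply.split(",") if name.strip()]


def test_planner_reply_is_parsed_without_network():
    model = ScriptedModel(["docs, sql"])
    assert plan_workers(model, "How many May orders shipped free?") == ["docs", "sql"]
    assert model.prompts == [("planner", "How many May orders shipped free?")]
```

Because the replies are fixed in advance, a test like this asserts on the orchestration logic alone, and running out of scripted replies fails loudly instead of silently calling a real model.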

python -m tests.test_workers
python -m tests.test_crew
# or: pytest

Project layout

| path | job |
| --- | --- |
| app/crew.py | the orchestrator: plan, dispatch, verify, answer or refuse |
| app/workers.py | the SQL worker (self-correcting) and the doc worker (BM25) |
| app/llm.py | all four model roles behind one interface (+ scripted test double) |
| app/sqlsafe.py | read-only SQL validation and LIMIT injection |
| app/db.py | read-only connection, query timeout, live schema description |
| data/generate.py | deterministic company database |
| data/docs/ | the policy corpus the doc worker searches |
| tests/ | workers and full-crew orchestration tests, all key-free |

Design notes

  • Workers gather evidence; they do not answer. The separation keeps each piece testable alone and makes the final answer auditable.
  • Same defence-in-depth as a standalone SQL agent. The database connection is opened with SQLite's mode=ro, so a write is rejected by the database engine itself even if it slips past the validator.
  • Swap the company, keep the crew. Point the config at another SQLite file and another folder of .md policies; the schema is read live and the corpus is indexed at startup.
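
The defence-in-depth bullet can be illustrated with the standard `sqlite3` module. This is a simplified sketch, not the contents of app/sqlsafe.py or app/db.py: the function names and the 200-row default are assumptions, and the string checks are deliberately naive (a real validator would need to handle string literals, comments, and subqueries).

```python
import re
import sqlite3
from pathlib import Path


def open_read_only(db_path):
    """Open SQLite through a URI with mode=ro; writes raise OperationalError."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def is_single_select(sql):
    """First line of defence: one statement, starting with SELECT or WITH."""
    statement = sql.strip().rstrip(";")
    if ";" in statement:
        return False  # more than one statement
    return re.match(r"\s*(select|with)\b", statement, re.IGNORECASE) is not None


def ensure_limit(sql, max_rows=200):
    """Append a LIMIT when the statement has none (naive textual check)."""
    statement = sql.strip().rstrip(";").strip()
    if re.search(r"\blimit\s+\d+", statement, flags=re.IGNORECASE):
        return statement
    return f"{statement} LIMIT {max_rows}"
```

The layers are independent: even if a write statement got past `is_single_select`, the connection returned by `open_read_only` would still refuse it.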
