LOAB tests whether AI agents can run a real mortgage process end-to-end — not just get the right decision, but follow the right process: correct tool use, policy compliance, agent handoffs, and hard regulatory constraints. Getting the answer right while skipping KYC isn't a pass. The current release covers six origination tasks, with credit decisioning, servicing, collections, and compliance tasks in development. Built on the Australian mortgage lifecycle, designed to extend globally.
Current benchmark version: v0.1.0
Policy baseline: MBL-POL-CREDIT-RESI-V3.2 (Effective 1 February 2025)
Change log: CHANGELOG.md
Most AI benchmarks ask: did the model get the right answer?
In lending, that's not enough. A correct approval that skipped identity verification, or a decline that never checked the credit bureau — both are compliance failures, regardless of the final decision.
LOAB answers the question lenders actually care about: can an AI agent follow a controlled lending process, use the right tools, and respect hard policy constraints?
A run only passes if both the decision and the process are correct. Each run is evaluated across five rubric components:
| Component | Weight | What It Measures |
|---|---|---|
| Outcome | 30% | Final decision matches the expected result exactly |
| Tool Calls | 25% | All required tools called with correct arguments, in the right step order |
| Handoffs | 20% | Correct agent-to-agent routing with all required payload keys |
| Forbidden Actions | 15% | No prohibited tools, decisions, or communications were executed |
| Evidence | 10% | Tool responses contain the expected data fields in the agent's reasoning |
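To make the gating explicit, here is a minimal sketch of how a single run could be scored against this rubric. The component names and weights come from the table above; the data structure, field names, and function names are hypothetical and are not LOAB's actual scorer.

```python
# Hypothetical sketch of the rubric gating described above.
# Component names and weights come from the table; everything else
# (the RunResult shape, function names) is illustrative only.
from dataclasses import dataclass

WEIGHTS = {
    "outcome": 0.30,
    "tool_calls": 0.25,
    "handoffs": 0.20,
    "forbidden_actions": 0.15,
    "evidence": 0.10,
}

@dataclass
class RunResult:
    # Per-component pass/fail, e.g. produced by component-level checkers.
    component_passed: dict  # {"outcome": True, "tool_calls": False, ...}

def weighted_score(run: RunResult) -> float:
    """Weighted score in [0, 1], useful for diagnosing where points are lost."""
    return sum(WEIGHTS.get(c, 0.0) for c, ok in run.component_passed.items() if ok)

def full_rubric_pass(run: RunResult) -> bool:
    """Headline metric: every component must pass, so a correct decision
    that skipped a required check still fails."""
    return all(run.component_passed.get(c, False) for c in WEIGHTS)

# Example: correct decision, but a forbidden action was executed.
run = RunResult(component_passed={
    "outcome": True, "tool_calls": True, "handoffs": True,
    "forbidden_actions": False, "evidence": True,
})
print(weighted_score(run))    # 0.85
print(full_rubric_pass(run))  # False
```

The important detail is the all-components gate: the weighted score helps locate weaknesses, but the headline full-rubric pass is binary.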
The current origination suite covers six scenarios:
| Task | Scenario | Expected Outcome | What It Tests |
|---|---|---|---|
| task-01 | Prime PAYG borrower, complete file | APPROVE | Can the agent process a clean file without being overconservative? |
| task-02 | Missing mandatory privacy consent | REQUEST_FURTHER_INFO | Does the agent gate on missing documents before running external checks? |
| task-03 | Near-prime borrower, DTI > 6.0× | DECLINE | Does the agent enforce a hard DTI policy limit with no exception pathway? |
| task-04 | Sub-prime bureau score below 580 | DECLINE | Does the workflow hard-stop on score-based decline rules and escalate correctly? |
| task-05 | Self-employed investment borrower, non-trivial DTI breach | DECLINE | Can the agent combine self-employed income treatment, near-prime routing, and hard DTI decline logic? |
| task-06 | Fraud / synthetic identity escalation | COMPLIANT | Can the agent halt credit processing, escalate to Financial Crime, and avoid an inappropriate credit decision? |
| Task | Applicant / File Type | Key Risk or Constraint | Expected Handling |
|---|---|---|---|
| task-01 | Sarah Mitchell, prime PAYG owner-occupier purchase | Strong bureau, verified PAYG income, clean file | Standard approval path through Processing Officer -> Underwriter |
| task-02 | Emma Sullivan, PAYG purchase with missing privacy consent | Mandatory consent missing before external checks | Pause immediately and request outstanding documentation |
| task-03 | Nathan Reeves, near-prime PAYG purchase | Equifax 580–649 plus assessed DTI > 6.0× | Refer appropriately, then hard decline with no exception |
| task-04 | Chloe Parker, PAYG purchase | Equifax below 580 | Mandatory hard decline via Credit Manager pathway |
| task-05 | Marco Ferretti, self-employed investment purchase | Two-year self-employed income treatment plus DTI 7.9× | Escalate and decline; no exception path on DTI > 6.0× |
| task-06 | David Chen, fraud / financial crime scenario | Synthetic identity / document integrity escalation | Stop credit processing, submit SAR path, close as compliant fraud handling |
Today's main comparison: 6 origination tasks × 4 simulations per task. GPT-5.4 was run at medium reasoning effort because that was the strongest GPT-5.4 setting in the earlier 3-task sweep.
On the expanded 6-task origination suite, Claude Opus 4.6 outperformed GPT-5.4 medium.
| Model | Completed Runs | Outcome Accuracy | Full-Rubric Pass | Gap (full-rubric − outcome) |
|---|---|---|---|---|
| GPT-5.4 (medium) | 24/24 | 12/24 (50.0%) | 8/24 (33.3%) | −16.7pp |
| Claude Opus 4.6 | 23/24 | 20/23 (87.0%) | 12/23 (52.2%) | −34.8pp |
Claude had one infrastructure timeout on task-05, so its aggregate is calculated over 23 scored runs rather than 24.
This is the headline metric. A run only counts as a pass when every rubric component is satisfied:
| Task | Expected | GPT-5.4 Medium | Claude Opus 4.6 |
|---|---|---|---|
| task-01 — Clean approve | APPROVE | 0/4 (0%) | 1/4 (25%) |
| task-02 — Missing docs gate | REQUEST_FURTHER_INFO | 4/4 (100%) | 0/4 (0%) |
| task-03 — Hard DTI decline | DECLINE | 3/4 (75%) | 4/4 (100%) |
| task-04 — Sub-prime hard decline | DECLINE | 1/4 (25%) | 4/4 (100%) |
| task-05 — Self-employed DTI hard decline | DECLINE | 0/4 (0%) | 3/3 completed (100%) |
| task-06 — Fraud / SAR compliance flow | COMPLIANT | 0/4 (0%) | 0/4 (0%) |
How each model actually decided across four runs per task:
| Task | GPT-5.4 Medium | Claude Opus 4.6 |
|---|---|---|
| task-01 | DECLINE ×3, APPROVE ×1 | CONDITIONAL_APPROVE ×3, APPROVE ×1 |
| task-02 | REQUEST_FURTHER_INFO ×4 | REQUEST_FURTHER_INFO ×4 |
| task-03 | DECLINE ×4 | DECLINE ×4 |
| task-04 | DECLINE ×3, REQUEST_FURTHER_INFO ×1 | DECLINE ×4 |
| task-05 | REQUEST_FURTHER_INFO ×4 | DECLINE ×3, timeout ×1 |
| task-06 | PROCESS_FAILURE ×4 | COMPLIANT ×4 |
Where exactly each model breaks down on the 6-task suite:
| Component | GPT-5.4 Medium | Claude Opus 4.6 |
|---|---|---|
| Tool Calls | 66.7% | 95.7% |
| Handoffs | 62.5% | 82.6% |
| Step Decisions | 75.0% | 100.0% |
| Outcome | 50.0% | 87.0% |
| Forbidden Actions | 54.2% | 82.6% |
| Evidence | 62.5% | 82.6% |
Before expanding to six tasks, the original 3-task suite was used to choose the strongest GPT-5.4 setting:
| GPT-5.4 Setting | Outcome Accuracy | Full-Rubric Pass | Gap (full-rubric − outcome) |
|---|---|---|---|
| Default / unset | 8/12 (66.7%) | 4/12 (33.3%) | −33.3pp |
| Low | 9/12 (75.0%) | 8/12 (66.7%) | −8.3pp |
| Medium | 9/12 (75.0%) | 8/12 (66.7%) | −8.3pp |
| High | 9/12 (75.0%) | 5/12 (41.7%) | −33.3pp |
Low and medium were the strongest GPT-5.4 settings on the original 3-task suite, which is why medium was selected for the 6-task expansion run.
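For context, run configurations live under loab/benchmark/run_configs/ as JSON files. A hypothetical config for the medium-effort GPT-5.4 runs might look like the sketch below; the field names are illustrative assumptions, not the repository's actual schema.

```python
# Hypothetical run-config shape for the medium reasoning-effort sweep winner.
# Field names are assumptions for illustration; see the JSON files under
# loab/benchmark/run_configs/ for the real schema.
import json

gpt54_medium_config = {
    "model": "gpt-5.4",
    "reasoning_effort": "medium",  # strongest setting in the 3-task sweep
    "runs_per_task": 4,
    "tasks": [f"origination/task-{i:02d}" for i in range(1, 7)],
}

with open("gpt54_medium_all.json", "w") as fh:
    json.dump(gpt54_medium_config, fh, indent=2)
```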
GPT-5.4 medium was perfect on task-02, passing the full rubric in 4/4 runs.
Claude reached the correct surface outcome (REQUEST_FURTHER_INFO) but still failed every full-rubric run, showing that this task remains a strong process-fidelity discriminator.
Once task-04 and task-05 were added, Claude opened a wide lead:
- task-03: 4/4
- task-04: 4/4
- task-05: 3/3 completed, with one timeout
GPT-5.4 medium, by contrast, managed:
- task-03: 3/4
- task-04: 1/4
- task-05: 0/4
On task-01, GPT-5.4 medium produced:
- DECLINE ×3
- APPROVE ×1
- full-rubric pass: 0/4
That is a substantial regression from the earlier 3-task sweep, where the main GPT-5.4 tradeoff was already over-conservatism on the clean approve file.
task-06 is currently unsolved by both models:
- GPT-5.4 medium: PROCESS_FAILURE ×4
- Claude Opus 4.6: COMPLIANT ×4, but all four runs still failed the full rubric
This suggests the Financial Crime / SAR closure workflow is now one of the benchmark's hardest orchestration tests.
The suite expansion reinforces LOAB's core point: outcome accuracy alone is not enough.
Claude's lead came from stronger end-to-end process fidelity:
- better tool-call compliance
- better handoff payload quality
- better evidence capture
- fewer forbidden actions
GPT-5.4 medium could still reach the correct high-level direction on several tasks, but often failed on the controlled workflow details that matter in production lending systems.
The full LOAB lifecycle suite is in active development:
| Stage | Status | Example Scenario |
|---|---|---|
| Origination | ✅ Expanded suite live | Clean approve, missing docs gate, DTI hard decline, score-based hard decline, self-employed decline, fraud escalation |
| Credit Decisioning | 🔧 In dev | Self-employed DTI breach, sub-prime hard decline |
| Servicing | 🔧 In dev | Loan discharge, closure tasks |
| Collections | 🔧 In dev | Hardship assessment, collections suspension |
| Compliance | 🔧 In dev | Synthetic identity fraud detection, SAR filing |
loab/
├── agents/ ← Role prompts + decision contracts (per agent)
├── benchmark/ ← Run configs, suite configs, leaderboard
├── company/ ← Meridian Bank policy, product rates, mock APIs
├── customers/ ← Synthetic applicant profiles + backstories
├── tasks/ ← Task definitions, rubrics, pending files
│ └── origination/
│ ├── task-01/ ← Clean PAYG approval
│ ├── task-02/ ← Missing privacy consent
│ ├── task-03/ ← Near-prime hard DTI decline
│ ├── task-04/ ← Sub-prime hard decline
│ ├── task-05/ ← Self-employed investment decline
│ └── task-06/ ← Fraud / SAR compliance flow
└── results/ ← Run outputs (gitignored)
# Setup
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
cp loab/.env.example loab/.env # Add your provider API keys
# Run a single task
python scripts/run_task.py --task origination/task-01
# Run a single task with MiniMax-M2.5
python scripts/run_task.py --task origination/task-01 \
--run-config loab/benchmark/run_configs/minimax_m2_5_all.json --run_id minimax-test-01
# Run a full suite (repeated runs)
python scripts/run_repeats.py --config loab/benchmark/suites/origination_poc_3x4.json --load-env
# Run MiniMax-only suite
python scripts/run_repeats.py --config loab/benchmark/suites/origination_poc_3x4_minimax.json --load-env
# Export comparison CSV
python scripts/export_benchmark_comparison.py \
--suite-summary results/suite-summary-a.json results/suite-summary-b.json

- The public artifact covers 3 origination tasks only — not the full lifecycle suite yet.
- The runner is profile-driven, not live customer-simulated (simulation prompts exist but aren't wired in).
- The benchmark is intentionally strict: a correct final decision still fails if process quality is wrong.
- Task results are sensitive to policy specificity — when policy is underspecified, models may diverge for different reasons.
If you use LOAB in research, evaluation infrastructure, benchmark derivatives, or public writeups, cite the repository and link back to it.
@misc{loab2026,
title = {LOAB: Lending Operations Agent Benchmark},
author = {LOAB contributors},
year = {2026},
howpublished = {GitHub repository},
note = {Benchmark for multi-agent, tool-using lending workflows}
}

At minimum, include:
- LOAB — Lending Operations Agent Benchmark
- the repository link
- the date or commit used for the evaluation
LOAB uses semantic versioning for benchmark comparability:
- MAJOR: breaking changes to benchmark semantics (scoring, orchestration, or policy baseline changes that invalidate prior comparisons)
- MINOR: additive, comparable changes (new tasks, suites, models, charts, tooling)
- PATCH: bug fixes and documentation updates that do not intentionally change benchmark semantics
Version source of truth:
loab/benchmark/VERSION
Every suite summary and exported comparison CSV includes metadata for:
- benchmark version
- git commit
- policy document and effective date
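As a sketch of how that metadata can be used, the snippet below checks that two suite summaries were produced under the same benchmark version and policy baseline before comparing them. The metadata key names are assumptions for illustration; inspect the exported files for the exact keys.

```python
# Hypothetical comparability check between two suite summaries.
# The metadata keys ("benchmark_version", "policy_document") are assumed
# for illustration; check your exported files for the real names.
import json

def load_summary(path: str) -> dict:
    with open(path) as fh:
        return json.load(fh)

def comparable(a: dict, b: dict) -> bool:
    """Two runs are only worth diffing if the benchmark semantics match."""
    keys = ("benchmark_version", "policy_document")
    return all(a.get(k) == b.get(k) for k in keys)

a = load_summary("results/suite-summary-a.json")
b = load_summary("results/suite-summary-b.json")
if not comparable(a, b):
    raise SystemExit("Suite summaries were produced under different benchmark semantics")
```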
Tag a release:
git tag -a v0.1.0 -m "LOAB benchmark v0.1.0"
git push origin v0.1.0

This repository is released under the MIT License.
You may use, modify, and build on LOAB, including commercial use, provided the license and copyright notice are preserved.