Lease Abstraction Assistant

An internal web app for a real estate company to upload lease PDFs, extract the core lease abstraction fields with AI (Gemini), review & correct the results, approve the final record, and export approved records as CSV or Excel.

What the app does

Upload a commercial lease PDF.
The backend stores the document and extracts its text (with OCR fallback for scans).
The extracted text is sent to the Gemini Flash model, which returns strict JSON for the 10 Phase 1 lease fields.
A human reviewer verifies/corrects every field in an editable, grouped table.
The reviewer saves a draft or approves the record.
Only approved records can be exported as CSV or Excel (one row per document).

The 10 Phase 1 fields

#	Field	Section
1	Landlord / Property Owner Name	Parties
2	Tenant / Business Name	Parties
3	Guarantor Name(s)	Parties
4	Mailing Addresses for all parties	Contact & Address
5	Contact Information (phone & email)	Contact & Address
6	Effective Date / Lease Start Date	Lease Dates
7	Lease End Date	Lease Dates
8	Lease Length	Lease Dates
9	Renewal Option Details	Options & Terms
10	Holdover Terms	Options & Terms
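The table above maps naturally onto a small registry in code. The following is a minimal sketch of how fields.py might hold the 10 canonical fields and what normalize_fields() could do. The name CANONICAL_FIELDS, the "found" status value, and the exact fallback rules are assumptions made for illustration, not the project's actual implementation.

```python
# Hypothetical sketch of a canonical field registry; not the project's actual fields.py.
CANONICAL_FIELDS = [
    ("Landlord / Property Owner Name", "Parties"),
    ("Tenant / Business Name", "Parties"),
    ("Guarantor Name(s)", "Parties"),
    ("Mailing Addresses for all parties", "Contact & Address"),
    ("Contact Information (phone & email)", "Contact & Address"),
    ("Effective Date / Lease Start Date", "Lease Dates"),
    ("Lease End Date", "Lease Dates"),
    ("Lease Length", "Lease Dates"),
    ("Renewal Option Details", "Options & Terms"),
    ("Holdover Terms", "Options & Terms"),
]


def normalize_fields(raw_fields):
    """Return exactly one entry per canonical field, in canonical order.

    Entries with unknown field names are dropped. Fields that are absent
    or have an empty value come back blank with confidence 0 and status
    "missing", so nothing is ever invented.
    """
    by_name = {f.get("fieldName"): f for f in raw_fields if isinstance(f, dict)}
    normalized = []
    for name, section in CANONICAL_FIELDS:
        found = by_name.get(name, {})
        value = str(found.get("value") or "").strip()
        normalized.append({
            "fieldName": name,
            "value": value,
            "confidence": float(found.get("confidence") or 0) if value else 0.0,
            "evidence": str(found.get("evidence") or "") if value else "",
            # "needs_review" as the default for found values is an assumption.
            "status": found.get("status", "needs_review") if value else "missing",
            "section": section,
        })
    return normalized
```

Keeping the section on each normalized entry would let the Review page group the table into its 4 sections without a second lookup.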

Pages

Upload & Dashboard — summary cards (total / processed / needs review / approved), drag-and-drop PDF upload, recent uploads table.
Document History — all uploaded documents with status, extraction method and text quality.
Document Review — metadata, summary, warning banners, editable extraction table grouped into 4 sections, a Needs Review panel, and action buttons (Save Draft, Approve Record, Back to History, Export Approved Data).
Export Data — download approved lease records as CSV or Excel.

Architecture

backend/
  server.py                 FastAPI app (mounts routers, /api/health)
  db.py                     Mongo connection + ObjectId helpers
  fields.py                 10 canonical fields, sections, normalize_fields()
  processing.py             upload pipeline: extract text -> AI -> persist
  serializers.py            Mongo doc -> JSON (no ObjectId leakage)
  seed.py                   seeds 6 realistic demo documents
  routes/
    upload.py               POST /api/upload
    documents.py            list / stats / get / draft / approve
    export.py               GET /api/export/approved (CSV) and /excel (XLSX)
  services/
    extraction.py           pdfplumber text extraction + OCR fallback (modular)
    ai_extraction.py        Gemini (active) + Claude placeholder (modular)
    csv_export.py           builds the shared 14-column export rows and CSV
    excel_export.py         builds the XLSX workbook
frontend/
  src/pages/                Dashboard, History, Review, ExportData
  src/components/           Layout (sidebar+header), ui (badges/buttons/etc.)
  src/lib/api.js            axios client (uses REACT_APP_BACKEND_URL)

Data model (MongoDB, `documents` collection)

file_name, upload_date, status, extraction_method, text_quality_score, char_count, summary, raw_text, overall_status, warnings[]
fields[] — embedded array, one per canonical field: { fieldName, value, confidence, evidence, status, section }
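To make the shape above concrete, here is an illustrative document written as a Python dict. Only the key names come from the data model; every value, including the extraction_method label and the two status values, is made up for the example and may not match what the backend actually stores.

```python
from datetime import datetime, timezone

# Illustrative only: key names follow the data model, all values are invented.
raw_text = "This Lease is made between Example Landlord LLC and Example Tenant Inc."

example_document = {
    "file_name": "sample_lease.pdf",
    "upload_date": datetime(2025, 1, 15, tzinfo=timezone.utc),
    "status": "needs_review",          # assumed status value
    "extraction_method": "pdfplumber",  # assumed label
    "text_quality_score": 0.92,         # assumed 0..1 scale
    "char_count": len(raw_text),
    "summary": "Commercial lease between Example Landlord LLC and Example Tenant Inc.",
    "raw_text": raw_text,
    "overall_status": "needs_review",   # assumed status value
    "warnings": [],
    "fields": [
        {
            "fieldName": "Landlord / Property Owner Name",
            "value": "Example Landlord LLC",
            "confidence": 0.95,
            "evidence": "between Example Landlord LLC and",
            "status": "found",           # assumed status value
            "section": "Parties",
        },
        {
            # A field the model could not find stays blank with confidence 0.
            "fieldName": "Holdover Terms",
            "value": "",
            "confidence": 0,
            "evidence": "",
            "status": "missing",
            "section": "Options & Terms",
        },
    ],
}
```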

Running locally

Services:

Backend: FastAPI on :8001 (all routes prefixed /api)
Frontend: React (CRA) on :3000
Database: MongoDB

# Create and activate the backend environment
python3.13 -m venv .venv
source .venv/bin/activate

# Install backend dependencies
pip install -r backend/requirements.txt

# Install frontend dependencies
cd frontend
npm install
cd ..

# Optional: seed demo data
cd backend
python seed.py

# Start the backend
uvicorn server:app --reload --port 8001

In a second terminal:

cd frontend
npm start

Deploying on Vercel

The root vercel.json deploys the React frontend and FastAPI backend as Vercel Services under one domain. The frontend uses /api for backend requests when REACT_APP_BACKEND_URL is not set.

Configure these environment variables in the Vercel project:

MONGO_URL: a hosted MongoDB connection string, such as MongoDB Atlas. Do not use localhost.
DB_NAME: MongoDB database name.
GEMINI_API_KEY: Google Gemini API key.
AI_PROVIDER=gemini
GEMINI_MODEL=gemini-2.5-flash

A MongoDB instance running locally (for example, in a Docker container) is reachable only from your machine, so a Vercel deployment cannot connect to it.

Environment variables:

backend/.env: MONGO_URL, DB_NAME, AI_PROVIDER=gemini, GEMINI_MODEL, GEMINI_API_KEY, ANTHROPIC_API_KEY (future)
frontend/.env: REACT_APP_BACKEND_URL

Upload → Review → Approval → Export flow

Upload a PDF on the dashboard (drag-drop or click). The app processes it and routes you to the Review page.
Review: edit any value. Each field shows value, confidence, evidence snippet and status. Missing fields are blank with confidence 0 — the app never invents values.
Save Draft stores your corrections; Approve Record locks the record (read-only) and marks it approved.
Export Data → choose CSV or Excel and download. Only approved records are included; one row per document with these columns: File Name, Upload Date, Status, Extraction Method, and the 10 lease fields.
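The export step described above (approved records only, one row per document, 14 columns) can be sketched with the standard csv module. This is a minimal illustration under assumptions: the function names build_export_rows and rows_to_csv are hypothetical, and "approved" is assumed to be the stored status value. The real services/csv_export.py may be organized differently.

```python
import csv
import io

FIELD_NAMES = [
    "Landlord / Property Owner Name",
    "Tenant / Business Name",
    "Guarantor Name(s)",
    "Mailing Addresses for all parties",
    "Contact Information (phone & email)",
    "Effective Date / Lease Start Date",
    "Lease End Date",
    "Lease Length",
    "Renewal Option Details",
    "Holdover Terms",
]

# 4 metadata columns + 10 lease fields = 14 columns.
EXPORT_COLUMNS = ["File Name", "Upload Date", "Status", "Extraction Method"] + FIELD_NAMES


def build_export_rows(documents):
    """Build one row per approved document; all other documents are skipped."""
    rows = []
    for doc in documents:
        if doc.get("status") != "approved":  # assumed status value
            continue
        values = {f["fieldName"]: f.get("value", "") for f in doc.get("fields", [])}
        row = [
            doc.get("file_name", ""),
            str(doc.get("upload_date", "")),
            doc.get("status", ""),
            doc.get("extraction_method", ""),
        ]
        # Fields missing from the record are exported as empty cells.
        row += [values.get(name, "") for name in FIELD_NAMES]
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize the header plus rows to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(EXPORT_COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()
```

Building the rows once and serializing them separately would let the Excel export reuse the same rows, which matches the "shared 14-column export rows" noted in the architecture listing.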

How Gemini extraction works

All AI calls happen in the backend only — API keys are never exposed to the frontend.
services/ai_extraction.py builds a strict prompt and calls Gemini Flash (gemini-2.5-flash) via Google's official google-genai SDK.
Prompt rules: strict JSON only, no markdown, no guessing/inventing, use only the lease text, include short evidence snippets, and mark unfound fields as missing.
The response is validated against the required schema. If Gemini returns invalid JSON or the call fails, the app does not crash: all fields fall back to missing and the document is marked needs_review. The provider, text length, and number of fields returned are logged.
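The validate-or-fall-back behavior described above can be sketched without any SDK calls. The example below assumes the model is asked for a JSON object of the form {"fields": [...]}; that envelope, the REQUIRED_KEYS set, and the function name parse_model_response are assumptions for illustration, not the schema the backend actually enforces.

```python
import json

# Assumed per-field keys, taken from the data model's embedded field shape.
REQUIRED_KEYS = {"fieldName", "value", "confidence", "evidence", "status"}


def parse_model_response(raw_text, field_names):
    """Return (fields, ok).

    On any parse or schema problem, every field falls back to "missing"
    and ok is False, so the caller can mark the document needs_review
    instead of crashing.
    """
    try:
        payload = json.loads(raw_text)
        fields = payload["fields"]
        valid = isinstance(fields, list) and all(
            isinstance(f, dict) and REQUIRED_KEYS <= f.keys() for f in fields
        )
        if not valid:
            raise ValueError("response does not match the expected schema")
        return fields, True
    except (ValueError, KeyError, TypeError):
        # json.JSONDecodeError is a ValueError subclass, so it is covered here.
        fallback = [
            {"fieldName": name, "value": "", "confidence": 0,
             "evidence": "", "status": "missing"}
            for name in field_names
        ]
        return fallback, False
```

A reply wrapped in markdown fences or cut off mid-object fails json.loads and takes the fallback path, which is consistent with the "strict JSON only, no markdown" prompt rule.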

Where to add `GEMINI_API_KEY`

Set it in backend/.env:

GEMINI_API_KEY=your-key-here
AI_PROVIDER=gemini
GEMINI_MODEL=gemini-2.5-flash

If GEMINI_API_KEY is empty, the backend returns a clear error: "Gemini API key is not configured."
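A guard of this kind could look roughly like the sketch below. The ConfigurationError class and the get_gemini_api_key helper are hypothetical names; only the environment variable and the error message come from this README.

```python
import os


class ConfigurationError(RuntimeError):
    """Raised when a required setting is absent (hypothetical helper class)."""


def get_gemini_api_key(env=None):
    """Return the configured key, or raise with the documented message."""
    env = os.environ if env is None else env
    key = (env.get("GEMINI_API_KEY") or "").strip()
    if not key:
        raise ConfigurationError("Gemini API key is not configured.")
    return key
```

Passing env explicitly makes the check easy to exercise in tests without touching the real process environment.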

Replacing Gemini with Claude (future)

The AI layer is modular. services/ai_extraction.py already contains a _claude_extract placeholder that will:

read ANTHROPIC_API_KEY from the environment,
call Claude Sonnet from the backend,
send the extracted lease text and request strict JSON,
reuse the same JSON schema as Gemini.

Gemini remains the active provider by default. Claude extraction is not implemented yet and does not block the app.
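One way to keep the provider swappable is a small dispatch table keyed by AI_PROVIDER. The sketch below is an assumption about how that could be wired: the PROVIDERS dict, the extract_fields function, the "claude" provider value, and the stand-in body of _gemini_extract are all illustrative, and no real SDK is called.

```python
import os


def _gemini_extract(lease_text):
    # Stand-in for the real Gemini call; returns a dummy result for the sketch.
    return {"provider": "gemini", "chars": len(lease_text)}


def _claude_extract(lease_text):
    # Placeholder: mirrors the README's statement that Claude is not implemented yet.
    raise NotImplementedError("Claude extraction is not implemented yet.")


PROVIDERS = {"gemini": _gemini_extract, "claude": _claude_extract}


def extract_fields(lease_text, provider=None):
    """Dispatch to the configured provider, defaulting to Gemini."""
    name = (provider or os.environ.get("AI_PROVIDER") or "gemini").lower()
    try:
        handler = PROVIDERS[name]
    except KeyError:
        raise ValueError(f"Unknown AI provider: {name}") from None
    return handler(lease_text)
```

Because both handlers take the lease text and are expected to return the same JSON shape, switching providers would be a configuration change rather than a change to the pipeline.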

Future Azure Document Intelligence OCR

OCR is isolated in services/extraction.py::extract_text_ocr (currently Tesseract-based and optional). It can be swapped for Azure Document Intelligence without touching the rest of the pipeline — the function just needs to return extracted text for the given PDF bytes.
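The swap works because the OCR step only has to satisfy a "PDF bytes in, text out" contract. The sketch below illustrates that idea with injected callables; extract_with_fallback, the min_chars threshold of 200, and the "native" / "ocr" / "failed" labels are assumptions chosen for the example, not the behavior of services/extraction.py.

```python
from typing import Callable, Tuple

# Any OCR backend (Tesseract today, Azure Document Intelligence later)
# only needs to match this signature.
TextExtractor = Callable[[bytes], str]


def extract_with_fallback(
    pdf_bytes: bytes,
    extract_native: TextExtractor,
    extract_ocr: TextExtractor,
    min_chars: int = 200,  # assumed threshold for "enough embedded text"
) -> Tuple[str, str]:
    """Return (text, method), where method is "native", "ocr", or "failed"."""
    text = extract_native(pdf_bytes)
    if len(text.strip()) >= min_chars:
        return text, "native"
    try:
        # Too little embedded text: treat the PDF as a scan and try OCR.
        return extract_ocr(pdf_bytes), "ocr"
    except Exception:
        # OCR unavailable or broken: degrade gracefully instead of raising.
        return text, "failed"
```

With this shape, replacing the OCR engine means passing a different extract_ocr callable; the caller and the rest of the pipeline stay unchanged.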

Limitations of OCR and AI extraction

OCR: Tesseract is a basic OCR engine suited to the MVP. Scanned or low-quality PDFs may yield low text quality; these documents are flagged with an "OCR was used. Please review extracted values carefully." banner and routed to needs_review. (Tesseract is not installed by default in this build; when it is unavailable, the service degrades gracefully, marks extraction for such documents as failed, and routes them to needs_review.)
AI extraction: The model only uses the provided lease text and will not invent values. Ambiguous or unusual lease language can produce needs_review items or lower confidence — human review is always required before approval. Extraction is bounded to the first ~30k characters of text per document.
