A blog post demo project demonstrating how to build a multimodal ReAct agent from scratch using LangGraph and Gemini. This agent can analyze invoice and receipt images and save the data to Excel.
- Multimodal OCR: Invoice/receipt analysis with Gemini's vision capabilities
- ReAct Pattern: Reasoning + acting loop with LangGraph
- Excel Integration: Save and query analyzed invoices
- CLI Interface: User-friendly command line experience with Typer
# Install dependencies
uv sync
# Copy the environment template and add your API key
cp .env.template .env
# Edit .env and add your Gemini API key# Analyze an invoice
uv run python main.py analyze invoice.png
# Analyze and save to Excel
uv run python main.py analyze invoice.png --saveuv run python main.py listuv run python main.py ask "Which invoice has the highest total?"uv run python main.py chat├── main.py # CLI entry point
├── src/
│ └── agent/
│ ├── graph.py # LangGraph agent definition
│ ├── nodes.py # Agent nodes
│ ├── state.py # Agent state definition
│ ├── tools.py # Excel tools
│ └── prompts.py # System prompt
└── invoices.xlsx # Saved invoices
- PNG
- JPEG/JPG
- GIF
- WebP
- Python 3.12+
- Google Gemini API key