A Retrieval-Augmented Generation (RAG) system applied to technical and regulatory documents from the Brazilian electricity sector (ANEEL). The system combines hybrid search (BM25 + dense embeddings) with a LangGraph-based agent that performs query expansion (HyDE + reformulations), multi-step retrieval, answer generation using Claude, and automatic faithfulness verification. The interface is exposed via FastAPI and Streamlit.
PDF, XLSX, and other documents
│
▼
┌───────────────────────────────────────────────────────┐
│ Parsing & Indexing │
│ PyMuPDF and others → Chunks → Embeddings → Qdrant │
│ BM25Retriever (local sparse index) │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ Hybrid Retrieval │
│ BM25 + Dense → RRF Fusion → Reranker │
│ (CrossEncoderReranker) │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│ LangGraph Agent │
│ query_analyzer → query_expander → │
│ retriever → reranker → │
│ context_assembler → generator → │
│ faithfulness_check (self-correction loop) │
└───────────────────────────────────────────────────────┘
│
├─── FastAPI (app/api.py) → http://localhost:8000
└─── Streamlit (app/ui.py) → http://localhost:8501
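
The self-correction loop at the end of the agent can be pictured roughly as follows. This is a minimal plain-Python sketch, not the project's LangGraph code: the function name `answer_with_self_correction`, its callable parameters, and the `max_attempts` limit are all hypothetical, and the retry here simply re-runs the same steps, whereas the real graph may reformulate the query or change the retrieval before trying again.

```python
from typing import Callable, List, Tuple


def answer_with_self_correction(
    question: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    is_faithful: Callable[[str, List[str]], bool],
    max_attempts: int = 2,  # assumption: the real retry limit is not documented here
) -> Tuple[str, bool, int]:
    """Run retrieve -> generate -> check, retrying while the check fails.

    Returns (answer, passed_check, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    answer = ""
    for attempt in range(1, max_attempts + 1):
        chunks = retrieve(question)           # stands in for retriever + reranker
        answer = generate(question, chunks)   # stands in for the generator node
        if is_faithful(answer, chunks):       # stands in for faithfulness_check
            return answer, True, attempt
    # Every attempt failed the check: return the last draft, flagged as unverified.
    return answer, False, max_attempts
```

Passing the steps in as callables keeps the sketch independent of any particular LLM client or vector store, so the loop can be exercised with simple stubs.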
To run this project, your machine must have:
- Docker
- Docker Compose
- Google Cloud SDK (gcloud CLI), used for data and snapshot synchronization
- A valid Anthropic (Claude) API key
git clone https://github.com/buenofgustavo/desafio-agentes-nlp.git
cd desafio-agentes-nlp

At the root of the project, there is a .env.example file.
- Rename it to .env
- Open it and add your Anthropic API key:

ANTHROPIC_API_KEY=sk-ant-your-key-here...

mkdir -p qdrant_setup
gcloud storage cp gs://aneel-raw-data/qdrant-snapshot/desafio-agentes-nlp.snapshot qdrant_setup/

docker-compose up -d

curl -v -X PUT 'http://localhost:6333/collections/setor_eletrico/snapshots/recover' \
-H 'Content-Type: application/json' \
-d '{"location": "file:///qdrant/snapshots/desafio-agentes-nlp.snapshot"}'

⏳ Note: This process may take several minutes and might appear to hang.
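
For anyone who prefers to trigger the recovery from Python instead of curl, the same PUT request can be built with the standard library. This is only a sketch: the helper names `build_recover_request` and `recover_snapshot` are hypothetical and not part of this repository, and the one-hour timeout is an arbitrary assumption.

```python
import json
import urllib.request


def build_recover_request(
    base_url: str, collection: str, snapshot_location: str
) -> urllib.request.Request:
    """Build the same PUT request that the curl command above sends."""
    url = f"{base_url.rstrip('/')}/collections/{collection}/snapshots/recover"
    body = json.dumps({"location": snapshot_location}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


def recover_snapshot(request: urllib.request.Request, timeout: float = 3600.0) -> str:
    """Send the request and return the raw response body (requires Qdrant to be up)."""
    # Assumption: a long timeout, since recovery may take several minutes.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling `recover_snapshot(build_recover_request("http://localhost:6333", "setor_eletrico", "file:///qdrant/snapshots/desafio-agentes-nlp.snapshot"))` would then issue the request against the local container.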
To monitor progress:
docker logs -f qdrant_setor_eletrico

docker exec -it rag_api_setor_eletrico python -m src.retrieval.bm25_retriever --rebuild

All processed data, raw documents, and Qdrant snapshots are stored in our GCP (Google Cloud Platform) bucket.
This ensures:
- Fast replication for quick setup
- Up-to-date data as a single source of truth
Useful commands are available in the Makefile:
make sync-data
make sync-processed-json
make sync-qdrant-snapshot
The scripts/ folder contains executable pipeline steps for ingestion, indexing, and setup.
desafio-agentes-nlp/
├── app/
│ ├── api.py # FastAPI — inference and healthcheck endpoints
│ └── ui.py # Streamlit UI for user interaction
├── data/
│ ├── raw/ # Raw documents (PDF, XLSX, etc.)
│ └── processed/ # Processed chunks in JSON format
├── qdrant_setup/ # Qdrant snapshots and configuration files
├── scripts/
│ ├── download_dataset.py # Dataset download (JSONs)
│ ├── run_indexing.py # Indexing and embedding pipeline
│ ├── run_ingestion.py # Document ingestion pipeline
│ ├── run_agent.py # CLI agent execution
│ └── setup_collection.py # Qdrant collection setup
├── src/
│ ├── agent/ # LangGraph agent logic
│ │ ├── graph.py # Graph definition and compilation
│ │ ├── nodes.py # Node implementations
│ │ ├── state.py # Agent state schema
│ │ └── query_expansion.py # Query expansion (HyDE)
│ ├── ai/
│ │ ├── embeddings/ # Vector generation (all-MiniLM-L6-v2)
│ │ └── llm/ # LLM clients (Anthropic, OpenAI, Ollama)
│ ├── core/
│ │ ├── config.py # Environment configuration
│ │ └── models.py # Pydantic models
│ ├── indexing/ # Ingestion, processing, and storage
│ ├── retrieval/ # Search strategies (BM25, semantic, hybrid)
│ └── utils/
| Decision | Rationale |
|---|---|
| Hybrid retrieval (BM25 + dense) with RRF | Combines keyword recall with semantic understanding; robust to score differences |
| Cross-encoder reranking | Higher precision ranking than bi-encoders |
| LangGraph orchestration | Clean state management and multi-step workflows |
| HyDE + query reformulation | Improves recall for ambiguous queries |
| FastAPI + Streamlit separation | Decoupled backend and frontend |
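
To illustrate why RRF is robust to score differences, here is a minimal sketch of Reciprocal Rank Fusion. It uses only each document's rank in every list, never the raw BM25 or cosine scores, so the two retrievers do not need to be on the same scale. The function name `rrf_fuse` is hypothetical, and `k = 60` is the commonly used default, assumed here rather than taken from this project's configuration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (highest first); break ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]    # hypothetical sparse results, best first
dense_hits = ["c", "a", "d"]   # hypothetical dense results, best first
fused = rrf_fuse([bm25_hits, dense_hits])
```

Documents that appear near the top of both lists ("a" and "c" in this toy input) rise above documents that only one retriever found.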
During ANEEL document downloads, the standard requests library was often blocked.
Solution: curl_cffi with sessions
- Browser impersonation: Mimics real browsers to bypass Cloudflare
- Session handling: Maintains cookies and reuses connections (keep-alive), improving performance
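
A download helper along these lines might look like the sketch below. It is not the project's actual downloader: the helper names, the `"chrome"` impersonation target, and the timeout are assumptions. The session is passed in as a parameter so that a single `curl_cffi` session, with its cookies and open connections, can be reused across many files.

```python
from pathlib import Path

IMPERSONATE = "chrome"  # assumption: the browser profile the project uses is not stated


def make_session():
    """Create a curl_cffi session (requires `pip install curl_cffi`)."""
    from curl_cffi import requests  # imported lazily so the module loads without it

    return requests.Session()


def download(session, url: str, dest: Path, timeout: float = 60.0) -> int:
    """Fetch `url` through the shared session and write the body to `dest`.

    Returns the number of bytes written.
    """
    response = session.get(url, impersonate=IMPERSONATE, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(response.content)
    return len(response.content)
```

A typical use would create one session with `make_session()` and call `download(session, url, Path("data/raw") / filename)` for each document.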
- Igor Reis Braziel — braziel@discente.ufg.br
- Gustavo Bueno Ferreira — gustavobueno2@discente.ufg.br
- Breno Machado Barros — breno_machado@discente.ufg.br
