RAG — Brazilian Electricity Sector

A Retrieval-Augmented Generation (RAG) system applied to technical and regulatory documents from the Brazilian electricity sector (ANEEL). The system combines hybrid search (BM25 + dense embeddings) with a LangGraph-based agent that performs query expansion (HyDE + reformulations), multi-step retrieval, answer generation using Claude, and automatic faithfulness verification. The interface is exposed via FastAPI and Streamlit.

🏗️ Architecture


PDF, XLSX, and other documents
│
▼
┌───────────────────────────────────────────────────────┐
│  Parsing & Indexing                                   │
│  PyMuPDF and others → Chunks → Embeddings → Qdrant    │
│  BM25Retriever (local sparse index)                   │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│  Hybrid Retrieval                                     │
│  BM25 + Dense → RRF Fusion → Reranker                 │
│  (CrossEncoderReranker)                               │
└───────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────┐
│  LangGraph Agent                                      │
│  query_analyzer → query_expander →                    │
│  retriever → reranker →                               │
│  context_assembler → generator →                      │
│  faithfulness_check (self-correction loop)            │
└───────────────────────────────────────────────────────┘
│
├─── FastAPI   (app/api.py) → http://localhost:8000
└─── Streamlit (app/ui.py)  → http://localhost:8501
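
The agent box in the diagram can be read as a linear pipeline with one retry edge coming out of faithfulness_check. The sketch below illustrates that control flow in plain Python rather than with the LangGraph API. The state fields, the stub nodes, and the `max_attempts` limit are all assumptions made for illustration; the real definitions live in src/agent/.

```python
from typing import Callable, TypedDict


class AgentState(TypedDict, total=False):
    # Hypothetical fields; the project's schema is defined in src/agent/state.py.
    question: str
    queries: list[str]
    documents: list[str]
    answer: str
    faithful: bool
    attempts: int


Node = Callable[[AgentState], AgentState]


def run_pipeline(state: AgentState, nodes: list[Node], check: Node,
                 max_attempts: int = 2) -> AgentState:
    """Run the nodes in order, then repeat the pass while the check fails."""
    state = {**state, "attempts": 0}
    while True:
        for node in nodes:
            state = node(state)
        state = check(state)
        state["attempts"] += 1
        # Stop when the answer is grounded or the retry budget is spent.
        if state.get("faithful") or state["attempts"] >= max_attempts:
            return state


# Stub nodes standing in for query_expander, retriever, and generator.
def expand(state: AgentState) -> AgentState:
    question = state["question"]
    return {**state, "queries": [question, question + " (reformulated)"]}


def retrieve(state: AgentState) -> AgentState:
    return {**state, "documents": [f"doc for: {q}" for q in state["queries"]]}


def generate(state: AgentState) -> AgentState:
    # The stub only produces a grounded answer on the second pass,
    # so the retry edge is exercised.
    grounded = state["attempts"] >= 1
    answer = state["documents"][0] if grounded else "unsupported claim"
    return {**state, "answer": answer}


def faithfulness_check(state: AgentState) -> AgentState:
    # Toy check: the answer must appear verbatim in the retrieved documents.
    return {**state, "faithful": state["answer"] in state["documents"]}


result = run_pipeline({"question": "What is ANEEL?"},
                      [expand, retrieve, generate], faithfulness_check)
print(result["attempts"], result["faithful"])
```

In the real agent the check would presumably be an LLM-based judgment rather than a substring test; the point here is only the shape of the loop.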

🧠 Architecture Diagram

See arquitetura.png in the repository root.

⚙️ Prerequisites

To run this project, your machine must have:

Docker
Docker Compose
Google Cloud SDK (gcloud CLI) (for data/snapshot synchronization)
A valid Anthropic (Claude) API key

🚀 Quick Start Guide (Step-by-Step)

Step 0: Clone the Repository

git clone https://github.com/buenofgustavo/desafio-agentes-nlp.git
cd desafio-agentes-nlp

Step 1: Configure the API Key

At the root of the project, there is a .env.example file.

Rename it to .env
Open it and add your Anthropic API key:

ANTHROPIC_API_KEY=sk-ant-your-key-here...
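
Before starting the containers, it can help to confirm that the key was actually filled in. The following is a minimal standalone check, assuming the .env file uses plain KEY=VALUE lines; the helper names are hypothetical and not part of the project, which may load the file differently (for example through Docker Compose).

```python
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def anthropic_key_is_set(env: dict[str, str]) -> bool:
    """True when the key exists and is not the placeholder shown above."""
    key = env.get("ANTHROPIC_API_KEY", "")
    return bool(key) and "your-key-here" not in key


if __name__ == "__main__":
    if anthropic_key_is_set(load_env_file()):
        print("ANTHROPIC_API_KEY looks configured.")
    else:
        print("ANTHROPIC_API_KEY is missing or still the placeholder.")
```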

Step 2: Download the Database Snapshot

mkdir -p qdrant_setup

gcloud storage cp gs://aneel-raw-data/qdrant-snapshot/desafio-agentes-nlp.snapshot qdrant_setup/

Step 3: Start the Infrastructure

docker-compose up -d

Step 4: Restore the Vector Database (Qdrant)

curl -v -X PUT 'http://localhost:6333/collections/setor_eletrico/snapshots/recover' \
-H 'Content-Type: application/json' \
-d '{"location": "file:///qdrant/snapshots/desafio-agentes-nlp.snapshot"}'

⏳ Note: This process may take several minutes and might appear to hang.

To monitor progress:

docker logs -f qdrant_setor_eletrico
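
The same restore call can be issued from Python, which is convenient when scripting the setup. This sketch mirrors the curl command above using only the standard library. The function names and the 30-minute timeout are assumptions, chosen because the restore can take several minutes.

```python
import json
import urllib.request

QDRANT_URL = "http://localhost:6333"
COLLECTION = "setor_eletrico"
SNAPSHOT_LOCATION = "file:///qdrant/snapshots/desafio-agentes-nlp.snapshot"


def build_recover_request(base_url: str = QDRANT_URL,
                          collection: str = COLLECTION,
                          location: str = SNAPSHOT_LOCATION) -> urllib.request.Request:
    """Build the PUT request equivalent to the curl command."""
    body = json.dumps({"location": location}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/collections/{collection}/snapshots/recover",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def recover_snapshot(timeout_seconds: float = 1800.0) -> dict:
    """Send the request and return Qdrant's JSON reply.

    The generous timeout is an assumption, not a documented value.
    """
    request = build_recover_request()
    with urllib.request.urlopen(request, timeout=timeout_seconds) as response:
        return json.load(response)


if __name__ == "__main__":
    print(recover_snapshot())
```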

Step 5 (Optional): Build the BM25 Index

docker exec -it rag_api_setor_eletrico python -m src.retrieval.bm25_retriever --rebuild

Step 6: Access the Application

👉 http://localhost:8501

🗄️ Data Management and Scripts

All processed data, raw documents, and Qdrant snapshots are stored in our GCP (Google Cloud Platform) bucket.

This ensures:

Fast replication for quick setup
Up-to-date data as a single source of truth

Useful commands are available in the Makefile:

make sync-data
make sync-processed-json
make sync-qdrant-snapshot

The scripts/ folder contains executable pipeline steps for ingestion, indexing, and setup.

📁 Project Structure

desafio-agentes-nlp/
├── app/
│   ├── api.py              # FastAPI — inference and healthcheck endpoints
│   └── ui.py               # Streamlit UI for user interaction
├── data/
│   ├── raw/                # Raw documents (PDF, XLSX, etc.)
│   └── processed/          # Processed chunks in JSON format
├── qdrant_setup/           # Qdrant snapshots and configuration files
├── scripts/
│   ├── download_dataset.py # Dataset download (JSONs)
│   ├── run_indexing.py     # Indexing and embedding pipeline
│   ├── run_ingestion.py    # Document ingestion pipeline
│   ├── run_agent.py        # CLI agent execution
│   └── setup_collection.py # Qdrant collection setup
├── src/
│   ├── agent/              # LangGraph agent logic
│   │   ├── graph.py        # Graph definition and compilation
│   │   ├── nodes.py        # Node implementations
│   │   ├── state.py        # Agent state schema
│   │   └── query_expansion.py # Query expansion (HyDE)
│   ├── ai/
│   │   ├── embeddings/     # Vector generation (all-MiniLM-L6-v2)
│   │   └── llm/            # LLM clients (Anthropic, OpenAI, Ollama)
│   ├── core/
│   │   ├── config.py       # Environment configuration
│   │   └── models.py       # Pydantic models
│   ├── indexing/           # Ingestion, processing, and storage
│   ├── retrieval/          # Search strategies (BM25, semantic, hybrid)
│   └── utils/

⚙️ Key Design Decisions

| Decision | Rationale |
|---|---|
| Hybrid retrieval (BM25 + dense) with RRF | Combines keyword recall with semantic understanding; rank-based fusion is robust to the different score scales of the two retrievers |
| Cross-encoder reranking | Higher-precision ranking than bi-encoder similarity alone |
| LangGraph orchestration | Clean state management and multi-step workflows |
| HyDE + query reformulation | Improves recall for ambiguous queries |
| FastAPI + Streamlit separation | Decoupled backend and frontend |
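
To make the RRF rationale concrete, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists. It uses only ranks, never raw scores, which is why BM25 and dense scores do not need to be normalized against each other. The constant `k = 60` is the value commonly used in the literature and is an assumption here; the document IDs are made up, and the project's own fusion code in src/retrieval/ may differ.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]],
                           k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists: each document scores sum(1 / (k + rank))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical results from the sparse and dense retrievers.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Documents found by both retrievers (doc_a and doc_b here) rise above documents found by only one, regardless of how each retriever scaled its scores.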

🕵️‍♂️ Data Extraction

During ANEEL document downloads, the standard requests library was often blocked.

Solution: curl_cffi with sessions

Browser impersonation: Mimics a real browser's TLS and HTTP fingerprint to get past Cloudflare
Session handling: Maintains cookies and reuses connections (keep-alive), improving performance
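
A minimal sketch of this approach is shown below, assuming curl_cffi is installed and that the installed version accepts the generic `"chrome"` impersonation target. The function names, the 60-second timeout, and the file-naming rule are hypothetical; the project's actual download logic lives in scripts/.

```python
from pathlib import Path
from typing import Callable, Iterable


def make_session():
    """Create a curl_cffi session that impersonates Chrome."""
    # Imported lazily so download_all can be exercised with a stand-in session.
    from curl_cffi import requests

    return requests.Session(impersonate="chrome")


def download_all(urls: Iterable[str], dest_dir: str,
                 session_factory: Callable = make_session) -> list[Path]:
    """Download every URL through one session so cookies and connections are reused."""
    target = Path(dest_dir)
    target.mkdir(parents=True, exist_ok=True)
    saved: list[Path] = []
    session = session_factory()
    for url in urls:
        response = session.get(url, timeout=60)
        response.raise_for_status()
        # Name the file after the last path segment of the URL.
        path = target / url.rstrip("/").rsplit("/", 1)[-1]
        path.write_bytes(response.content)
        saved.append(path)
    return saved
```

Creating the session once, outside the loop, is what provides the keep-alive benefit: a new session per request would repeat the TLS handshake every time.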

👨‍💻 Team

Igor Reis Braziel — braziel@discente.ufg.br
Gustavo Bueno Ferreira — gustavobueno2@discente.ufg.br
Breno Machado Barros — breno_machado@discente.ufg.br
