This project implements an advanced ReAct (Reason + Act) agent using FastAPI and LangChain. Unlike standard conversational bots, this agent employs a sophisticated middleware architecture that dynamically manages context summarization, handles model fallbacks, and orchestrates task lists to ensure robust performance.
The service is stateful and memory-aware while it runs, but nothing is persisted. All conversation history and task state live only in RAM (via MemorySaver) and are lost when the server restarts, so no conversation data is written to disk and every restart begins from a clean state, which is convenient during development.
The request processing pipeline consists of the following layers:
- API Entry Point: Receives the user query and session metadata via FastAPI.
- Middleware Layer:
  - Summarization: Analyzes token usage. If the history exceeds 4000 tokens, it compresses the context using a lightweight model (gpt-oss-20b).
  - Todo List: Extracts and manages sub-tasks for complex queries.
- Reasoning Node: The primary LLM (llama-3.3-70b) analyzes the context and decides whether to act or answer.
- Tool Execution Node: Executes external actions (Calculator or Web Search) if requested by the reasoning node.
- Retry & Fallback Layer:
  - Tool Retry: Automatically retries failed tool calls up to 3 times with backoff.
  - Model Fallback: Switches to a larger model (gpt-oss-120b) if the primary model fails.
- Response Formatting: Synthesizes the final answer and usage metadata into a structured JSON.
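The model fallback step in the pipeline above can be illustrated with a small, library-free sketch. It is not the project's middleware: call_with_fallback, flaky_primary, and backup are hypothetical names, and each "model" is assumed to be a plain callable that takes a prompt and returns text.

```python
from typing import Callable, Optional, Sequence


def call_with_fallback(prompt: str, models: Sequence[Callable[[str], str]]) -> str:
    """Try each model callable in order and return the first successful reply."""
    last_error: Optional[Exception] = None
    for model in models:
        try:
            return model(prompt)
        except Exception as exc:  # any failure moves on to the next model
            last_error = exc
    raise RuntimeError("all models failed") from last_error


# Hypothetical stand-ins for the primary and backup models.
def flaky_primary(prompt: str) -> str:
    raise TimeoutError("primary model unavailable")


def backup(prompt: str) -> str:
    return f"backup answered: {prompt}"


reply = call_with_fallback("hello", [flaky_primary, backup])
```

Ordering the list from preferred to last-resort keeps the policy in one place: the caller only sees an error when every model has failed.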
The agent utilizes a chain of middleware components to enhance reliability and context management:
- Summarization Middleware: Automatically prevents context window overflow by summarizing conversation history once it passes a threshold (4000 tokens), retaining only the last 20 messages verbatim.
- Model Fallback Middleware: Provides high availability by seamlessly switching to a backup model (openai/gpt-oss-120b) if the primary inference engine encounters errors.
- Todo List Middleware: Maintains an internal state of pending tasks, allowing the agent to break down complex user requests into manageable steps.
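To make the summarization rule concrete, here is a minimal sketch of the threshold logic in plain Python. The 4000-token threshold and the 20-message tail come from the description above; everything else is an assumption for illustration: messages are plain strings, approx_tokens is a crude character-based estimate rather than a real tokenizer, and summarize is any caller-supplied function (in the real service this would be a call to the lightweight model).

```python
from typing import Callable, List

TOKEN_THRESHOLD = 4000  # from the README: summarize once history passes this
KEEP_LAST = 20          # from the README: messages retained verbatim


def approx_tokens(text: str) -> int:
    """Rough estimate (about 4 characters per token); an assumption, not a tokenizer."""
    return max(1, len(text) // 4)


def compact_history(
    messages: List[str],
    summarize: Callable[[List[str]], str],
) -> List[str]:
    """Replace older messages with one summary once the token budget is exceeded."""
    total = sum(approx_tokens(m) for m in messages)
    if total <= TOKEN_THRESHOLD or len(messages) <= KEEP_LAST:
        return list(messages)  # under budget, or nothing older to compress
    older, recent = messages[:-KEEP_LAST], messages[-KEEP_LAST:]
    return [summarize(older)] + recent


# Hypothetical summarizer that only reports how much it compressed.
def fake_summarize(older: List[str]) -> str:
    return f"summary of {len(older)} messages"


history = ["x" * 1000] * 30  # about 250 tokens each, 7500 in total
compacted = compact_history(history, fake_summarize)
```

Here the 10 oldest messages collapse into a single summary entry and the 20 most recent are kept as they were.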
To bridge the gap between language modeling and factual accuracy, the system integrates specific tools:
- Tavily Search: Used for retrieving real-time information, news, and facts from the web.
- Numexpr Calculator: Used for precise mathematical evaluations, eliminating LLM arithmetic hallucinations.
- Auto-Retry: If a tool fails (e.g., API timeout or syntax error), the ToolRetryMiddleware intercepts the error and retries the operation with exponential backoff.
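The retry behavior can be sketched without any framework. This is an illustration of exponential backoff in general, not the ToolRetryMiddleware implementation: retry_with_backoff is a hypothetical helper, the base delay of 0.5 seconds is an assumed value, and the limit of 3 retries follows the pipeline description.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 3,        # retries after the first attempt
    base_delay: float = 0.5,     # assumed starting delay in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on any exception with delays of base_delay * 2**attempt."""
    attempt = 0
    while True:
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)
            attempt += 1


# Hypothetical tool that fails twice before succeeding.
calls = {"n": 0}


def flaky_tool() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated API timeout")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_tool, sleep=delays.append)
```

Passing sleep as a parameter keeps the helper easy to test, since the waits can be recorded rather than performed.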
Unlike simple text streams, the API enforces strict data contracts:
- Input: Requires a query and a thread_id for session continuity.
- Output: Returns a ResponseFormat object containing the final answer and a list of the tools used during generation.
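As a rough picture of these contracts, the models might look like the Pydantic sketch below. The field names query, thread_id, response, and tool_usage come from this README; the class name ChatRequest, the empty-list default for tool_usage, and the is_valid_request helper are assumptions made for the example and may differ from the project's code.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class ChatRequest(BaseModel):  # hypothetical name for the input model
    query: str
    thread_id: str


class ResponseFormat(BaseModel):
    response: str
    # Assumed default: an empty list when no tool was called.
    tool_usage: List[str] = Field(default_factory=list)


def is_valid_request(payload: dict) -> bool:
    """Return True when the payload satisfies the input contract."""
    try:
        ChatRequest(**payload)
        return True
    except ValidationError:
        return False


ok = is_valid_request({"query": "What is 2 + 2?", "thread_id": "session_001"})
missing_thread = is_valid_request({"query": "What is 2 + 2?"})
answer = ResponseFormat(response="4", tool_usage=["calculator"])
```

A payload without thread_id is rejected before the agent runs, which is what makes session continuity a hard requirement rather than a convention.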
- API Framework: FastAPI
- Orchestration: LangChain, LangGraph
- LLM Inference: Groq Cloud (Llama 3.3 70B, GPT-OSS variants)
- External Search: Tavily AI Search
- Math Engine: Numexpr
- Validation: Pydantic
- Python 3.10 or higher
- API Keys for Google, Groq, and Tavily.
git clone https://github.com/your-username/react-agent-api.git
cd react-agent-api
python -m venv venv
source venv/bin/activate # Linux/Mac
# venv\Scripts\activate # Windows
pip install -r requirements.txt
Create a .env file in the project root with the following variables:
GOOGLE_API_KEY=your_google_api_key_here
GROQ_API_KEY=your_groq_api_key_here
TAVILY_API_KEY=your_tavily_api_key_here
# Optional: LangSmith Tracing (useful for debugging the graph)
LANGCHAIN_TRACING_V2=true
LANGSMITH_ENDPOINT=https://eu.api.smith.langchain.com
LANGCHAIN_API_KEY=your_langchain_api_key_here
LANGCHAIN_PROJECT="ReAct-Agent"
python main.py
The API will start at http://127.0.0.1:8000.
Interactive API documentation (Swagger UI) is available at http://127.0.0.1:8000/docs.
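If the server fails at startup, a missing key is a common cause. The snippet below is a small optional check, not part of the project: missing_keys is a hypothetical helper, and it assumes the variables from the .env file have already been loaded into the process environment by whatever mechanism the project uses.

```python
import os
from typing import List, Mapping, Sequence

# Required variables as listed in the .env example above.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "GROQ_API_KEY", "TAVILY_API_KEY")


def missing_keys(
    env: Mapping[str, str] = os.environ,
    required: Sequence[str] = REQUIRED_KEYS,
) -> List[str]:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not env.get(name)]


# Example with an explicit mapping instead of the real environment.
example = missing_keys({"GROQ_API_KEY": "gsk_example"})
```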
Simple health check endpoint to verify the service status.
- Response: JSON with status, service name, and docs link.
The primary interface for interaction. Triggers the Agent workflow with middleware support.
- JSON Body:
  - query: The user's question.
  - thread_id: A unique string identifier for the user session (e.g., "session_001").
- Response:
  - response: The generated text answer.
  - tool_usage: A list of strings indicating which tools were utilized (e.g., ['tavily_search', 'calculator']).
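A client call could be assembled with the standard library as sketched below. The JSON body fields match the list above; the endpoint path is not stated in this section, so "/chat" is only a placeholder to replace with the route shown in the Swagger UI, and build_chat_request is a hypothetical helper.

```python
import json
import urllib.request


def build_chat_request(url: str, query: str, thread_id: str) -> urllib.request.Request:
    """Build a JSON POST request carrying the query and session identifier."""
    body = json.dumps({"query": query, "thread_id": thread_id}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# "/chat" is a placeholder path, not confirmed by this README.
request = build_chat_request(
    "http://127.0.0.1:8000/chat",
    query="What is 12 * 7?",
    thread_id="session_001",
)

# With the server running, the request can be sent like this:
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)  # expected keys: "response", "tool_usage"
```

Reusing the same thread_id across calls is what lets the agent pick up the earlier conversation for that session.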
The project is currently configured to use Groq for high-speed inference. It uses different models for Chat, Summarization, and Fallback scenarios.
Model definitions are located in app/services/agent.py.
To change the specific Llama or GPT-OSS version, update the model parameter in the get_agent function:
# app/services/agent.py
from langchain_groq import ChatGroq

# 1. Main Chat Model
chat_llm = ChatGroq(
    model="llama-3.3-70b-versatile",  # Change to desired model
    temperature=0,
    api_key=settings.GROQ_API_KEY,
)

# 2. Summarization Model
summarization_llm = ChatGroq(
    model="mixtral-8x7b-32768",  # Example of changing model
    temperature=0,
    api_key=settings.GROQ_API_KEY,
)

Since the project uses LangChain, switching providers requires minimal code changes.
- Install the provider package:
pip install langchain-openai
- Update imports and initialization in app/services/agent.py:
from langchain_openai import ChatOpenAI

# Replace ChatGroq with ChatOpenAI
chat_llm = ChatOpenAI(
    model="gpt-4o",
    temperature=0,
    api_key="your_openai_key",
)

Note: Ensure you update the .env file with the necessary API keys for the new provider.
This application operates in Ephemeral Mode.
- In-Memory MemorySaver: The agent's memory (checkpoints) is initialized using MemorySaver(). Conversations exist only in RAM.
- Server Restart: Upon terminating the process or restarting the server (uvicorn), all conversation history, todo lists, and session data are permanently erased. This ensures no sensitive data persists on the disk.