Distributed RAG chat pipeline: CPU workers embed queries, retrieve context from Qdrant, enqueue jobs on Redis Streams, and stream answers from GPU nodes running vLLM back to clients over HTTP. A React UI talks to the worker API; Terraform can stand up the stack on DigitalOcean.
- Architecture
- Repository layout
- Request lifecycle
- Prerequisites
- Quick start (local)
- Configuration
- HTTP API
- Redis protocol
- GPU worker (vLLM)
- Cloud deployment
- Troubleshooting
- Further reading
flowchart LR
subgraph clients [Clients]
Web["React client\n(Vite)"]
end
subgraph cpu [CPU tier]
W["FastAPI worker\n(worker_module)"]
E["Embeddings\nSentence-Transformers"]
W --> E
end
subgraph data [Data plane]
Q[(Qdrant\nRAG + semantic cache)]
R[(Redis\nStreams + Pub/Sub)]
end
subgraph gpu [GPU tier]
V["vLLM consumer\n(vLLM/thunder.py)"]
end
Web -->|"POST /build-context\n(SSE)"| W
W --> Q
W -->|"XADD inference_queue"| R
R -->|"XREADGROUP"| V
V -->|"PUBLISH result:{id}"| R
R --> W
W -->|"token chunks"| Web
- One path serves cache hits from Qdrant only (no GPU).
- The default path runs embed → RAG → enqueue → subscribe to result:{request_id} → stream tokens until is_final.
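The choice between the two paths comes down to two similarity thresholds. The sketch below is illustrative only: the `Hit` class and both function names are hypothetical, the threshold values are the documented defaults for CACHE_SIMILARITY_THRESHOLD and RAG_SCORE_THRESHOLD, and the real worker may structure this differently.

```python
from dataclasses import dataclass
from typing import List, Optional

# Documented defaults; the worker reads these from the environment.
CACHE_SIMILARITY_THRESHOLD = 0.92
RAG_SCORE_THRESHOLD = 0.6


@dataclass
class Hit:
    """Hypothetical stand-in for one vector-search result."""
    text: str
    score: float


def choose_path(best_cache_hit: Optional[Hit]) -> str:
    """Return "cache" when a near-duplicate query exists, else "gpu"."""
    if best_cache_hit is not None and best_cache_hit.score >= CACHE_SIMILARITY_THRESHOLD:
        return "cache"
    return "gpu"


def filter_rag_hits(hits: List[Hit], top_k: int = 5) -> List[Hit]:
    """Keep hits at or above the RAG threshold, best first, at most top_k."""
    kept = [h for h in hits if h.score >= RAG_SCORE_THRESHOLD]
    kept.sort(key=lambda h: h.score, reverse=True)
    return kept[:top_k]
```

The cache threshold is much stricter than the RAG threshold, so a query has to be nearly identical to a stored one before the GPU is skipped.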
| Path | Role |
|---|---|
| worker_module/ | FastAPI orchestrator: embeddings, Qdrant RAG + semantic cache, Redis queue and result fan-out, SSE streaming. |
| client/ | React + Vite chat UI; calls the worker over HTTP. |
| vLLM/ | Async vLLM engine that consumes Redis Stream jobs and publishes token deltas to Redis Pub/Sub. |
| terraform/ | DigitalOcean infrastructure (workers, Redis, load balancer, VPC, optional autoscale notes). |
- Client sends POST /build-context with { "text": "..." }.
- Worker embeds the text (thread pool + sentence-transformers).
- Semantic cache (Qdrant collection CACHE_COLLECTION): if a near-duplicate query exists, the worker streams the cached answer and skips GPU work.
- RAG: vector search in QDRANT_COLLECTION returns top-k chunks with scores ≥ RAG_SCORE_THRESHOLD.
- Dispatch: job JSON is XADD'd to the Redis stream (REDIS_STREAM_KEY, default inference_queue). The worker subscribes to result:{request_id} before enqueueing so no early tokens are missed.
- GPU consumer (thunder.py) reads the stream with a consumer group, runs vLLM, and PUBLISHes JSON messages to result:{request_id}.
- Worker forwards each message as Server-Sent Events (text/event-stream) to the browser. After the final chunk it may write the full answer into the semantic cache collection.
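Subscribing before enqueueing matters because Pub/Sub delivers only to subscribers that are already listening; anything published earlier is lost. The sketch below shows that ordering with a toy in-memory bus rather than a real Redis client. `InMemoryPubSub`, `handle_request`, and `demo` are hypothetical names, and applying the timeout per message is an assumption about how RESULT_TIMEOUT_SEC is used.

```python
import asyncio
from collections import defaultdict


class InMemoryPubSub:
    """Toy stand-in for Redis Pub/Sub: only current subscribers get messages."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, channel: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._subscribers[channel].append(queue)
        return queue

    def publish(self, channel: str, message: dict) -> int:
        # Like Redis PUBLISH, report how many subscribers received the message.
        receivers = self._subscribers[channel]
        for queue in receivers:
            queue.put_nowait(message)
        return len(receivers)


async def handle_request(bus, request_id, enqueue, timeout_sec=60.0):
    channel = f"result:{request_id}"
    inbox = bus.subscribe(channel)  # 1. subscribe first
    enqueue(request_id)             # 2. only then hand the job to the GPU tier
    chunks = []
    while True:
        message = await asyncio.wait_for(inbox.get(), timeout=timeout_sec)
        chunks.append(message["chunk"])
        if message["is_final"]:
            return chunks


def demo():
    bus = InMemoryPubSub()

    def enqueue(request_id):
        # Simulates a very fast consumer that answers immediately.
        channel = f"result:{request_id}"
        bus.publish(channel, {"chunk": "Hel", "is_final": False})
        bus.publish(channel, {"chunk": "lo", "is_final": False})
        bus.publish(channel, {"chunk": "", "is_final": True})

    return asyncio.run(handle_request(bus, "abc", enqueue))
```

If the two numbered steps in `handle_request` were swapped, the simulated consumer would publish to a channel with no subscribers and the request would wait until the timeout.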
| Component | Notes |
|---|---|
| Python 3.11+ | Worker Docker image uses 3.11; local runs should match for fewer surprises. |
| Node.js 18+ | For the Vite client. |
| Redis 7 | Streams + Pub/Sub; password optional via URL. |
| Qdrant | Local (localhost:6333) or cloud (QDRANT_URL + QDRANT_API_KEY). Collections must exist (see worker_module/setup_qdrant.py). |
| CUDA + vLLM | Only on machines running vLLM/thunder.py; not required for the CPU worker or static client build. |
| Terraform 1.x | Only if you deploy terraform/. |
Point worker_module/.env at your Qdrant instance (see Configuration), then from worker_module/:
pip install -r requirements.txt
python setup_qdrant.py
This creates the RAG collection (QDRANT_COLLECTION, default my_docs) and the semantic cache collection (CACHE_COLLECTION, default semantic_cache) and seeds sample documents.
From worker_module/:
# Create worker_module/.env with at least Redis and Qdrant settings (see Configuration)
docker compose up --build
Compose starts Redis and the context-builder service on port 8000. Ensure .env contains at least QDRANT_HOST / QDRANT_PORT or QDRANT_URL so the worker can reach Qdrant (Qdrant is not defined in this compose file by default).
On a GPU machine with vLLM installed, from the vLLM/ directory (so config.yaml resolves):
pip install vllm redis pyyaml python-dotenv
export REDIS_URL=redis://localhost:6379   # Windows cmd: set REDIS_URL=redis://localhost:6379
python thunder.py
Edit vLLM/config.yaml for model name, memory utilization, and vLLM engine limits.
From client/:
npm install
Create client/.env:
VITE_API_URL=http://localhost:8000
Then start the dev server:
npm run dev
Open the URL Vite prints (typically http://localhost:5173).
CORS: If the API and dev server use different origins, configure CORS on FastAPI or a Vite dev proxy. See client/README.md.
Production static build + nginx: see client/README.md and client/Dockerfile.
Worker settings are environment variables loaded from worker_module/.env (see worker_module/config.py).
| Variable | Default | Purpose |
|---|---|---|
| EMBEDDING_MODEL | all-MiniLM-L6-v2 | Sentence-transformers model id. |
| THREAD_POOL_SIZE | CPU count | Threads for model.encode. |
| QDRANT_URL | (empty) | Remote Qdrant base URL; if set, used instead of host/port. |
| QDRANT_API_KEY | (empty) | Cloud Qdrant API key. |
| QDRANT_HOST | localhost | Local Qdrant host. |
| QDRANT_PORT | 6333 | Local Qdrant port. |
| QDRANT_COLLECTION | my_docs | RAG collection name. |
| RAG_TOP_K | 5 | Chunks retrieved per query. |
| RAG_SCORE_THRESHOLD | 0.6 | Minimum similarity for RAG hits. |
| CACHE_COLLECTION | semantic_cache | Semantic cache collection. |
| CACHE_SIMILARITY_THRESHOLD | 0.92 | Minimum similarity for cache hit. |
| REDIS_URL | redis://localhost:6379 | Redis connection URL. |
| REDIS_STREAM_KEY | inference_queue | Stream name for GPU jobs (must match thunder.py). |
| REDIS_STREAM_MAXLEN | 10000 | Approximate max stream length. |
| RESULT_TIMEOUT_SEC | 60 | Max wait for GPU chunks per request. |
| MAX_CONCURRENT_PER_PROCESS | 250 | Semaphore for embed/RAG/dispatch phase. |
| HOST / PORT | 0.0.0.0 / 8000 | Uvicorn bind. |
| UVICORN_WORKERS | 1 | Process count in main.py. |
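As a rough illustration of how these defaults and the QDRANT_URL precedence could be applied, here is a minimal sketch using only the standard library. It is not the contents of worker_module/config.py: `WorkerSettings`, `load_settings`, and `qdrant_target` are hypothetical names, only a subset of variables is shown, and the `http://` scheme for the host/port fallback is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class WorkerSettings:
    qdrant_url: str
    qdrant_host: str
    qdrant_port: int
    rag_top_k: int
    rag_score_threshold: float
    cache_similarity_threshold: float
    redis_url: str
    redis_stream_key: str
    result_timeout_sec: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> WorkerSettings:
    """Read settings from env (os.environ by default), using the documented defaults."""
    env = os.environ if env is None else env
    return WorkerSettings(
        qdrant_url=env.get("QDRANT_URL", ""),
        qdrant_host=env.get("QDRANT_HOST", "localhost"),
        qdrant_port=int(env.get("QDRANT_PORT", "6333")),
        rag_top_k=int(env.get("RAG_TOP_K", "5")),
        rag_score_threshold=float(env.get("RAG_SCORE_THRESHOLD", "0.6")),
        cache_similarity_threshold=float(env.get("CACHE_SIMILARITY_THRESHOLD", "0.92")),
        redis_url=env.get("REDIS_URL", "redis://localhost:6379"),
        redis_stream_key=env.get("REDIS_STREAM_KEY", "inference_queue"),
        result_timeout_sec=float(env.get("RESULT_TIMEOUT_SEC", "60")),
    )


def qdrant_target(settings: WorkerSettings) -> str:
    """QDRANT_URL wins when set; otherwise fall back to host and port."""
    if settings.qdrant_url:
        return settings.qdrant_url
    # Assumption: a local instance is reached over plain HTTP.
    return f"http://{settings.qdrant_host}:{settings.qdrant_port}"
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests, for example `load_settings({"RAG_TOP_K": "3"})`.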
GPU script vLLM/thunder.py uses:
| Variable | Default | Purpose |
|---|---|---|
| REDIS_URL | redis://localhost:6379 | Redis for stream + pub/sub. |
| REDIS_KEY | (none) | Redis password if required. |
Model and engine args come from vLLM/config.yaml.
Base URL: worker root (e.g. http://localhost:8000). All JSON bodies use UTF-8.
Health check: returns JSON such as {"status":"ok"}; used for probes and load balancers.
POST /build-context request body (InferRequest):
{ "text": "Your question or prompt" }
Response: Content-Type: text/event-stream (SSE).
Each event line looks like:
data: {"chunk":"...","is_final":false}
Terminal event:
data: {"chunk":"","is_final":true}
data: [DONE]
Cache hits may include "cache_hit": true. Errors may appear as JSON with an error field in the payload (see worker implementation).
Integrators should parse SSE data: lines (strip the data: prefix, then JSON-parse) or use an SSE-aware client.
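A minimal sketch of that parsing approach, assuming the stream has already been split into text lines (for example by an HTTP client's line iterator). The function names are hypothetical, and raising on an error field is one possible policy, not necessarily what the sample UI does.

```python
import json
from typing import Iterable, Iterator


def iter_sse_payloads(lines: Iterable[str]) -> Iterator[dict]:
    """Yield the JSON payload of each data: line, stopping at [DONE]."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data lines
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            return
        yield json.loads(body)


def collect_answer(lines: Iterable[str]) -> str:
    """Concatenate chunks until the final event; raise if the worker reports an error."""
    parts = []
    for payload in iter_sse_payloads(lines):
        if "error" in payload:
            raise RuntimeError(payload["error"])
        parts.append(payload.get("chunk", ""))
        if payload.get("is_final"):
            break
    return "".join(parts)


stream = [
    'data: {"chunk":"Hel","is_final":false}',
    "",
    'data: {"chunk":"lo","is_final":false}',
    "",
    'data: {"chunk":"","is_final":true}',
    "data: [DONE]",
]
answer = collect_answer(stream)  # "Hello"
```

A UI would normally render each chunk as it arrives rather than joining them at the end; `iter_sse_payloads` can be consumed incrementally for that.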
| Mechanism | Key / pattern | Payload |
|---|---|---|
| Stream | inference_queue (configurable via REDIS_STREAM_KEY) | Field data: JSON with request_id, prompt, context (list of chunk dicts), embedding, metadata. |
| Consumer group | Group gpu_workers (in thunder.py) | Used by vLLM workers; create is idempotent. |
| Pub/Sub | result:{request_id} | JSON: chunk, is_final, optional error. |
Stream entries are acknowledged (XACK) by the GPU consumer after a successful final generation.
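The message shapes in the table can be sketched with plain JSON helpers. This is an illustration only: the function names are hypothetical, generating request_id with `uuid4` is an assumption, and the keys inside each context chunk dict are not specified by this document.

```python
import json
import uuid
from typing import Dict, List, Optional, Tuple


def build_stream_fields(
    prompt: str,
    context: List[dict],
    embedding: List[float],
    metadata: Optional[dict] = None,
    request_id: Optional[str] = None,
) -> Tuple[str, Dict[str, str]]:
    """Return (request_id, fields), where fields is the mapping passed to XADD."""
    request_id = request_id or uuid.uuid4().hex  # assumption: any unique id works
    job = {
        "request_id": request_id,
        "prompt": prompt,
        "context": context,
        "embedding": embedding,
        "metadata": metadata or {},
    }
    # The stream entry carries a single field, data, holding the job as JSON.
    return request_id, {"data": json.dumps(job)}


def result_channel(request_id: str) -> str:
    """Pub/Sub channel on which the GPU consumer publishes chunks for one request."""
    return f"result:{request_id}"


def encode_result(chunk: str, is_final: bool, error: Optional[str] = None) -> str:
    """Serialize one Pub/Sub message; error is included only when present."""
    message = {"chunk": chunk, "is_final": is_final}
    if error is not None:
        message["error"] = error
    return json.dumps(message)
```

With a Redis client, the fields mapping would be the second argument to XADD on the stream named by REDIS_STREAM_KEY, and `encode_result(...)` the payload of PUBLISH on `result_channel(request_id)`.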
vLLM/thunder.py builds a prompt from RAG context chunks and the user prompt, runs AsyncLLMEngine.generate, and publishes incremental text to result:{request_id}.
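The two text-handling steps described here can be sketched without vLLM. The prompt template and the `text` key on each chunk dict are assumptions, not the actual format in thunder.py. The delta helper assumes the engine yields cumulative text on each step, so only the new suffix needs to be published.

```python
from typing import Iterable, Iterator, List


def build_prompt(user_prompt: str, context_chunks: List[dict]) -> str:
    """Join numbered context chunks and the user question into one prompt string."""
    lines = ["Use the context below to answer the question.", "", "Context:"]
    for i, chunk in enumerate(context_chunks, start=1):
        # Assumption: each chunk dict exposes its passage under "text".
        lines.append(f"[{i}] {chunk.get('text', '')}")
    lines += ["", f"Question: {user_prompt}", "Answer:"]
    return "\n".join(lines)


def text_deltas(cumulative_texts: Iterable[str]) -> Iterator[str]:
    """Turn a sequence of growing outputs into the newly added pieces."""
    sent = 0
    for text in cumulative_texts:
        delta = text[sent:]
        sent = len(text)
        if delta:  # skip steps that produced no new characters
            yield delta
```

Each yielded delta would become the chunk of one message on result:{request_id}, followed by a final message with is_final set to true.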
Requirements:
- Working Redis reachable at REDIS_URL.
- config.yaml in the current working directory when launching the script.
- vLLM and a compatible GPU / driver stack for the chosen model in config.yaml.
Keep REDIS_STREAM_KEY in the CPU worker aligned with the stream name consumed in thunder.py (default inference_queue).
For DigitalOcean (droplets, Redis, load balancer, firewall, VPC, scaling notes), see terraform/README.md.
Create terraform/terraform.tfvars with your provider token and sizing; the file is gitignored. After terraform apply, run terraform output to get the load balancer and Redis addresses.
| Symptom | Things to check |
|---|---|
| 503 from Qdrant | Qdrant running; QDRANT_URL / host+port; collections created (setup_qdrant.py). |
| 503 queue / dispatch | Redis up; REDIS_URL correct from worker and GPU host. |
| No tokens / timeout | GPU consumer running; same Redis; stream and consumer group healthy; GPU logs for OOM or model errors. |
| Stream never drains | Consumer group name / consumer process; stuck pending messages (XPENDING). |
| Client shows no text | Response is SSE with a data: prefix — parsers must strip it before JSON.parse. The metrics bar in the sample UI may still display /infer; the implemented route is /build-context. |
| CORS in dev | Align origins or add a Vite proxy / FastAPI CORS middleware. |
| Document | Content |
|---|---|
| client/README.md | Vite scripts, Docker build args, legacy NDJSON notes vs current SSE worker. |
| terraform/README.md | DO resources, outputs, autoscale script outline. |
No root license file is present in this repository; add one if you open-source or distribute the project.