DistroLLM (distro_llm)

Distributed RAG chat pipeline: CPU workers embed queries, retrieve context from Qdrant, enqueue jobs on Redis Streams, and stream answers from GPU nodes running vLLM back to clients over HTTP. A React UI talks to the worker API; Terraform can stand up the stack on DigitalOcean.

Architecture

flowchart LR
  subgraph clients [Clients]
    Web["React client\n(Vite)"]
  end

  subgraph cpu [CPU tier]
    W["FastAPI worker\n(worker_module)"]
    E["Embeddings\nSentence-Transformers"]
    W --> E
  end

  subgraph data [Data plane]
    Q[(Qdrant\nRAG + semantic cache)]
    R[(Redis\nStreams + Pub/Sub)]
  end

  subgraph gpu [GPU tier]
    V["vLLM consumer\n(vLLM/thunder.py)"]
  end

  Web -->|"POST /build-context\n(SSE)"| W
  W --> Q
  W -->|"XADD inference_queue"| R
  R -->|"XREADGROUP"| V
  V -->|"PUBLISH result:{id}"| R
  R --> W
  W -->|"token chunks"| Web

One path serves cache hits from Qdrant only (no GPU).
The default path runs embed → RAG → enqueue → subscribe to result:{request_id} → stream tokens until is_final.

Repository layout

Path	Role
`worker_module/`	FastAPI orchestrator: embeddings, Qdrant RAG + semantic cache, Redis queue and result fan-out, SSE streaming.
`client/`	React + Vite chat UI; calls the worker over HTTP.
`vLLM/`	Async vLLM engine that consumes Redis Stream jobs and publishes token deltas to Redis Pub/Sub.
`terraform/`	DigitalOcean infrastructure (workers, Redis, load balancer, VPC, optional autoscale notes).

Request lifecycle

Client sends POST /build-context with { "text": "..." }.
Worker embeds the text (thread pool + sentence-transformers).
Semantic cache (Qdrant collection CACHE_COLLECTION): if a near-duplicate query exists, the worker streams the cached answer and skips GPU work.
RAG: vector search in QDRANT_COLLECTION returns top‑k chunks with scores ≥ RAG_SCORE_THRESHOLD.
Dispatch: job JSON is XADD’d to the Redis stream (REDIS_STREAM_KEY, default inference_queue). The worker subscribes to result:{request_id} before enqueueing so no early tokens are missed.
GPU consumer (thunder.py) reads the stream with a consumer group, runs vLLM, and PUBLISHes JSON messages to result:{request_id}.
Worker forwards each message as Server-Sent Events (text/event-stream) to the browser. After the final chunk it may write the full answer into the semantic cache collection.

Prerequisites

Component	Notes
Python 3.11+	Worker Docker image uses 3.11; local runs should match for fewer surprises.
Node.js 18+	For the Vite client.
Redis 7	Streams + Pub/Sub; password optional via URL.
Qdrant	Local (`localhost:6333`) or cloud (`QDRANT_URL` + `QDRANT_API_KEY`). Collections must exist (see `worker_module/setup_qdrant.py`).
CUDA + vLLM	Only on machines running `vLLM/thunder.py`; not required for the CPU worker or static client build.
Terraform 1.x	Only if you deploy `terraform/`.

Quick start (local)

1. Qdrant and indexes

Point worker_module/.env at your Qdrant instance (see Configuration), then from worker_module/:

pip install -r requirements.txt
python setup_qdrant.py

This creates the RAG collection (QDRANT_COLLECTION, default my_docs) and the semantic cache collection (CACHE_COLLECTION, default semantic_cache) and seeds sample documents.

2. Redis + worker (Docker Compose)

From worker_module/:

# Create worker_module/.env with at least Redis and Qdrant settings (see Configuration)
docker compose up --build

Compose starts Redis and the context-builder service on port 8000. Ensure .env contains at least QDRANT_HOST / QDRANT_PORT or QDRANT_URL so the worker can reach Qdrant (Qdrant is not defined in this compose file by default).

3. GPU consumer (optional, separate host)

On a GPU machine with vLLM installed, from the vLLM/ directory (so config.yaml resolves):

pip install vllm redis pyyaml python-dotenv
set REDIS_URL=redis://localhost:6379
python thunder.py

Edit vLLM/config.yaml for model name, memory utilization, and vLLM engine limits.

4. Web client

From client/:

npm install

Create client/.env:

VITE_API_URL=http://localhost:8000

npm run dev

Open the URL Vite prints (typically http://localhost:5173).

CORS: If the API and dev server use different origins, configure CORS on FastAPI or a Vite dev proxy. See client/README.md.

Production static build + nginx: see client/README.md and client/Dockerfile.

Configuration

Worker settings are environment variables loaded from worker_module/.env (see worker_module/config.py).

Variable	Default	Purpose
`EMBEDDING_MODEL`	`all-MiniLM-L6-v2`	Sentence-transformers model id.
`THREAD_POOL_SIZE`	CPU count	Threads for `model.encode`.
`QDRANT_URL`	(empty)	Remote Qdrant base URL; if set, used instead of host/port.
`QDRANT_API_KEY`	(empty)	Cloud Qdrant API key.
`QDRANT_HOST`	`localhost`	Local Qdrant host.
`QDRANT_PORT`	`6333`	Local Qdrant port.
`QDRANT_COLLECTION`	`my_docs`	RAG collection name.
`RAG_TOP_K`	`5`	Chunks retrieved per query.
`RAG_SCORE_THRESHOLD`	`0.6`	Minimum similarity for RAG hits.
`CACHE_COLLECTION`	`semantic_cache`	Semantic cache collection.
`CACHE_SIMILARITY_THRESHOLD`	`0.92`	Minimum similarity for cache hit.
`REDIS_URL`	`redis://localhost:6379`	Redis connection URL.
`REDIS_STREAM_KEY`	`inference_queue`	Stream name for GPU jobs (must match `thunder.py`).
`REDIS_STREAM_MAXLEN`	`10000`	Approximate max stream length.
`RESULT_TIMEOUT_SEC`	`60`	Max wait for GPU chunks per request.
`MAX_CONCURRENT_PER_PROCESS`	`250`	Semaphore for embed/RAG/dispatch phase.
`HOST` / `PORT`	`0.0.0.0` / `8000`	Uvicorn bind.
`UVICORN_WORKERS`	`1`	Process count in `main.py`.

GPU script vLLM/thunder.py uses:

Variable	Default	Purpose
`REDIS_URL`	`redis://localhost:6379`	Redis for stream + pub/sub.
`REDIS_KEY`	(none)	Redis password if required.

Model and engine args come from vLLM/config.yaml.

HTTP API

Base URL: worker root (e.g. http://localhost:8000). All JSON bodies use UTF-8.

`GET /health`

Returns JSON such as {"status":"ok"} — used for probes and load balancers.

`POST /build-context`

Request body (InferRequest):

{ "text": "Your question or prompt" }

Response: Content-Type: text/event-stream (SSE).

Each event line looks like:

data: {"chunk":"...","is_final":false}

Terminal event:

data: {"chunk":"","is_final":true}
data: [DONE]

Cache hits may include "cache_hit": true. Errors may appear as JSON with an error field in the payload (see worker implementation).

Integrators should parse SSE data: lines (strip the data: prefix, then JSON-parse) or use an SSE-aware client.

Redis protocol

Mechanism	Key / pattern	Payload
Stream	`inference_queue` (configurable via `REDIS_STREAM_KEY`)	Field `data`: JSON with `request_id`, `prompt`, `context` (list of chunk dicts), `embedding`, metadata.
Consumer group	Group `gpu_workers` (in `thunder.py`)	Used by vLLM workers; create is idempotent.
Pub/Sub	`result:{request_id}`	JSON: `chunk`, `is_final`, optional `error`.

Stream entries are acknowledged (XACK) by the GPU consumer after a successful final generation.

GPU worker (vLLM)

vLLM/thunder.py builds a prompt from RAG context chunks and the user prompt, runs AsyncLLMEngine.generate, and publishes incremental text to result:{request_id}.

Requirements:

Working Redis reachable at REDIS_URL.
config.yaml in the current working directory when launching the script.
vLLM and a compatible GPU / driver stack for the chosen model in config.yaml.

Keep REDIS_STREAM_KEY in the CPU worker aligned with the stream name consumed in thunder.py (default inference_queue).

Cloud deployment

For DigitalOcean (droplets, Redis, load balancer, firewall, VPC, scaling notes):

Create a terraform/terraform.tfvars with your provider token and sizing; terraform.tfvars is gitignored. Use terraform output for load balancer and Redis addresses after apply.

Troubleshooting

Symptom	Things to check
503 from Qdrant	Qdrant running; `QDRANT_URL` / host+port; collections created (`setup_qdrant.py`).
503 queue / dispatch	Redis up; `REDIS_URL` correct from worker and GPU host.
No tokens / timeout	GPU consumer running; same Redis; stream and consumer group healthy; GPU logs for OOM or model errors.
Stream never drains	Consumer group name / consumer process; stuck pending messages (`XPENDING`).
Client shows no text	Response is SSE with a `data:` prefix — parsers must strip it before `JSON.parse`. The metrics bar in the sample UI may still display `/infer`; the implemented route is `/build-context`.
CORS in dev	Align origins or add a Vite proxy / FastAPI CORS middleware.

License

No root license file is present in this repository; add one if you open-source or distribute the project.

Document	Content
`client/README.md`	Vite scripts, Docker build args, legacy NDJSON notes vs current SSE worker.
`terraform/README.md`	DO resources, outputs, autoscale script outline.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

DistroLLM (distro_llm)

Table of contents

Architecture

Repository layout

Request lifecycle

Prerequisites

Quick start (local)

1. Qdrant and indexes

2. Redis + worker (Docker Compose)

3. GPU consumer (optional, separate host)

4. Web client

Configuration

HTTP API

`GET /health`

`POST /build-context`

Redis protocol

GPU worker (vLLM)

Cloud deployment

Troubleshooting

Further reading

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.vscode		.vscode
client		client
connection_tests		connection_tests
gpu_worker		gpu_worker
terraform		terraform
vLLM		vLLM
worker_module		worker_module
.gitignore		.gitignore
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

DistroLLM (distro_llm)

Table of contents

Architecture

Repository layout

Request lifecycle

Prerequisites

Quick start (local)

1. Qdrant and indexes

2. Redis + worker (Docker Compose)

3. GPU consumer (optional, separate host)

4. Web client

Configuration

HTTP API

GET /health

POST /build-context

Redis protocol

GPU worker (vLLM)

Cloud deployment

Troubleshooting

Further reading

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

`GET /health`

`POST /build-context`

Packages