A production-grade Go load balancer for vLLM (or any OpenAI-compatible) inference servers. It maintains conversation affinity using SQLite, selects backends based on live load metrics, and streams responses transparently without modifying requests.
- ✅ Conversation Affinity — Routes all requests from the same conversation to the same backend
- ✅ Load-Aware Routing — Selects least-loaded backend for new conversations based on live metrics
- ✅ Transparent Streaming — Zero-copy streaming proxy with no artificial timeouts
- ✅ SQLite Persistence — Lightweight affinity storage with automatic cleanup
- ✅ Prometheus Metrics — Built-in observability for load balancer health
- ✅ Zero Request Mutation — Headers and body pass through unchanged
Create backends.json with your vLLM instances:
[
  {
    "endpoint": "http://10.0.0.1:8000",
    "maxConcurrent": 32
  },
  {
    "endpoint": "http://10.0.0.2:8000",
    "maxConcurrent": 24
  },
  {
    "endpoint": "http://10.0.0.3:8000",
    "maxConcurrent": 16
  }
]
Build and run:
go build ./cmd/lb
./lb
The load balancer starts on port 8080 (configurable via LB_PORT).
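To make the backends.json format concrete, here is a minimal sketch of how such a file could be decoded and validated. The struct name BackendConfig appears in the project layout, but the field names, JSON tags, the parseBackends helper, and the validation rules are assumptions inferred from the example file, not the project's actual loader.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/url"
)

// BackendConfig mirrors one entry of backends.json. Field names and tags are
// inferred from the JSON keys shown in the example (an assumption).
type BackendConfig struct {
	Endpoint      string `json:"endpoint"`
	MaxConcurrent int    `json:"maxConcurrent"`
}

// parseBackends (hypothetical helper) decodes the JSON array and rejects
// entries with an unusable URL or a non-positive maxConcurrent, since the
// latter would break the load formula.
func parseBackends(data []byte) ([]BackendConfig, error) {
	var backends []BackendConfig
	if err := json.Unmarshal(data, &backends); err != nil {
		return nil, fmt.Errorf("decode backends: %w", err)
	}
	if len(backends) == 0 {
		return nil, fmt.Errorf("no backends configured")
	}
	for i, b := range backends {
		u, err := url.Parse(b.Endpoint)
		if err != nil || u.Scheme == "" || u.Host == "" {
			return nil, fmt.Errorf("backend %d: invalid endpoint %q", i, b.Endpoint)
		}
		if b.MaxConcurrent <= 0 {
			return nil, fmt.Errorf("backend %d: maxConcurrent must be positive", i)
		}
	}
	return backends, nil
}

func main() {
	raw := []byte(`[
	  {"endpoint": "http://10.0.0.1:8000", "maxConcurrent": 32},
	  {"endpoint": "http://10.0.0.2:8000", "maxConcurrent": 24}
	]`)
	backends, err := parseBackends(raw)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	for i, b := range backends {
		fmt.Printf("backend %d: %s (max %d)\n", i, b.Endpoint, b.MaxConcurrent)
	}
}
```

The slice index doubles as the backend ID here, which matches the index-based IDs the project describes.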
All requests must include the conversation ID header:
curl -X POST http://localhost:8080/v1/chat/completions \
-H "X-Conversation-ID: conv-12345" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": true
}'
Client Request
↓
Extract X-Conversation-ID header
↓
Lookup in SQLite
↓
┌─────────────────┬─────────────────┐
│ Cache Hit │ Cache Miss │
│ → Same Backend │ → Select │
│ │ Least Loaded │
│ │ → Store │
└─────────────────┴─────────────────┘
↓
Proxy to Backend (streaming)
↓
Response to Client
For new conversations, the load balancer:
- Scrapes /metrics from each backend every 2 seconds
- Extracts vllm:num_requests_running and vllm:num_requests_waiting
- Computes load: (running + waiting) / maxConcurrent
- Routes to the backend with the lowest load
- Stores the conversation → backend mapping in SQLite
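The load formula and least-loaded selection described above can be sketched as follows. The Backend type, its field names, and the tie-breaking rule (lowest index wins) are illustrative assumptions; the project's selector may differ.

```go
package main

import (
	"fmt"
	"math"
)

// Backend pairs a configured capacity with the most recently scraped stats.
// The type and field names are hypothetical.
type Backend struct {
	Endpoint      string
	MaxConcurrent int
	Running       float64 // vllm:num_requests_running
	Waiting       float64 // vllm:num_requests_waiting
}

// load returns (running + waiting) / maxConcurrent. A backend with no
// capacity is treated as infinitely loaded so it is chosen only as a last resort.
func (b Backend) load() float64 {
	if b.MaxConcurrent <= 0 {
		return math.Inf(1)
	}
	return (b.Running + b.Waiting) / float64(b.MaxConcurrent)
}

// selectLeastLoaded returns the index of the backend with the lowest load,
// or -1 for an empty slice. Ties go to the lowest index.
func selectLeastLoaded(backends []Backend) int {
	best := -1
	bestLoad := math.Inf(1)
	for i, b := range backends {
		if l := b.load(); best == -1 || l < bestLoad {
			best, bestLoad = i, l
		}
	}
	return best
}

func main() {
	backends := []Backend{
		{"http://10.0.0.1:8000", 32, 16, 0}, // load 0.50
		{"http://10.0.0.2:8000", 24, 6, 0},  // load 0.25
		{"http://10.0.0.3:8000", 16, 6, 2},  // load 0.50
	}
	i := selectLeastLoaded(backends)
	fmt.Println(i, backends[i].Endpoint) // prints: 1 http://10.0.0.2:8000
}
```

Dividing by maxConcurrent normalizes across heterogeneous backends: 6 in-flight requests are a light load for a backend sized for 24 but a heavier one for a backend sized for 16.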
- First request: New conversation → load-based selection → store in SQLite
- Subsequent requests: Same conversation ID → route to stored backend
- Cleanup: Conversations unused for longer than the TTL (default: 1 hour) are deleted automatically
All configuration via environment variables (or .env file):
| Variable | Default | Description |
|---|---|---|
| LB_PORT | 8080 | Load balancer listen port |
| BACKENDS_CONFIG_PATH | ./backends.json | Path to backend configuration |
| AFFINITY_DB_PATH | ./data/affinity.db | SQLite database location |
| AFFINITY_CLEANUP_INTERVAL_MINUTES | 10 | How often to run cleanup |
| AFFINITY_TTL_HOURS | 1 | Delete conversations unused for this long |
| METRICS_SCRAPE_INTERVAL_SECONDS | 2 | Backend metrics scrape frequency |
| CONVERSATION_ID_HEADER | X-Conversation-ID | Header name for conversation ID |
| LOG_LEVEL | info | Logging level (debug, info, warn, error) |
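As an illustration of how the defaults in this table might be applied, the sketch below reads a variable and falls back to the default when it is unset or invalid. The helper names and the fallback-on-invalid behavior are assumptions; the project's env.go may instead fail fast on bad values.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// lookupFunc matches the signature of os.LookupEnv so tests can inject a fake.
type lookupFunc func(key string) (string, bool)

// stringFromEnv (hypothetical) returns the variable's value, or fallback when
// it is unset or empty.
func stringFromEnv(lookup lookupFunc, key, fallback string) string {
	if v, ok := lookup(key); ok && strings.TrimSpace(v) != "" {
		return v
	}
	return fallback
}

// intFromEnv (hypothetical) returns the variable parsed as a positive integer,
// or fallback when it is unset, non-numeric, or not positive.
func intFromEnv(lookup lookupFunc, key string, fallback int) int {
	raw, ok := lookup(key)
	if !ok {
		return fallback
	}
	n, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || n <= 0 {
		return fallback
	}
	return n
}

func main() {
	port := intFromEnv(os.LookupEnv, "LB_PORT", 8080)
	ttlHours := intFromEnv(os.LookupEnv, "AFFINITY_TTL_HOURS", 1)
	header := stringFromEnv(os.LookupEnv, "CONVERSATION_ID_HEADER", "X-Conversation-ID")
	fmt.Println(port, ttlHours, header)
}
```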
LB_PORT=8080
BACKENDS_CONFIG_PATH=./backends.json
AFFINITY_DB_PATH=./data/affinity.db
CONVERSATION_ID_HEADER=X-Conversation-ID
LOG_LEVEL=info
| Method | Path | Description |
|---|---|---|
| ANY | /* | Proxy to backends (requires X-Conversation-ID header) |
| GET | /health | Health check |
| GET | /metrics | Prometheus metrics for load balancer |
- lb_requests_total{backend_id} — Total requests per backend
- lb_requests_duration_seconds{backend_id} — Request duration histogram
- lb_affinity_cache_hits_total — Conversations routed to existing backend
- lb_affinity_cache_misses_total — New conversations requiring backend selection
- lb_backend_selection_total{backend_id} — Times each backend was selected
Each backend must expose a Prometheus /metrics endpoint with:
- vllm:num_requests_running — Number of requests currently being processed
- vllm:num_requests_waiting — Number of requests queued
These metrics are used for load-aware routing decisions.
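For illustration, the two gauges can be pulled out of a Prometheus text-format response with a few lines of string handling. This is a simplified sketch, not the project's parser: the function name is hypothetical, it sums samples across label sets (an assumption for backends that report one series per model), and it ignores escaping rules that a full parser would handle.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

const (
	runningMetric = "vllm:num_requests_running"
	waitingMetric = "vllm:num_requests_waiting"
)

// parseLoadMetrics (hypothetical) scans Prometheus text-format output and
// returns the summed values of the two gauges used for routing.
func parseLoadMetrics(text string) (running, waiting float64, err error) {
	for _, line := range strings.Split(text, "\n") {
		line = strings.TrimSpace(line)
		if line == "" || strings.HasPrefix(line, "#") {
			continue // skip blanks and HELP/TYPE comments
		}
		// The metric name ends at the label set or the first whitespace.
		name, rest := line, ""
		if i := strings.IndexAny(line, "{ \t"); i >= 0 {
			name, rest = line[:i], line[i:]
		}
		if name != runningMetric && name != waitingMetric {
			continue
		}
		// Drop the label set, if any; the value is the first field after it.
		if j := strings.LastIndex(rest, "}"); j >= 0 {
			rest = rest[j+1:]
		}
		fields := strings.Fields(rest)
		if len(fields) == 0 {
			return 0, 0, fmt.Errorf("%s: missing value", name)
		}
		v, perr := strconv.ParseFloat(fields[0], 64)
		if perr != nil {
			return 0, 0, fmt.Errorf("%s: %w", name, perr)
		}
		if name == runningMetric {
			running += v
		} else {
			waiting += v
		}
	}
	return running, waiting, nil
}

func main() {
	// Sample payload; the label name and extra metric are made up for the demo.
	sample := `# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 2.0
other_metric{model_name="demo"} 0.41
`
	running, waiting, err := parseLoadMetrics(sample)
	fmt.Println(running, waiting, err) // prints: 3 2 <nil>
}
```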
Clients must send the conversation ID in the request header:
X-Conversation-ID: your-conversation-id
Missing this header results in a 400 Bad Request error.
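A header check of this kind is commonly written as HTTP middleware. The sketch below shows one possible shape using only the standard library; the function names and error message are assumptions, and in the real service the check would apply only to the proxy route, not to /health or /metrics.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// conversationHeader is the default from the configuration table.
const conversationHeader = "X-Conversation-ID"

// requireConversationID (hypothetical) rejects requests that lack the
// conversation header with 400 Bad Request and passes the rest through.
func requireConversationID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get(conversationHeader) == "" {
			http.Error(w, "missing "+conversationHeader+" header", http.StatusBadRequest)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends one in-memory request through the middleware and returns
// the resulting status code. An empty conversationID omits the header.
func statusFor(conversationID string) int {
	h := requireConversationID(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	req := httptest.NewRequest(http.MethodPost, "/v1/chat/completions", nil)
	if conversationID != "" {
		req.Header.Set(conversationHeader, conversationID)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(""))           // prints: 400
	fmt.Println(statusFor("conv-12345")) // prints: 200
}
```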
cmd/lb/
  main.go              # Entry point
  env.go               # Environment variable loader
internal/
  config/
    types.go           # BackendConfig struct
    loader.go          # Load backends.json
  affinity/
    store.go           # SQLite operations (lookup, insert, cleanup)
  metrics/
    types.go           # BackendStats struct
    parser.go          # Parse Prometheus text format
    store.go           # Thread-safe stats cache
    scraper.go         # Background metrics collection
  balancer/
    selector.go        # Backend selection logic
  proxy/
    handler.go         # Main HTTP proxy handler
    reverse_proxy.go   # httputil.ReverseProxy wrapper
  observability/
    logger.go          # Zap structured logging
    metrics.go         # Prometheus metrics
    request_id.go      # Request ID generation
    middleware.go      # HTTP middleware
  handlers/
    health.go          # Health check handler
    response.go        # JSON response helpers
  server/
    router.go          # Chi router setup
data/
  affinity.db          # SQLite database (auto-created)
backends.json          # Backend configuration
- SQLite with max_open_conns=1 — Serializes writes, prevents contention
- No request body parsing — Zero-copy proxy, no buffering
- No timeouts — Supports long-running streaming requests
- Stale metrics on error — Continues routing using last known good metrics
- Index-based backend IDs — Simple array access, no string comparisons
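The streaming behavior in this list can be approximated with the standard library's httputil.ReverseProxy, which the project layout says it wraps. The sketch below is an assumption about how such a wrapper might look, not the project's code: it sets FlushInterval to a negative value so each chunk is flushed to the client as soon as it is written, and sets no timeouts. Note that a stock ReverseProxy still adds X-Forwarded-For and strips hop-by-hop headers, so the real wrapper may configure it differently.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newStreamingProxy (hypothetical) builds a reverse proxy to one backend that
// flushes every write immediately instead of buffering the response.
func newStreamingProxy(backend string) (*httputil.ReverseProxy, error) {
	target, err := url.Parse(backend)
	if err != nil {
		return nil, err
	}
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.FlushInterval = -1 // negative: flush after each write to the client
	return proxy, nil
}

// streamThroughProxy starts a fake streaming backend and a proxy in front of
// it, then returns the body a client receives through the proxy.
func streamThroughProxy() (string, error) {
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/event-stream")
		for _, chunk := range []string{"data: Hel\n\n", "data: lo\n\n"} {
			io.WriteString(w, chunk)
			if f, ok := w.(http.Flusher); ok {
				f.Flush()
			}
		}
	}))
	defer backend.Close()

	proxy, err := newStreamingProxy(backend.URL)
	if err != nil {
		return "", err
	}
	front := httptest.NewServer(proxy)
	defer front.Close()

	resp, err := http.Get(front.URL + "/v1/chat/completions")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := streamThroughProxy()
	fmt.Printf("%q %v\n", body, err)
}
```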
FROM golang:1.25-alpine AS builder
RUN apk add --no-cache gcc musl-dev sqlite-dev
WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download
COPY . .
RUN go build -o lb ./cmd/lb
FROM alpine:latest
RUN apk add --no-cache sqlite-libs
COPY --from=builder /app/lb /usr/local/bin/
COPY backends.json /etc/lb/backends.json
ENV BACKENDS_CONFIG_PATH=/etc/lb/backends.json
ENV AFFINITY_DB_PATH=/var/lib/lb/affinity.db
EXPOSE 8080
CMD ["lb"]
apiVersion: v1
kind: ConfigMap
metadata:
  name: lb-config
data:
  backends.json: |
    [
      {"endpoint": "http://vllm-0:8000", "maxConcurrent": 32},
      {"endpoint": "http://vllm-1:8000", "maxConcurrent": 32}
    ]
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: lb
spec:
  serviceName: lb
  replicas: 1  # Single instance (SQLite constraint)
  selector:    # Required by apps/v1; the app: lb label is an example value
    matchLabels:
      app: lb
  template:
    metadata:
      labels:
        app: lb
    spec:
      containers:
        - name: lb
          image: your-registry/lb:latest
          env:
            - name: BACKENDS_CONFIG_PATH
              value: /config/backends.json
            - name: AFFINITY_DB_PATH
              value: /data/affinity.db
          volumeMounts:
            - name: config
              mountPath: /config
            - name: data
              mountPath: /data
      volumes:
        - name: config
          configMap:
            name: lb-config
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
Query examples:
# Request rate per backend
rate(lb_requests_total[5m])
# Average request duration
rate(lb_requests_duration_seconds_sum[5m]) / rate(lb_requests_duration_seconds_count[5m])
# Cache hit ratio
rate(lb_affinity_cache_hits_total[5m]) / (rate(lb_affinity_cache_hits_total[5m]) + rate(lb_affinity_cache_misses_total[5m]))
# Backend selection distribution
lb_backend_selection_total
Structured JSON logs with request IDs:
{
"level": "info",
"msg": "new conversation routed",
"conversation_id": "conv-12345",
"backend_id": 0,
"backend_endpoint": "http://10.0.0.1:8000",
"request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}
Cause: Client didn't send the X-Conversation-ID header.
Fix: Add the header to all requests:
curl -H "X-Conversation-ID: conv-123" ...
Cause: All backends are at capacity or unhealthy.
Fix:
- Check backend /metrics endpoints are accessible
- Verify backends are running and healthy
- Check load balancer logs for scrape errors
Cause: Concurrent writes to SQLite with incorrect configuration.
Fix: Ensure db.SetMaxOpenConns(1) is set (already configured).
Cause: Metrics scraping failing, using old load data.
Fix: Check backend connectivity, review scrape error logs.
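To show why a failing scrape leads to routing on old load data, here is a minimal sketch of a stats cache that keeps the last known good sample when a scrape returns an error. BackendStats is named in the project layout; its fields, the StatsStore type, and the Record method are assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// BackendStats holds the last successfully scraped values for one backend.
// Field names are hypothetical.
type BackendStats struct {
	Running float64
	Waiting float64
}

// StatsStore is a hypothetical thread-safe cache indexed by backend ID.
type StatsStore struct {
	mu    sync.RWMutex
	stats []BackendStats
}

func NewStatsStore(backends int) *StatsStore {
	return &StatsStore{stats: make([]BackendStats, backends)}
}

// Record stores a fresh sample. If the scrape failed (scrapeErr != nil) the
// previous sample is left in place, so routing continues on stale data.
// It reports whether the cache was updated.
func (s *StatsStore) Record(id int, sample BackendStats, scrapeErr error) bool {
	if scrapeErr != nil {
		return false
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.stats[id] = sample
	return true
}

// Get returns the most recent good sample for a backend.
func (s *StatsStore) Get(id int) BackendStats {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.stats[id]
}

// errScrape stands in for a network or parse failure during scraping.
var errScrape = errors.New("scrape failed")

func main() {
	store := NewStatsStore(2)
	store.Record(0, BackendStats{Running: 4, Waiting: 1}, nil)
	store.Record(0, BackendStats{}, errScrape) // failed scrape: old values kept
	fmt.Println(store.Get(0))                  // prints: {4 1}
}
```

If a backend stays unreachable, its cached load never changes, which is one way routing can drift toward or away from it until scraping recovers.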
Benchmarked on a 2020 M1 MacBook Pro:
- Throughput: ~50K req/s (empty backends)
- Latency: ~150µs per request (affinity lookup + proxy setup)
- Memory: ~10MB base + ~100 bytes per conversation
- SQLite: ~10K writes/sec (affinity inserts)
- Single instance only — SQLite file-based storage (use PostgreSQL for HA)
- No circuit breaker — Continues routing to failing backends
- No rate limiting — Relies on backend capacity limits
- No active health checks — Only passive metrics scraping
- PostgreSQL support for multi-instance deployment
- Circuit breaker for failing backends
- Active health checks (HTTP probes)
- Admin API for viewing/clearing affinity mappings
- Hot reload of backends.json
- Per-backend rate limiting
MIT