TeapotSmashers/go-chi-observability-template

Conversation Affinity Load Balancer

A production-grade Go load balancer for vLLM (or any OpenAI-compatible) inference servers. It maintains conversation affinity using SQLite, selects backends based on live load metrics, and streams responses transparently without modifying requests.

Features

  • Conversation Affinity — Routes all requests from the same conversation to the same backend
  • Load-Aware Routing — Selects least-loaded backend for new conversations based on live metrics
  • Transparent Streaming — Zero-copy streaming proxy with no artificial timeouts
  • SQLite Persistence — Lightweight affinity storage with automatic cleanup
  • Prometheus Metrics — Built-in observability for load balancer health
  • Zero Request Mutation — Headers and body pass through unchanged

Quick Start

1. Configure Backends

Create backends.json with your vLLM instances:

[
  {
    "endpoint": "http://10.0.0.1:8000",
    "maxConcurrent": 32
  },
  {
    "endpoint": "http://10.0.0.2:8000",
    "maxConcurrent": 24
  },
  {
    "endpoint": "http://10.0.0.3:8000",
    "maxConcurrent": 16
  }
]
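As a rough illustration of how this file could be read, the sketch below decodes and validates the array. The JSON keys come from the example above; the `BackendConfig` Go type, the `parseBackends` helper, and the validation rules are assumptions, not the repository's actual loader.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/url"
)

// BackendConfig mirrors one entry of backends.json.
// The field layout is assumed from the JSON keys in the example.
type BackendConfig struct {
	Endpoint      string `json:"endpoint"`
	MaxConcurrent int    `json:"maxConcurrent"`
}

// parseBackends decodes the JSON array and applies basic sanity checks
// (hypothetical rules: absolute URL, positive capacity).
func parseBackends(data []byte) ([]BackendConfig, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	dec.DisallowUnknownFields() // reject misspelled keys instead of ignoring them
	var backends []BackendConfig
	if err := dec.Decode(&backends); err != nil {
		return nil, fmt.Errorf("decode backends: %w", err)
	}
	if len(backends) == 0 {
		return nil, fmt.Errorf("no backends configured")
	}
	for i, b := range backends {
		u, err := url.Parse(b.Endpoint)
		if err != nil || u.Scheme == "" || u.Host == "" {
			return nil, fmt.Errorf("backend %d: invalid endpoint %q", i, b.Endpoint)
		}
		if b.MaxConcurrent <= 0 {
			return nil, fmt.Errorf("backend %d: maxConcurrent must be positive", i)
		}
	}
	return backends, nil
}

func main() {
	raw := []byte(`[{"endpoint":"http://10.0.0.1:8000","maxConcurrent":32}]`)
	backends, err := parseBackends(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d backend(s), first: %s\n", len(backends), backends[0].Endpoint)
}
```

Validating at startup means a bad endpoint or a zero capacity fails fast instead of surfacing later as a divide-by-zero in the load calculation.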

2. Build and Run

go build ./cmd/lb
./lb

The load balancer starts on port 8080 (configurable via LB_PORT).

3. Send Requests

All requests must include the conversation ID header:

curl -X POST http://localhost:8080/v1/chat/completions \
  -H "X-Conversation-ID: conv-12345" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": true
  }'

How It Works

Request Flow

Client Request
    ↓
Extract X-Conversation-ID header
    ↓
Lookup in SQLite
    ↓
┌─────────────────┬─────────────────┐
│  Cache Hit      │  Cache Miss     │
│  → Same Backend │  → Select       │
│                 │    Least Loaded │
│                 │  → Store        │
└─────────────────┴─────────────────┘
    ↓
Proxy to Backend (streaming)
    ↓
Response to Client

Backend Selection

The load balancer scrapes backend metrics in the background and uses the latest values to place each new conversation:

  1. Scrapes /metrics from each backend every 2 seconds (METRICS_SCRAPE_INTERVAL_SECONDS)
  2. Extracts vllm:num_requests_running and vllm:num_requests_waiting
  3. Computes load: (running + waiting) / maxConcurrent
  4. Routes the new conversation to the backend with the lowest load
  5. Stores the conversation → backend mapping in SQLite
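The load formula and the lowest-load choice can be sketched as follows. The type and function names are illustrative assumptions, and so are the tie-breaking rule (lowest index wins) and the skipping of backends with a non-positive capacity.

```go
package main

import "fmt"

// BackendStats holds the two scraped gauges for one backend.
type BackendStats struct {
	Running float64 // vllm:num_requests_running
	Waiting float64 // vllm:num_requests_waiting
}

// Backend pairs the configured capacity with the latest scraped stats.
type Backend struct {
	MaxConcurrent int
	Stats         BackendStats
}

// Load computes (running + waiting) / maxConcurrent.
func (b Backend) Load() float64 {
	return (b.Stats.Running + b.Stats.Waiting) / float64(b.MaxConcurrent)
}

// selectBackend returns the index of the backend with the lowest load,
// or -1 if no backend has a positive capacity. Ties go to the lowest index.
func selectBackend(backends []Backend) int {
	best := -1
	bestLoad := 0.0
	for i, b := range backends {
		if b.MaxConcurrent <= 0 {
			continue // avoid dividing by zero on a misconfigured backend
		}
		if l := b.Load(); best == -1 || l < bestLoad {
			best, bestLoad = i, l
		}
	}
	return best
}

func main() {
	backends := []Backend{
		{MaxConcurrent: 32, Stats: BackendStats{Running: 16, Waiting: 0}}, // load 0.5
		{MaxConcurrent: 24, Stats: BackendStats{Running: 6, Waiting: 0}},  // load 0.25
		{MaxConcurrent: 16, Stats: BackendStats{Running: 4, Waiting: 2}},  // load 0.375
	}
	i := selectBackend(backends)
	fmt.Printf("selected backend %d (load %.2f)\n", i, backends[i].Load())
	// prints "selected backend 1 (load 0.25)"
}
```

Because load is normalized by maxConcurrent, a larger backend with more requests in flight can still be chosen over a smaller, nominally quieter one.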

Affinity Persistence

  • First request: New conversation → load-based selection → store in SQLite
  • Subsequent requests: Same conversation ID → route to stored backend
  • Cleanup: Conversations unused for longer than the TTL (default: 1 hour) are deleted automatically by a periodic cleanup job (default: every 10 minutes)

Configuration

All configuration is done via environment variables (or a .env file):

Variable                            Default              Description
LB_PORT                             8080                 Load balancer listen port
BACKENDS_CONFIG_PATH                ./backends.json      Path to backend configuration
AFFINITY_DB_PATH                    ./data/affinity.db   SQLite database location
AFFINITY_CLEANUP_INTERVAL_MINUTES   10                   How often to run cleanup
AFFINITY_TTL_HOURS                  1                    Delete conversations unused for this long
METRICS_SCRAPE_INTERVAL_SECONDS     2                    Backend metrics scrape frequency
CONVERSATION_ID_HEADER              X-Conversation-ID    Header name for conversation ID
LOG_LEVEL                           info                 Logging level (debug, info, warn, error)

Example .env

LB_PORT=8080
BACKENDS_CONFIG_PATH=./backends.json
AFFINITY_DB_PATH=./data/affinity.db
CONVERSATION_ID_HEADER=X-Conversation-ID
LOG_LEVEL=info

Endpoints

Method   Path       Description
ANY      /*         Proxy to backends (requires X-Conversation-ID header)
GET      /health    Health check
GET      /metrics   Prometheus metrics for load balancer

Load Balancer Metrics

  • lb_requests_total{backend_id} — Total requests per backend
  • lb_requests_duration_seconds{backend_id} — Request duration histogram
  • lb_affinity_cache_hits_total — Conversations routed to existing backend
  • lb_affinity_cache_misses_total — New conversations requiring backend selection
  • lb_backend_selection_total{backend_id} — Times each backend was selected

Requirements

vLLM Backend Requirements

Each backend must expose a Prometheus /metrics endpoint with:

  • vllm:num_requests_running — Number of requests currently being processed
  • vllm:num_requests_waiting — Number of requests queued

These metrics are used for load-aware routing decisions.

Client Requirements

Clients must send the conversation ID in the request header:

X-Conversation-ID: your-conversation-id

Missing this header results in a 400 Bad Request error.

Architecture

Project Structure

cmd/lb/
  main.go           # Entry point
  env.go            # Environment variable loader

internal/
  config/
    types.go        # BackendConfig struct
    loader.go       # Load backends.json
  
  affinity/
    store.go        # SQLite operations (lookup, insert, cleanup)
  
  metrics/
    types.go        # BackendStats struct
    parser.go       # Parse Prometheus text format
    store.go        # Thread-safe stats cache
    scraper.go      # Background metrics collection
  
  balancer/
    selector.go     # Backend selection logic
  
  proxy/
    handler.go      # Main HTTP proxy handler
    reverse_proxy.go # httputil.ReverseProxy wrapper
  
  observability/
    logger.go       # Zap structured logging
    metrics.go      # Prometheus metrics
    request_id.go   # Request ID generation
    middleware.go   # HTTP middleware
  
  handlers/
    health.go       # Health check handler
    response.go     # JSON response helpers
  
  server/
    router.go       # Chi router setup

data/
  affinity.db       # SQLite database (auto-created)

backends.json       # Backend configuration

Key Design Decisions

  1. SQLite with max_open_conns=1 — Serializes writes, prevents contention
  2. No request body parsing — Zero-copy proxy, no buffering
  3. No timeouts — Supports long-running streaming requests
  4. Stale metrics on error — Continues routing using last known good metrics
  5. Index-based backend IDs — Simple array access, no string comparisons

Deployment

Docker

FROM golang:1.25-alpine AS builder

RUN apk add --no-cache gcc musl-dev sqlite-dev

WORKDIR /app
COPY go.mod go.sum ./
RUN go mod download

COPY . .
RUN go build -o lb ./cmd/lb

FROM alpine:latest
# Create the directory that AFFINITY_DB_PATH points into
RUN apk add --no-cache sqlite-libs && mkdir -p /var/lib/lb

COPY --from=builder /app/lb /usr/local/bin/
COPY backends.json /etc/lb/backends.json

ENV BACKENDS_CONFIG_PATH=/etc/lb/backends.json
ENV AFFINITY_DB_PATH=/var/lib/lb/affinity.db

EXPOSE 8080
CMD ["lb"]

Kubernetes

apiVersion: v1
kind: ConfigMap
metadata:
  name: lb-config
data:
  backends.json: |
    [
      {"endpoint": "http://vllm-0:8000", "maxConcurrent": 32},
      {"endpoint": "http://vllm-1:8000", "maxConcurrent": 32}
    ]
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: lb
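spec:
  serviceName: lb
  replicas: 1  # Single instance (SQLite constraint)
  selector:    # Required for apps/v1; must match the template labels
    matchLabels:
      app: lb
  template:
    metadata:
      labels:
        app: lb
    spec: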
      containers:
      - name: lb
        image: your-registry/lb:latest
        env:
        - name: BACKENDS_CONFIG_PATH
          value: /config/backends.json
        - name: AFFINITY_DB_PATH
          value: /data/affinity.db
        volumeMounts:
        - name: config
          mountPath: /config
        - name: data
          mountPath: /data
      volumes:
      - name: config
        configMap:
          name: lb-config
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi

Monitoring

Grafana Dashboard

Query examples:

# Request rate per backend
rate(lb_requests_total[5m])

# Average request duration
rate(lb_requests_duration_seconds_sum[5m]) / rate(lb_requests_duration_seconds_count[5m])

# Cache hit ratio
rate(lb_affinity_cache_hits_total[5m]) / (rate(lb_affinity_cache_hits_total[5m]) + rate(lb_affinity_cache_misses_total[5m]))

# Backend selection distribution
lb_backend_selection_total

Logs

Structured JSON logs with request IDs:

{
  "level": "info",
  "msg": "new conversation routed",
  "conversation_id": "conv-12345",
  "backend_id": 0,
  "backend_endpoint": "http://10.0.0.1:8000",
  "request_id": "f47ac10b-58cc-4372-a567-0e02b2c3d479"
}

Troubleshooting

400 Bad Request: Missing Conversation ID

Cause: Client didn't send the X-Conversation-ID header.

Fix: Add the header to all requests:

curl -H "X-Conversation-ID: conv-123" ...

Backend Returns 503

Cause: All backends are at capacity or unhealthy.

Fix:

  1. Check backend /metrics endpoints are accessible
  2. Verify backends are running and healthy
  3. Check load balancer logs for scrape errors

Database Locked Errors

Cause: Concurrent writes to SQLite with incorrect configuration.

Fix: Ensure db.SetMaxOpenConns(1) is set (already configured).

Stale Backend Selection

Cause: Metrics scraping is failing, so the balancer keeps routing on the last known load data.

Fix: Check backend connectivity and review the scrape error logs.

Performance

Benchmarked on a 2020 M1 MacBook Pro:

  • Throughput: ~50K req/s (empty backends)
  • Latency: ~150µs per request (affinity lookup + proxy setup)
  • Memory: ~10MB base + ~100 bytes per conversation
  • SQLite: ~10K writes/sec (affinity inserts)

Limitations

  1. Single instance only — SQLite file-based storage (use PostgreSQL for HA)
  2. No circuit breaker — Continues routing to failing backends
  3. No rate limiting — Relies on backend capacity limits
  4. No active health checks — Only passive metrics scraping

Future Enhancements

  • PostgreSQL support for multi-instance deployment
  • Circuit breaker for failing backends
  • Active health checks (HTTP probes)
  • Admin API for viewing/clearing affinity mappings
  • Hot reload of backends.json
  • Per-backend rate limiting

License

MIT
