Skip to content

Releases: varad-more/selfhosted-chat-api

v1.0.0 — Multi-backend LLM gateway

22 Apr 20:08

Choose a tag to compare

First stable release.

A self-hosted FastAPI gateway that exposes OpenAI-compatible and
Anthropic Messages-compatible APIs in front of any open-source LLM
runtime on your own hardware.

Highlights

  • Any OSS LLM backend, one env var + one compose profile: vllm,
    ollama, llamacpp, tgi, sglang, localai, lmstudio, or any
    OpenAI-compatible URL.
  • Full API surface: /v1/chat/completions, /v1/completions,
    /v1/embeddings, /v1/models (OpenAI) plus /v1/messages and
    /v1/messages/count_tokens (Anthropic). Streaming works in both
    directions — the gateway translates OpenAI SSE deltas into the
    canonical Anthropic event stream.
  • Production hardening: structured JSON logs with request IDs,
    Prometheus /metrics, /livez + /readyz + /health probes,
    token-bucket rate limiting, CORS, consistent error envelopes, shared
    httpx.AsyncClient with lifespan management.
  • Hardened container: non-root, read-only rootfs, dropped capabilities,
    no-new-privileges, HEALTHCHECK.
  • Tests + CI: 38 pytest tests using httpx.MockTransport, ruff lint,
    Docker build, and compose-profile validation across all backends.
  • Laptop-friendly demo: make demo boots an Ollama + tiny-model stack
    with no GPU required.

Quick start

git clone https://github.com/varad-more/selfhosted-chat-api
cd selfhosted-chat-api
make demo                    # CPU-only, laptop-friendly
# or
make env-vllm && make up BACKEND=vllm   # GPU host with vLLM

Then point any OpenAI or Anthropic SDK at http://127.0.0.1:8000/v1.

Docs

  • README.md — overview, architecture, reproducibility matrix, peer-sharing guide
  • docs/BACKENDS.md — per-backend launch flags and quirks
  • docs/MODELS.md — curated open-source model catalog and GPU sizing
  • docs/API_OPENAI.md / docs/API_CLAUDE.md — endpoint reference
  • docs/DEPLOYMENT.md / docs/OPERATIONS.md — day-1 and day-2

License

MIT.