Base URL examples:
- local: `http://127.0.0.1:8000`
- remote: `https://your-domain.example`

Auth (required for all `/v1/*` endpoints when `API_KEYS` is set):
- `Authorization: Bearer <API_KEY>`
- or `x-api-key: <API_KEY>`

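The two accepted header forms can be sketched in Python as below. This is only an illustration of what a client sends, not gateway code; `build_auth_headers` and its `scheme` parameter are hypothetical names introduced here.

```python
def build_auth_headers(api_key: str, scheme: str = "bearer") -> dict[str, str]:
    """Return request headers carrying the API key in one of the two forms.

    `scheme` is a hypothetical switch for this sketch:
    "bearer" -> Authorization: Bearer <API_KEY>
    "x-api-key" -> x-api-key: <API_KEY>
    """
    if scheme == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    if scheme == "x-api-key":
        return {"x-api-key": api_key}
    raise ValueError(f"unknown scheme: {scheme!r}")


if __name__ == "__main__":
    print(build_auth_headers("YOUR_API_KEY"))
    print(build_auth_headers("YOUR_API_KEY", scheme="x-api-key"))
```

Either dictionary can be passed as the headers of any HTTP client request to a `/v1/*` endpoint.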
| Method | Path | Purpose |
|---|---|---|
| GET | /v1/models | List models from the backend |
| POST | /v1/chat/completions | Chat completions (streaming + non-streaming) |
| POST | /v1/completions | Legacy text completions (streaming + non-streaming) |
| POST | /v1/embeddings | Embeddings (where the backend supports them) |

```bash
curl http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer $API_KEY"
```

The response shape matches the backend's OpenAI-compatible model list.

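As a rough sketch of consuming that response, the snippet below pulls model ids out of an OpenAI-style list body (`{"object": "list", "data": [...]}`). The `model_ids` helper and the sample body are illustrative assumptions; a real backend may include extra fields per model.

```python
import json


def model_ids(body: str) -> list[str]:
    """Extract the `id` of every entry in an OpenAI-style model list."""
    payload = json.loads(body)
    return [model["id"] for model in payload.get("data", [])]


# Hand-written sample body, not captured from a real backend.
sample = json.dumps(
    {
        "object": "list",
        "data": [{"id": "Qwen/Qwen2.5-7B-Instruct", "object": "model"}],
    }
)

if __name__ == "__main__":
    print(model_ids(sample))
```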
`POST /v1/chat/completions` is the recommended primary endpoint. It forwards
OpenAI chat bodies directly to the backend, so `tools`, `tool_choice`,
`response_format`, `seed`, and any other OpenAI options pass through when the
model/runtime supports them.

```bash
curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are concise."},
      {"role": "user", "content": "What is vLLM?"}
    ],
    "temperature": 0.2,
    "max_tokens": 200
  }'
```

To stream the response, set `"stream": true`. The gateway proxies the SSE stream directly (no buffering).

```bash
curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "stream": true,
    "messages": [{"role": "user", "content": "Stream one short sentence."}]
  }'
```

The event stream ends with `data: [DONE]`, per the OpenAI spec.

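For clients that read the stream without an SDK, the following is a minimal sketch of parsing the SSE lines and stopping at `data: [DONE]`. It assumes OpenAI-style chat chunks (`choices[].delta.content`); `iter_content` and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield content deltas from SSE lines until the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return  # end of stream
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample stream, shaped like OpenAI chat completion chunks.
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "",
    "data: [DONE]",
]

if __name__ == "__main__":
    print("".join(iter_content(sample_stream)))
```

In practice the `lines` argument would be the decoded line iterator of whatever HTTP client is in use.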
Send OpenAI-style `tools`/`tool_choice` exactly as you would to OpenAI. The
gateway forwards them unchanged. Whether tool calling actually works depends on
the combination of backend runtime (vLLM/SGLang/TGI/... best, llama.cpp
partial) and the chosen model. See MODELS.md.
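A minimal sketch of what such a request and its response handling might look like is shown below. The `get_weather` tool, both helper names, and the sample response are hypothetical; only the overall `tools`/`tool_calls` layout follows the OpenAI chat format, and whether a given backend and model actually emit tool calls is not guaranteed.

```python
import json


def build_tool_request(model: str, question: str) -> dict:
    """Build a chat request body that offers one function tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool for this sketch
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from a chat completion response.

    In the OpenAI format, `arguments` arrives as a JSON string, so it is
    decoded here. An empty list means the model answered without tools.
    """
    message = response["choices"][0]["message"]
    calls = message.get("tool_calls") or []
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in calls
    ]


if __name__ == "__main__":
    body = build_tool_request("Qwen/Qwen2.5-7B-Instruct", "Weather in Oslo?")
    print(json.dumps(body, indent=2))
```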

`POST /v1/completions` is the legacy prompt-in/prompt-out endpoint. It has the same streaming semantics as chat.

```bash
curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "Finish this sentence: GPU inference is",
    "temperature": 0.2,
    "max_tokens": 64
  }'
```

`POST /v1/embeddings` works when the selected backend exposes embeddings (Ollama, SGLang, LocalAI, vLLM with an embedding model, llama.cpp with an embedding GGUF). TGI does not; in that case the gateway returns a structured 501.

```bash
curl http://127.0.0.1:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "BAAI/bge-small-en-v1.5",
    "input": ["self hosted chat api"]
  }'
```

The same endpoints work with the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_API_KEY")

# Non-streaming
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain tensor parallelism briefly."}],
)
print(resp.choices[0].message.content)

# Streaming
with client.chat.completions.stream(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Count to five."}],
) as stream:
    for event in stream:
        # Print each text fragment as it arrives
        if event.type == "content.delta":
            print(event.delta, end="", flush=True)
```

Gateway-originated errors always use:
```json
{ "error": { "type": "<category>", "message": "<human text>" } }
```

Common types: `invalid_request`, `unsupported`, `backend_unreachable`,
`backend_timeout`, `rate_limit_exceeded`. Backend errors pass through with the
backend's original status code and body.
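
A client can read the error envelope along the lines of the sketch below. The helper names are hypothetical, and the retry classification is an assumption of this example, not something the gateway prescribes. Because backend bodies pass through unchanged, the parser falls back to `"unknown"` when the envelope is missing.

```python
import json

# Assumption for this sketch: these categories are treated as transient.
RETRYABLE_TYPES = {"backend_unreachable", "backend_timeout", "rate_limit_exceeded"}


def parse_error(body: str) -> tuple[str, str]:
    """Return (type, message) from an error body.

    Falls back to ("unknown", body) when the body is not JSON or does not
    carry the {"error": {"type", "message"}} envelope.
    """
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return "unknown", body
    error = payload.get("error") if isinstance(payload, dict) else None
    if isinstance(error, dict) and "type" in error and "message" in error:
        return str(error["type"]), str(error["message"])
    return "unknown", body


def is_retryable(body: str) -> bool:
    """True when the error type is in the assumed transient set."""
    return parse_error(body)[0] in RETRYABLE_TYPES


if __name__ == "__main__":
    sample = '{"error": {"type": "backend_timeout", "message": "backend did not respond"}}'
    print(parse_error(sample), is_retryable(sample))
```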