OpenAI-Compatible API Reference

Base URL examples:

  • local: http://127.0.0.1:8000
  • remote: https://your-domain.example

Auth (required for all /v1/* endpoints when API_KEYS is set):

  • Authorization: Bearer <API_KEY>
  • or x-api-key: <API_KEY>
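Either header style can be built with a small helper. The following is a minimal sketch; the function name `auth_headers` and its `style` parameter are illustrative and not part of the gateway.

```python
def auth_headers(api_key: str, style: str = "bearer") -> dict[str, str]:
    """Return the HTTP header that carries the API key.

    `auth_headers` and `style` are hypothetical names used only in this sketch.
    """
    if style == "bearer":
        return {"Authorization": f"Bearer {api_key}"}
    if style == "x-api-key":
        return {"x-api-key": api_key}
    raise ValueError(f"unknown auth style: {style!r}")


# Placeholder key for illustration only.
print(auth_headers("sk-example"))
print(auth_headers("sk-example", style="x-api-key"))
```

The returned dict can be merged into the headers of whichever HTTP client you use.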

Endpoints

Method  Path                  Purpose
GET     /v1/models            List models from the backend
POST    /v1/chat/completions  Chat completions (streaming + non-streaming)
POST    /v1/completions       Legacy text completions (streaming + non-streaming)
POST    /v1/embeddings        Embeddings (where the backend supports them)

GET /v1/models

curl http://127.0.0.1:8000/v1/models \
  -H "Authorization: Bearer $API_KEY"

Response shape matches the backend's OpenAI-compatible model list.
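Assuming the backend returns the usual OpenAI list shape (an object with a `data` array whose entries each carry an `id`), the model IDs can be pulled out as sketched below. The sample payload is hand-written for illustration; real responses may include extra fields.

```python
import json


def model_ids(payload: dict) -> list[str]:
    """Extract model IDs from an OpenAI-style model list."""
    return [entry["id"] for entry in payload.get("data", [])]


# Illustrative response body; actual fields depend on the backend.
sample = json.loads("""
{"object": "list",
 "data": [{"id": "Qwen/Qwen2.5-7B-Instruct", "object": "model"}]}
""")
print(model_ids(sample))
```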

POST /v1/chat/completions

Recommended primary endpoint. Forwards OpenAI chat bodies directly to the backend, so tools, tool_choice, response_format, seed, and any other OpenAI options pass through when the model/runtime supports them.

Non-streaming

curl http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [
      {"role": "system", "content": "You are concise."},
      {"role": "user",   "content": "What is vLLM?"}
    ],
    "temperature": 0.2,
    "max_tokens": 200
  }'

Streaming (Server-Sent Events)

Set "stream": true in the request body. The gateway proxies the SSE stream directly, without buffering.

curl -N http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "stream": true,
    "messages": [{"role": "user", "content": "Stream one short sentence."}]
  }'

The event stream ends with data: [DONE], as in the OpenAI API.
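For clients that read the stream without an SDK, the SSE lines can be decoded roughly as follows. This sketch assumes each `data:` line holds one JSON chat completion chunk with the text under `choices[].delta.content`, as in OpenAI's streaming format; the helper names and the sample lines are made up for illustration.

```python
import json
from typing import Iterable, Iterator


def iter_sse_chunks(lines: Iterable[str]) -> Iterator[dict]:
    """Yield decoded JSON chunks from SSE lines until `data: [DONE]`."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        yield json.loads(data)


def collect_text(lines: Iterable[str]) -> str:
    """Concatenate the content deltas of a chat completion stream."""
    parts = []
    for chunk in iter_sse_chunks(lines):
        for choice in chunk.get("choices", []):
            delta = choice.get("delta") or {}
            parts.append(delta.get("content") or "")
    return "".join(parts)


# Hand-written sample stream; real chunks carry more fields.
sample_lines = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": " world."}}]}',
    "",
    "data: [DONE]",
]
print(collect_text(sample_lines))
```

In practice `lines` would be the decoded line iterator of a streaming HTTP response rather than a list.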

Tool/function calling

Send OpenAI-style tools/tool_choice exactly as you would to OpenAI. The gateway forwards them unchanged. Whether tool calling actually works depends on both the backend runtime (vLLM, SGLang, TGI, and similar runtimes have the best support; llama.cpp support is partial) and the chosen model. See MODELS.md.
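As a sketch of what such a request and its reply look like, the snippet below builds a chat body with one function tool and reads tool calls back from a non-streaming response. The `get_weather` tool and both helper names are hypothetical. The response layout (`choices[0].message.tool_calls`, with `arguments` as a JSON string) follows OpenAI's format and assumes the backend reproduces it.

```python
import json


def build_tool_request(model: str, question: str) -> dict:
    """Build a chat body that offers one function tool to the model."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": [
            {
                "type": "function",
                "function": {
                    "name": "get_weather",  # hypothetical tool for this sketch
                    "description": "Look up the current weather for a city.",
                    "parameters": {
                        "type": "object",
                        "properties": {"city": {"type": "string"}},
                        "required": ["city"],
                    },
                },
            }
        ],
        "tool_choice": "auto",
    }


def tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (name, arguments) pairs from a non-streaming response."""
    message = response["choices"][0]["message"]
    return [
        (call["function"]["name"], json.loads(call["function"]["arguments"]))
        for call in message.get("tool_calls") or []
    ]


body = build_tool_request("Qwen/Qwen2.5-7B-Instruct", "Weather in Oslo?")
print(json.dumps(body, indent=2))
```

If `tool_calls` returns an empty list, the model answered in plain text, which is also what to expect when the runtime or model lacks tool support.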

POST /v1/completions

Legacy prompt-in/completion-out endpoint. Same streaming semantics as chat.

curl http://127.0.0.1:8000/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "prompt": "Finish this sentence: GPU inference is",
    "temperature": 0.2,
    "max_tokens": 64
  }'

POST /v1/embeddings

Works when the selected backend exposes embeddings (Ollama, SGLang, LocalAI, vLLM with an embedding model, llama.cpp with an embedding GGUF). TGI does not expose embeddings; in that case the gateway returns a structured 501 error.

curl http://127.0.0.1:8000/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $API_KEY" \
  -d '{
    "model": "BAAI/bge-small-en-v1.5",
    "input": ["self hosted chat api"]
  }'
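A client may want to handle the 501 case explicitly. The sketch below assumes a successful body follows OpenAI's embeddings shape (`data[].index`, `data[].embedding`) and that the 501 body uses the gateway's error envelope. The helper name, the sample payloads, and the `unsupported` type on the 501 are assumptions for illustration.

```python
def embedding_vectors(status: int, payload: dict) -> list[list[float]]:
    """Return embedding vectors in input order, or raise on a 501."""
    if status == 501:
        # Gateway error envelope: {"error": {"type": ..., "message": ...}}
        raise NotImplementedError(payload["error"]["message"])
    # OpenAI-style body: {"data": [{"index": 0, "embedding": [...]}, ...]}
    ordered = sorted(payload["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in ordered]


# Hand-written sample payloads; real vectors are much longer.
ok_body = {"data": [{"index": 1, "embedding": [0.3, 0.4]},
                    {"index": 0, "embedding": [0.1, 0.2]}]}
unsupported_body = {"error": {"type": "unsupported",
                              "message": "backend has no embeddings"}}

print(embedding_vectors(200, ok_body))
try:
    embedding_vectors(501, unsupported_body)
except NotImplementedError as exc:
    print("embeddings unavailable:", exc)
```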

Python SDK

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="YOUR_API_KEY")

# Non-streaming
resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Explain tensor parallelism briefly."}],
)
print(resp.choices[0].message.content)

# Streaming
with client.chat.completions.stream(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Count to five."}],
) as stream:
    for event in stream:
        if event.type == "content.delta":
            print(event.delta, end="", flush=True)
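The `.stream()` helper may not be available on `client.chat.completions` in older SDK releases. A more widely supported alternative is `create(..., stream=True)`, which yields raw chunks. The sketch below is one way to use it; `join_deltas` and `stream_chat` are illustrative names, and the fake chunks at the end exist only so the accumulation logic can be tried without a running gateway.

```python
from types import SimpleNamespace
from typing import Iterable


def join_deltas(chunks: Iterable) -> str:
    """Concatenate delta.content across chat completion chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:
            continue  # some backends emit chunks with no choices (e.g. usage)
        parts.append(chunk.choices[0].delta.content or "")
    return "".join(parts)


def stream_chat(prompt: str,
                base_url: str = "http://127.0.0.1:8000/v1",
                api_key: str = "YOUR_API_KEY") -> str:
    """Stream one chat completion through the gateway and return its text."""
    from openai import OpenAI  # imported lazily; join_deltas needs no SDK

    client = OpenAI(base_url=base_url, api_key=api_key)
    chunks = client.chat.completions.create(
        model="Qwen/Qwen2.5-7B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    return join_deltas(chunks)


# Offline check with fake chunks shaped like the SDK's chunk objects.
fake = [
    SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=t))])
    for t in ("1, 2", ", 3", None)
]
print(join_deltas(fake))
```

Calling `stream_chat("Count to five.")` requires the `openai` package and a reachable gateway.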

Error envelope

Gateway-originated errors always use:

{ "error": { "type": "<category>", "message": "<human text>" } }

Common types: invalid_request, unsupported, backend_unreachable, backend_timeout, rate_limit_exceeded. Backend errors pass through with the backend's original status code and body.
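A client can read the envelope with a small parser like the one sketched below. Two caveats: passed-through backend errors may happen to use a similar JSON shape, so the envelope alone does not prove an error came from the gateway, and the set of retryable types here is an assumption of this sketch, not gateway policy.

```python
from typing import Optional

# Assumption: these categories are transient and worth retrying.
RETRYABLE = {"backend_unreachable", "backend_timeout", "rate_limit_exceeded"}


def gateway_error(body: object) -> Optional[tuple[str, str]]:
    """Return (type, message) if body matches the error envelope, else None."""
    if not isinstance(body, dict):
        return None
    err = body.get("error")
    if (isinstance(err, dict)
            and isinstance(err.get("type"), str)
            and isinstance(err.get("message"), str)):
        return err["type"], err["message"]
    return None


def should_retry(body: object) -> bool:
    """True when the body is an envelope with a retryable type."""
    parsed = gateway_error(body)
    return parsed is not None and parsed[0] in RETRYABLE


# Hand-written example bodies.
timeout = {"error": {"type": "backend_timeout", "message": "backend timed out"}}
bad_request = {"error": {"type": "invalid_request", "message": "missing model"}}
print(gateway_error(timeout), should_retry(timeout))
print(gateway_error(bad_request), should_retry(bad_request))
```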