The gateway exposes an Anthropic-compatible facade that points any Claude SDK at your open-source model. The translation layer handles:
- system (string or content blocks)
- messages (string or text content blocks; tool_result text is flattened)
- non-streaming requests
- streaming requests (full Anthropic event sequence, translated from OpenAI SSE)
- /v1/messages/count_tokens (heuristic when the backend lacks a tokenizer endpoint)
Base URL matches the gateway, e.g. http://127.0.0.1:8000.
Auth: x-api-key: $API_KEY (Anthropic style) or Authorization: Bearer $API_KEY.
Include anthropic-version: 2023-06-01 — the gateway echoes it back.
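For callers that do not use an SDK, the header rules above can be sketched with the Python standard library. This is a minimal sketch: the helper name build_messages_request and its use_bearer flag are hypothetical, and the request is only built, not sent, so it runs without a gateway.

```python
import json
import urllib.request


def build_messages_request(base_url, api_key, body, use_bearer=False):
    """Build (but do not send) a POST to the facade's /v1/messages endpoint.

    Hypothetical helper for illustration; not part of the gateway.
    """
    headers = {
        "Content-Type": "application/json",
        "anthropic-version": "2023-06-01",
    }
    if use_bearer:
        # OpenAI-style auth header
        headers["Authorization"] = f"Bearer {api_key}"
    else:
        # Anthropic-style auth header
        headers["x-api-key"] = api_key
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


request = build_messages_request(
    "http://127.0.0.1:8000",
    "YOUR_API_KEY",
    {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "max_tokens": 16,
        "messages": [{"role": "user", "content": "hi"}],
    },
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```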
curl http://127.0.0.1:8000/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: $API_KEY" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"max_tokens": 256,
"system": "You are concise.",
"messages": [
{"role": "user", "content": "Explain GPU inference in one sentence."}
]
}'

Response:
{
"id": "msg_...",
"type": "message",
"role": "assistant",
"model": "Qwen/Qwen2.5-7B-Instruct",
"content": [{"type": "text", "text": "..."}],
"stop_reason": "end_turn",
"stop_sequence": null,
"usage": {"input_tokens": 24, "output_tokens": 12}
}

Set "stream": true. The gateway emits the canonical Anthropic event
sequence, translated from the backend's OpenAI SSE:
event: message_start
event: content_block_start
event: content_block_delta (one per chunk)
event: content_block_stop
event: message_delta (with stop_reason, output_tokens)
event: message_stop
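On the client side, the event sequence above can be consumed by tracking the current event name and reading the JSON on each data line. The sketch below is a minimal illustration, assuming the gateway's payloads follow Anthropic's public streaming shapes (a text_delta inside content_block_delta, and stop_reason inside the delta of message_delta). The function name collect_text and the sample lines are hypothetical.

```python
import json


def collect_text(sse_lines):
    """Accumulate streamed text and the stop reason from an iterable of SSE lines."""
    event = None
    parts = []
    stop_reason = None
    for raw in sse_lines:
        line = raw.strip()
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data = json.loads(line[len("data:"):].strip())
            if event == "content_block_delta":
                delta = data.get("delta", {})
                if delta.get("type") == "text_delta":
                    parts.append(delta.get("text", ""))
            elif event == "message_delta":
                stop_reason = data.get("delta", {}).get("stop_reason")
            elif event == "message_stop":
                break
    return "".join(parts), stop_reason


# Hand-written sample stream (assumed payload shapes, not captured gateway output).
sample = [
    'event: message_start',
    'data: {"type": "message_start", "message": {"id": "msg_x", "role": "assistant"}}',
    '',
    'event: content_block_start',
    'data: {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hi"}}',
    '',
    'event: content_block_delta',
    'data: {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": " there"}}',
    '',
    'event: content_block_stop',
    'data: {"type": "content_block_stop", "index": 0}',
    '',
    'event: message_delta',
    'data: {"type": "message_delta", "delta": {"stop_reason": "end_turn"}, "usage": {"output_tokens": 2}}',
    '',
    'event: message_stop',
    'data: {"type": "message_stop"}',
]
text, reason = collect_text(sample)
```

In practice the Anthropic SDK's streaming helpers do this parsing for you; the sketch only shows what each event contributes.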
curl -N http://127.0.0.1:8000/v1/messages \
-H "Content-Type: application/json" \
-H "x-api-key: $API_KEY" \
-H "anthropic-version: 2023-06-01" \
-d '{
"model": "Qwen/Qwen2.5-7B-Instruct",
"max_tokens": 128,
"stream": true,
"messages": [{"role": "user", "content": "Say hi, then stop."}]
}'

/v1/messages/count_tokens returns a heuristic input-token count (roughly 4 characters per token). The backend's real tokenizer is not reachable through the OpenAI HTTP surface, so treat the value as an estimate.
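To make the 4-characters-per-token heuristic concrete, here is a minimal sketch of how such an estimate could be computed. It is not the gateway's implementation: the function names are hypothetical, and the choices to count only system and message text and to round up are assumptions.

```python
import math


def _text_of(content):
    """Return the text of a string or of a list of Anthropic-style text blocks."""
    if isinstance(content, str):
        return content
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")


def estimate_input_tokens(body, chars_per_token=4):
    """Estimate input tokens as total characters / chars_per_token, rounded up.

    Rounding up is an assumption; the gateway may round differently.
    """
    chars = len(_text_of(body.get("system") or ""))
    for message in body.get("messages", []):
        chars += len(_text_of(message.get("content") or ""))
    return math.ceil(chars / chars_per_token)


# "hello" is 5 characters, so 5 / 4 rounds up to 2.
print(estimate_input_tokens({"messages": [{"role": "user", "content": "hello"}]}))  # prints 2
```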
curl http://127.0.0.1:8000/v1/messages/count_tokens \
-H "Content-Type: application/json" \
-H "x-api-key: $API_KEY" \
-d '{
"messages": [{"role": "user", "content": "hello"}]
}'

import anthropic
client = anthropic.Anthropic(
base_url="http://127.0.0.1:8000",
api_key="YOUR_API_KEY",
)
msg = client.messages.create(
model="Qwen/Qwen2.5-7B-Instruct",
max_tokens=256,
messages=[{"role": "user", "content": "Give me a 2-line haiku."}],
)
print(msg.content[0].text)

The translation layer maps fields as follows:
- system (string or content blocks) → OpenAI system message
- message content (string or text blocks) → flattened to string
- stop_sequences → OpenAI stop
- temperature, top_p, max_tokens forwarded as-is
- finish_reason mapping: stop → end_turn, length → max_tokens, tool_calls → tool_use
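The mapping rules above can be sketched as a pair of small functions. This is an illustration under stated assumptions, not the gateway's code: the function names are hypothetical, text blocks are joined with a newline (the actual separator is not documented here), tool_result flattening is left out, and unknown finish_reason values fall back to end_turn.

```python
# finish_reason -> stop_reason, as listed above.
FINISH_REASON_TO_STOP_REASON = {
    "stop": "end_turn",
    "length": "max_tokens",
    "tool_calls": "tool_use",
}


def flatten_content(content):
    """Flatten a string or a list of text blocks into one string.

    Joining with a newline is an assumption made for this sketch.
    """
    if isinstance(content, str):
        return content
    return "\n".join(b["text"] for b in content if b.get("type") == "text")


def to_openai_request(body):
    """Translate an Anthropic-style request body into an OpenAI chat body."""
    messages = []
    if body.get("system"):
        messages.append({"role": "system", "content": flatten_content(body["system"])})
    for message in body["messages"]:
        messages.append(
            {"role": message["role"], "content": flatten_content(message["content"])}
        )
    out = {"model": body["model"], "messages": messages}
    for key in ("temperature", "top_p", "max_tokens"):
        if key in body:
            out[key] = body[key]  # forwarded as-is
    if "stop_sequences" in body:
        out["stop"] = body["stop_sequences"]
    return out


def to_stop_reason(finish_reason):
    """Map an OpenAI finish_reason to an Anthropic stop_reason (fallback is assumed)."""
    return FINISH_REASON_TO_STOP_REASON.get(finish_reason, "end_turn")


translated = to_openai_request({
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "max_tokens": 256,
    "system": "You are concise.",
    "stop_sequences": ["END"],
    "messages": [
        {"role": "user", "content": [{"type": "text", "text": "Explain GPU inference."}]}
    ],
})
```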
The facade is intentionally narrow and does not translate the features listed
below. If you need any of them, send requests directly to /v1/chat/completions
against the backend, which forwards OpenAI tool/vision fields unchanged:
- Anthropic tool_use blocks (OpenAI-style tools work on /v1/chat/completions)
- Anthropic vision/image blocks
- Anthropic PDF/document blocks
- Anthropic prompt caching directives
Use the Anthropic facade when your codebase is already built around
anthropic.Anthropic and you want the same SDK to target your OSS model.
- 400 if model is missing or body is not JSON
- 401 if the API key is missing or invalid
- 501 if the selected backend doesn't support chat or streaming
- backend errors pass through with the original status and body
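
A client can turn these status codes into readable diagnostics. The sketch below is a hypothetical helper, not part of the gateway. Because backend errors pass through unchanged, a 400, 401, or 501 may also originate from the backend, so the message is only the likely gateway-level cause.

```python
# Gateway-level causes, as documented above.
GATEWAY_ERRORS = {
    400: "model is missing or the body is not JSON",
    401: "API key is missing or invalid",
    501: "selected backend does not support chat or streaming",
}


def explain_status(status):
    """Return a short, human-readable explanation for an HTTP status code."""
    if status < 400:
        return "success"
    # Anything not listed is assumed to be a backend error passed through.
    return GATEWAY_ERRORS.get(
        status, "backend error passed through with its original status and body"
    )


for code in (200, 400, 401, 501, 503):
    print(code, explain_status(code))
```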