Fortemi supports multiple LLM inference providers for text generation and embeddings. The default is a local Ollama instance. OpenAI-compatible providers (vLLM, LiteLLM, LocalAI, OpenRouter) require no additional code — point the OpenAI backend at their URL.
| Priority | Variable | Example |
|---|---|---|
| 1 | MATRIC_OLLAMA_URL |
http://gpu-server:11434 |
| 2 | OLLAMA_BASE |
http://gpu-server:11434 |
| 3 | OLLAMA_URL |
http://gpu-server:11434 |
| 4 | OLLAMA_HOST |
http://gpu-server:11434 |
| default | (hardcoded) | http://127.0.0.1:11434 |
| Variable | Default | Description |
|---|---|---|
MATRIC_INFERENCE_DEFAULT |
ollama |
Active backend (ollama or openai) |
MATRIC_OLLAMA_URL |
http://127.0.0.1:11434 |
Ollama base URL |
MATRIC_OLLAMA_GENERATION_MODEL |
qwen3.5:27b |
Ollama generation model |
MATRIC_OLLAMA_EMBEDDING_MODEL |
nomic-embed-text |
Ollama embedding model |
MATRIC_OPENAI_URL |
https://api.openai.com/v1 |
OpenAI-compatible base URL |
MATRIC_OPENAI_API_KEY |
— | API key (falls back to OPENAI_API_KEY) |
MATRIC_OPENAI_GENERATION_MODEL |
gpt-4o-mini |
Generation model for OpenAI backend |
MATRIC_OPENAI_EMBEDDING_MODEL |
text-embedding-3-small |
Embedding model for OpenAI backend |
OPENAI_API_KEY |
— | Fallback API key; also enables OpenRouter auto-discovery |
OPENROUTER_API_KEY |
— | Enables OpenRouter provider for per-request model overrides |
LLAMACPP_BASE_URL |
— | Enables llama.cpp provider (e.g. http://localhost:8080) |
LLAMACPP_API_KEY |
— | Optional API key for llama.cpp server |
Legacy variable names (OLLAMA_BASE, OLLAMA_GEN_MODEL, OLLAMA_EMBED_MODEL) remain supported as fallbacks.
TOML configuration takes priority over environment variables.
Default path: ~/.config/matric-memory/inference.toml
[inference]
default = "ollama"
[inference.ollama]
base_url = "http://127.0.0.1:11434"
generation_model = "qwen3.5:27b"
embedding_model = "nomic-embed-text"Environment variable substitution is supported:
[inference.openai]
api_key = "${OPENAI_API_KEY}"No configuration required. Fortemi connects to http://127.0.0.1:11434 by default.
Pull required models before starting:
ollama pull nomic-embed-text
ollama pull qwen3.5:27b# Remote GPU server
MATRIC_OLLAMA_URL=http://gpu-server:11434
# Docker Compose (container name as hostname)
MATRIC_OLLAMA_URL=http://ollama:11434Pitfall: Ollama's default bind is 127.0.0.1. For remote access, start it with:
OLLAMA_HOST=0.0.0.0 ollama serveMATRIC_INFERENCE_DEFAULT=openai
MATRIC_OPENAI_API_KEY=sk-proj-...
# MATRIC_OPENAI_URL defaults to https://api.openai.com/v1
MATRIC_OPENAI_GENERATION_MODEL=gpt-4o-mini
MATRIC_OPENAI_EMBEDDING_MODEL=text-embedding-3-smallSwitching embedding models changes the vector dimension — regenerate all embeddings afterwards (see Troubleshooting).
OpenRouter provides access to cloud models (Anthropic, Google, Meta, etc.) through an OpenAI-compatible API. It supports generation only — embeddings must use Ollama or OpenAI.
# Use as primary backend
MATRIC_INFERENCE_DEFAULT=openai
MATRIC_OPENAI_URL=https://openrouter.ai/api/v1
MATRIC_OPENAI_API_KEY=sk-or-...
MATRIC_OPENAI_GENERATION_MODEL=anthropic/claude-sonnet-4-20250514
# Or set OPENROUTER_API_KEY to enable it for per-request model overrides only
OPENROUTER_API_KEY=sk-or-...With OPENROUTER_API_KEY set, you can route individual operations to OpenRouter using provider-qualified slugs:
curl -X POST http://localhost:3000/api/v1/jobs \
-H "Content-Type: application/json" \
-d '{"note_id": "...", "job_type": "ai_revision", "model_override": "openrouter:anthropic/claude-sonnet-4-20250514"}'Pitfall: OpenRouter does not host embedding models. Configure a separate embedding backend (Ollama or OpenAI) and use [inference.routing] to split traffic.
llama.cpp's HTTP server exposes an OpenAI-compatible API. Point Fortémi at it with LLAMACPP_BASE_URL:
LLAMACPP_BASE_URL=http://localhost:8080
LLAMACPP_API_KEY= # optional; omit if not requiredUse provider-qualified slugs for per-request routing:
curl -X POST http://localhost:3000/api/v1/jobs \
-H "Content-Type: application/json" \
-d '{"note_id": "...", "job_type": "ai_revision", "model_override": "llamacpp:llama-3.2-3b"}'llama.cpp supports generation only — use Ollama or OpenAI for embeddings.
Hot-swap: Change LLAMACPP_BASE_URL at runtime without restarting the server by sending a PUT to the runtime config API (see Runtime Configuration API).
vLLM exposes an OpenAI-compatible API. Use the OpenAI backend pointed at the vLLM server.
MATRIC_INFERENCE_DEFAULT=openai
MATRIC_OPENAI_URL=http://host:8000/v1
MATRIC_OPENAI_API_KEY=token-if-required
MATRIC_OPENAI_GENERATION_MODEL=meta-llama/Llama-3.1-8B-InstructPitfall: vLLM does not serve embedding models by default. Run a separate Ollama instance for embeddings and use operation routing (see below).
LiteLLM is an OpenAI-compatible proxy that can front multiple upstream providers.
MATRIC_INFERENCE_DEFAULT=openai
MATRIC_OPENAI_URL=http://host:4000/v1
MATRIC_OPENAI_API_KEY=sk-litellm-master-key
MATRIC_OPENAI_GENERATION_MODEL=your-model-aliasMATRIC_INFERENCE_DEFAULT=openai
MATRIC_OPENAI_URL=http://host:8080/v1
# No API key required
MATRIC_OPENAI_GENERATION_MODEL=gpt-3.5-turbo
MATRIC_OPENAI_EMBEDDING_MODEL=all-MiniLM-L6-v2Route embeddings and generation to different backends. Useful for keeping embeddings local while using a cloud API for generation quality. Configure via TOML only (not environment variables).
[inference.routing]
embedding = "ollama" # always local
generation = "openai" # cloud for quality
[inference.fallback]
enabled = true
chain = ["openai", "ollama"] # try openai first, fall back to ollama
max_retries = 1
health_check_timeout_secs = 5Every backend named in routing or fallback must have a corresponding [inference.ollama] or [inference.openai] section. See inference-configuration.md for a full hybrid example.
Any generation job accepts a model_override using a provider-qualified slug:
| Slug format | Routes to |
|---|---|
qwen3.5:9b |
Default provider (Ollama) |
ollama:qwen3.5:27b |
Explicit Ollama |
openai:gpt-4o |
OpenAI |
openrouter:anthropic/claude-sonnet-4-20250514 |
OpenRouter |
llamacpp:model-name |
llama.cpp |
curl -X POST http://localhost:3000/api/v1/jobs \
-H "Content-Type: application/json" \
-d '{"note_id": "...", "job_type": "ai_revision", "model_override": "openai:gpt-4o-mini"}'List all available models and provider health: GET /api/v1/models
| Symptom | Fix |
|---|---|
| "Connection refused" (Ollama) | Confirm ollama serve is running. For remote, start with OLLAMA_HOST=0.0.0.0 ollama serve. |
| "Connection refused" (Docker) | Confirm container name matches MATRIC_OLLAMA_URL. |
| "Model not found" (Ollama) | Run ollama pull <model-name>. |
| "Model not found" (cloud) | Verify the model slug matches the provider's API exactly. |
| "Unauthorized" | Set MATRIC_OPENAI_API_KEY. For local servers, use a dummy value: MATRIC_OPENAI_API_KEY=local. |
| Routing/fallback ignored | Every backend in [inference.routing] or [inference.fallback] must have a configured section. Missing sections fail validation at startup. |
| Config file not loading | Verify ~/.config/matric-memory/inference.toml exists. Enable RUST_LOG=matric_inference=debug to trace loading. |
Embeddings from different models have different dimensions and are incompatible. After changing the embedding model or backend, regenerate all embeddings:
curl -X POST http://localhost:3000/api/v1/jobs/batch \
-H "Content-Type: application/json" \
-d '{"job_type": "regenerate_embeddings", "scope": "all"}'- inference-backends.md — API endpoints, streaming, model profiles
- inference-configuration.md — full TOML reference, routing, fallback
- embedding-model-selection.md — choosing the right model
- ollama-optimization.md — GPU tuning, context window, parallelism