A curated list of tools, frameworks, and resources for running large language models without locking into a single model vendor, cloud, or inference runtime.
Curated by PYXIS3.
The cost of locking into one model vendor compounds. When a better, cheaper, or faster model ships, you want to switch in minutes, not migrate over a quarter. Model-agnostic tooling makes that switch cheap.
This list catalogues the building blocks: serving runtimes, routing layers, evaluation harnesses, observability stacks, and standards that keep your stack portable.
- Serving runtimes
- Inference routers and gateways
- Evaluation and benchmarking
- Observability and monitoring
- Model formats and quantisation
- Open standards and specifications
- Vector databases and embeddings
- Orchestration and deployment
- Cost and token economics
- Adjacent reading
The runtime that loads weights and serves inference. Model-agnostic means it works with multiple model families, supports an OpenAI-compatible wire format, and doesn't tie you to a vendor.
- vLLM: high-throughput inference with PagedAttention. The default for OpenAI-compatible serving on GPU.
- TGI (Text Generation Inference): HuggingFace's production-grade serving. Strong on streaming.
- Triton Inference Server: NVIDIA's general inference server with TensorRT-LLM backend for LLMs.
- SGLang: structured generation language and runtime. RadixAttention for prefix caching.
- llama.cpp: CPU, GPU, and Metal inference. The reference for quantised CPU serving.
- Ollama: wraps llama.cpp with model-management UX. OpenAI-compatible API.
- MLC LLM: universal deployment via Apache TVM. Cross-platform including mobile.
- LMDeploy: production-grade with TurboMind backend.
- CTranslate2: fast transformer inference, CPU-first.
Route requests across models, runtimes, or providers. The Funnel layer in a model-agnostic stack.
- LiteLLM: unified API across 100+ LLM providers. The reference for client-side routing.
- OpenRouter: hosted multi-provider router with OpenAI-compatible API.
- Portkey: gateway with fallback, retry, load-balancing, and semantic caching.
- Cloudflare AI Gateway: edge-deployed router with caching and rate-limiting.
- Helicone: observability gateway, OpenAI-compatible proxy.
Measure honestly with the same prompts, the same metrics, and multiple runtimes or models.
- vllm-bench: TTFT and TPOT benchmark for OpenAI-compatible endpoints (this org's tool).
- lm-evaluation-harness: EleutherAI's standard for academic benchmarks.
- OpenAI evals: eval framework and curated benchmark suite.
- lighteval: HuggingFace's eval framework.
- Promptfoo: prompt-level evaluation and regression testing.
- DeepEval: unit-test framework for LLMs.
What's serving, how fast, at what cost, drifting how. The Lens layer.
- lens: Kubernetes-native observability with LLM-endpoint discovery (this org's tool).
- Langfuse: tracing and analytics for LLM applications.
- Arize Phoenix: open-source LLM and ML observability with evaluation.
- OpenLLMetry: OpenTelemetry conventions for LLM observability.
- Helicone: logging, analytics, and caching gateway.
Portable model weights and the tools that compress them.
- GGUF: successor to GGML. llama.cpp's format. Mainstream for CPU and Mac inference.
- safetensors: safe alternative to pickle. Default for HuggingFace weights.
- bitsandbytes: 8-bit and 4-bit quantisation kernels.
- GPTQ: post-training quantisation, 4-bit.
- AWQ: Activation-aware Weight Quantisation. Better quality than GPTQ at 4-bit.
- ExLlamaV2: fast 4-bit and EXL2 quantisation runtime.
The interface standards that make model-agnostic possible.
- OpenAI Chat Completions: de facto standard for chat completions API.
- OpenAI Embeddings: de facto standard for embedding API.
- Anthropic Messages API: increasingly adopted; many gateways translate to and from this.
- Model Context Protocol (MCP): open standard for tool and context integration.
- ONNX: Open Neural Network Exchange. Cross-runtime model format.
The retrieval side of model-agnostic AI.
- sentence-transformers: default for open-weight embeddings.
- sqlite-vec: SQLite vector search extension. Embeddable.
- Qdrant: open-source vector database, Rust.
- Milvus: scalable vector database, CNCF graduated.
- Chroma: developer-friendly vector store.
- pgvector: PostgreSQL vector extension.
Run LLMs on Kubernetes, cloud, or on-prem.
- KServe: Kubernetes-native ML serving. Generic, with LLM support via runtimes.
- Seldon Core: MLOps platform with LLM inference path.
- BentoML: model serving framework.
- Ray Serve: distributed serving on Ray.
- KEDA: event-driven autoscaling, essential for scale-to-zero LLM serving.
- Knative: serverless Kubernetes. Scale-to-zero primitive.
Measure and control LLM spend.
- OpenCost: Kubernetes cost monitoring.
- Helicone cost dashboards: per-application LLM cost.
- Langfuse cost tracking: per-trace cost.
- PYXIS3 architecture thesis: the operating-model argument behind this category.
- Stas Bekman: LLM/VLM training and inference scaling: practical large-scale ML engineering reference.
- The Vector Database Cambrian Explosion: context for the retrieval side.
PRs welcome. See CONTRIBUTING.md. Add new entries in alphabetical order within each section, with a one-line description.
Supporting documentation lives in docs/, example inputs live in examples/, and lightweight validation notes live in tests/smoke/.