Awesome Model-Agnostic LLM

A curated list of tools, frameworks, and resources for running large language models without locking into a single model vendor, cloud, or inference runtime.

Curated by PYXIS3.

Why model-agnostic?

The cost of locking into one model vendor compounds. When a better, cheaper, or faster model ships, you want to switch in minutes, not migrate over a quarter. Model-agnostic tooling makes that switch cheap.

This list catalogues the building blocks: serving runtimes, routing layers, evaluation harnesses, observability stacks, and standards that keep your stack portable.

Serving runtimes

The runtime that loads weights and serves inference. Model-agnostic means it works with multiple model families, supports an OpenAI-compatible wire format, and doesn't tie you to a vendor.

vLLM: high-throughput inference with PagedAttention. The default for OpenAI-compatible serving on GPU.
TGI (Text Generation Inference): HuggingFace's production-grade serving. Strong on streaming.
Triton Inference Server: NVIDIA's general inference server with TensorRT-LLM backend for LLMs.
SGLang: structured generation language and runtime. RadixAttention for prefix caching.
llama.cpp: CPU, GPU, and Metal inference. The reference for quantised CPU serving.
Ollama: wraps llama.cpp with model-management UX. OpenAI-compatible API.
MLC LLM: universal deployment via Apache TVM. Cross-platform including mobile.
LMDeploy: production-grade with TurboMind backend.
CTranslate2: fast transformer inference, CPU-first.

Inference routers and gateways

Route requests across models, runtimes, or providers. The Funnel layer in a model-agnostic stack.

LiteLLM: unified API across 100+ LLM providers. The reference for client-side routing.
OpenRouter: hosted multi-provider router with OpenAI-compatible API.
Portkey: gateway with fallback, retry, load-balancing, and semantic caching.
Cloudflare AI Gateway: edge-deployed router with caching and rate-limiting.
Helicone: observability gateway, OpenAI-compatible proxy.

Evaluation and benchmarking

Measure honestly with the same prompts, the same metrics, and multiple runtimes or models.

vllm-bench: TTFT and TPOT benchmark for OpenAI-compatible endpoints (this org's tool).
lm-evaluation-harness: EleutherAI's standard for academic benchmarks.
OpenAI evals: eval framework and curated benchmark suite.
lighteval: HuggingFace's eval framework.
Promptfoo: prompt-level evaluation and regression testing.
DeepEval: unit-test framework for LLMs.

Observability and monitoring

What's serving, how fast, at what cost, drifting how. The Lens layer.

lens: Kubernetes-native observability with LLM-endpoint discovery (this org's tool).
Langfuse: tracing and analytics for LLM applications.
Arize Phoenix: open-source LLM and ML observability with evaluation.
OpenLLMetry: OpenTelemetry conventions for LLM observability.
Helicone: logging, analytics, and caching gateway.

Model formats and quantisation

Portable model weights and the tools that compress them.

GGUF: successor to GGML. llama.cpp's format. Mainstream for CPU and Mac inference.
safetensors: safe alternative to pickle. Default for HuggingFace weights.
bitsandbytes: 8-bit and 4-bit quantisation kernels.
GPTQ: post-training quantisation, 4-bit.
AWQ: Activation-aware Weight Quantisation. Better quality than GPTQ at 4-bit.
ExLlamaV2: fast 4-bit and EXL2 quantisation runtime.

Open standards and specifications

The interface standards that make model-agnostic possible.

OpenAI Chat Completions: de facto standard for chat completions API.
OpenAI Embeddings: de facto standard for embedding API.
Anthropic Messages API: increasingly adopted; many gateways translate to and from this.
Model Context Protocol (MCP): open standard for tool and context integration.
ONNX: Open Neural Network Exchange. Cross-runtime model format.

Vector databases and embeddings

The retrieval side of model-agnostic AI.

sentence-transformers: default for open-weight embeddings.
sqlite-vec: SQLite vector search extension. Embeddable.
Qdrant: open-source vector database, Rust.
Milvus: scalable vector database, CNCF graduated.
Chroma: developer-friendly vector store.
pgvector: PostgreSQL vector extension.

Orchestration and deployment

Run LLMs on Kubernetes, cloud, or on-prem.

KServe: Kubernetes-native ML serving. Generic, with LLM support via runtimes.
Seldon Core: MLOps platform with LLM inference path.
BentoML: model serving framework.
Ray Serve: distributed serving on Ray.
KEDA: event-driven autoscaling, essential for scale-to-zero LLM serving.
Knative: serverless Kubernetes. Scale-to-zero primitive.

Cost and token economics

Measure and control LLM spend.

OpenCost: Kubernetes cost monitoring.
Helicone cost dashboards: per-application LLM cost.
Langfuse cost tracking: per-trace cost.

Adjacent reading

PYXIS3 architecture thesis: the operating-model argument behind this category.
Stas Bekman: LLM/VLM training and inference scaling: practical large-scale ML engineering reference.
The Vector Database Cambrian Explosion: context for the retrieval side.

Contributing

PRs welcome. See CONTRIBUTING.md. Add new entries in alphabetical order within each section, with a one-line description.

Maintenance

Supporting documentation lives in docs/, example inputs live in examples/, and lightweight validation notes live in tests/smoke/.

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
.github/workflows		.github/workflows
docs		docs
examples		examples
tests/smoke		tests/smoke
CHANGELOG.md		CHANGELOG.md
CONTRIBUTING.md		CONTRIBUTING.md
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Awesome Model-Agnostic LLM

Why model-agnostic?

Contents

Serving runtimes

Inference routers and gateways

Evaluation and benchmarking

Observability and monitoring

Model formats and quantisation

Open standards and specifications

Vector databases and embeddings

Orchestration and deployment

Cost and token economics

Adjacent reading

Contributing

Maintenance

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Folders and files

Latest commit

History

Repository files navigation

Awesome Model-Agnostic LLM

Why model-agnostic?

Contents

Serving runtimes

Inference routers and gateways

Evaluation and benchmarking

Observability and monitoring

Model formats and quantisation

Open standards and specifications

Vector databases and embeddings

Orchestration and deployment

Cost and token economics

Adjacent reading

Contributing

Maintenance

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Packages