Skip to content

pyxis3-ai/awesome-model-agnostic-llm

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Awesome Model-Agnostic LLM

A curated list of tools, frameworks, and resources for running large language models without locking into a single model vendor, cloud, or inference runtime.

Curated by PYXIS3.


Why model-agnostic?

The cost of locking into one model vendor compounds. When a better, cheaper, or faster model ships, you want to switch in minutes, not migrate over a quarter. Model-agnostic tooling makes that switch cheap.

This list catalogues the building blocks: serving runtimes, routing layers, evaluation harnesses, observability stacks, and standards that keep your stack portable.


Contents


Serving runtimes

The runtime that loads weights and serves inference. Model-agnostic means it works with multiple model families, supports an OpenAI-compatible wire format, and doesn't tie you to a vendor.

  • vLLM: high-throughput inference with PagedAttention. The default for OpenAI-compatible serving on GPU.
  • TGI (Text Generation Inference): HuggingFace's production-grade serving. Strong on streaming.
  • Triton Inference Server: NVIDIA's general inference server with TensorRT-LLM backend for LLMs.
  • SGLang: structured generation language and runtime. RadixAttention for prefix caching.
  • llama.cpp: CPU, GPU, and Metal inference. The reference for quantised CPU serving.
  • Ollama: wraps llama.cpp with model-management UX. OpenAI-compatible API.
  • MLC LLM: universal deployment via Apache TVM. Cross-platform including mobile.
  • LMDeploy: production-grade with TurboMind backend.
  • CTranslate2: fast transformer inference, CPU-first.

Inference routers and gateways

Route requests across models, runtimes, or providers. The Funnel layer in a model-agnostic stack.

  • LiteLLM: unified API across 100+ LLM providers. The reference for client-side routing.
  • OpenRouter: hosted multi-provider router with OpenAI-compatible API.
  • Portkey: gateway with fallback, retry, load-balancing, and semantic caching.
  • Cloudflare AI Gateway: edge-deployed router with caching and rate-limiting.
  • Helicone: observability gateway, OpenAI-compatible proxy.

Evaluation and benchmarking

Measure honestly with the same prompts, the same metrics, and multiple runtimes or models.

  • vllm-bench: TTFT and TPOT benchmark for OpenAI-compatible endpoints (this org's tool).
  • lm-evaluation-harness: EleutherAI's standard for academic benchmarks.
  • OpenAI evals: eval framework and curated benchmark suite.
  • lighteval: HuggingFace's eval framework.
  • Promptfoo: prompt-level evaluation and regression testing.
  • DeepEval: unit-test framework for LLMs.

Observability and monitoring

What's serving, how fast, at what cost, drifting how. The Lens layer.

  • lens: Kubernetes-native observability with LLM-endpoint discovery (this org's tool).
  • Langfuse: tracing and analytics for LLM applications.
  • Arize Phoenix: open-source LLM and ML observability with evaluation.
  • OpenLLMetry: OpenTelemetry conventions for LLM observability.
  • Helicone: logging, analytics, and caching gateway.

Model formats and quantisation

Portable model weights and the tools that compress them.

  • GGUF: successor to GGML. llama.cpp's format. Mainstream for CPU and Mac inference.
  • safetensors: safe alternative to pickle. Default for HuggingFace weights.
  • bitsandbytes: 8-bit and 4-bit quantisation kernels.
  • GPTQ: post-training quantisation, 4-bit.
  • AWQ: Activation-aware Weight Quantisation. Better quality than GPTQ at 4-bit.
  • ExLlamaV2: fast 4-bit and EXL2 quantisation runtime.

Open standards and specifications

The interface standards that make model-agnostic possible.

Vector databases and embeddings

The retrieval side of model-agnostic AI.

  • sentence-transformers: default for open-weight embeddings.
  • sqlite-vec: SQLite vector search extension. Embeddable.
  • Qdrant: open-source vector database, Rust.
  • Milvus: scalable vector database, CNCF graduated.
  • Chroma: developer-friendly vector store.
  • pgvector: PostgreSQL vector extension.

Orchestration and deployment

Run LLMs on Kubernetes, cloud, or on-prem.

  • KServe: Kubernetes-native ML serving. Generic, with LLM support via runtimes.
  • Seldon Core: MLOps platform with LLM inference path.
  • BentoML: model serving framework.
  • Ray Serve: distributed serving on Ray.
  • KEDA: event-driven autoscaling, essential for scale-to-zero LLM serving.
  • Knative: serverless Kubernetes. Scale-to-zero primitive.

Cost and token economics

Measure and control LLM spend.

Adjacent reading


Contributing

PRs welcome. See CONTRIBUTING.md. Add new entries in alphabetical order within each section, with a one-line description.

Maintenance

Supporting documentation lives in docs/, example inputs live in examples/, and lightweight validation notes live in tests/smoke/.

About

Curated list of open-source LLM-serving runtimes, routers, evaluators, and standards. Run LLMs without locking into one vendor.

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors