Python control plane for inference and self-improving systems.
Warply turns serving intent into a runnable deployment plan: prefill/decode pools,
SkyPilot provisioning, SGLang launch flags, NIXL KV transfer, router endpoints, and an
OpenAI-compatible client. The goal is to make advanced LLM serving programmable from
`import warply`, without asking every researcher or startup to own Kubernetes, CRDs, or
per-cloud launch glue.
Learn more at warply.ai.
Status: Pre-alpha. Local mock lifecycle, compiler/export, SGLang/NIXL adapters, OpenAI-compatible HTTP client, and SkyPilot Lambda dry-run paths are implemented. Live GPU integration remains gated and experimental.
Modern LLM serving stacks are powerful, but stitching together GPUs, clouds, engine flags, KV-transfer settings, routing, health checks, and client endpoints still takes too much bespoke infrastructure work.
Warply focuses on the user-facing control plane:
- Launch and tear down model-serving systems from Python.
- Compile one declarative spec into provisioning, engine, KV-transfer, and routing plans.
- Scale prefill/decode pools independently as the workload changes.
- Keep cloud provisioning, runtime selection, and client binding behind a small SDK.
- Grow toward rollout, eval, and RL/self-improvement workflows without changing the user's entrypoint.
The intent is simple: a researcher or small team should be able to describe the serving system they want, launch it on their cloud, inspect what was deployed, and iterate without becoming a full-time inference infrastructure team.
| Area | Current support |
|---|---|
| SDK | `DisaggEngine`, `Pool`, `up()`, `down()`, `scale()` for local mock, `client()`, `generate()` |
| Compiler | Deterministic `DeploymentPlan`, `engine.plan()`, `engine.export_yaml()` |
| Engine | SGLang adapter for prefill, decode, and router process configs |
| KV transfer | NIXL for CUDA plans; `kv_transfer="auto"` resolves to NIXL on known CUDA GPUs |
| Cloud | SkyPilot Lambda/CoreWeave provider skeleton; Lambda dry-run and task rendering |
| Placement | One prefill node plus N decode nodes in one SkyPilot multi-node cluster |
| Client | Mock local client plus OpenAI-compatible HTTP client for deployed routers |
| Hardware planning | CUDA and ROCm accelerator profiles; live ROCm launch intentionally disabled |
| Speculative decoding | Config and plan export scaffold for engine-native, MTP, EAGLE, DFlash, and draft-model modes |
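
The `kv_transfer="auto"` row can be illustrated with a small resolver. This is a minimal sketch, not Warply's implementation: the `CUDA_GPUS` table and the `parse_pool_spec` and `resolve_kv_transfer` helpers are hypothetical names, and the behavior for non-CUDA hardware (raising an error) is an assumption.

```python
import re

# Hypothetical table of "known CUDA GPUs"; Warply's real profiles may differ.
CUDA_GPUS = {"A100", "H100", "H200"}


def parse_pool_spec(spec: str) -> tuple[int, str]:
    """Split a spec such as "1xH100" into (gpu_count, gpu_name)."""
    match = re.fullmatch(r"(\d+)x([A-Za-z0-9]+)", spec)
    if match is None:
        raise ValueError(f"unrecognized accelerator spec: {spec!r}")
    return int(match.group(1)), match.group(2).upper()


def resolve_kv_transfer(spec: str, kv_transfer: str = "auto") -> str:
    """Resolve "auto" to a concrete backend; explicit choices pass through."""
    if kv_transfer != "auto":
        return kv_transfer
    _, gpu = parse_pool_spec(spec)
    if gpu in CUDA_GPUS:
        return "nixl"
    # Assumption: anything outside the CUDA table has no validated backend.
    raise ValueError(f"no validated KV-transfer backend for {gpu}")
```

Resolving at compile time rather than at launch keeps the exported plan explicit about which transfer backend a deployment would use.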
Install from source:

```bash
git clone https://github.com/afifi-yusuf/warply.git
cd warply
pip install -e ".[dev]"
```

Run the no-GPU local lifecycle:
```python
import warply as wp

engine = wp.DisaggEngine(
    model="meta-llama/Llama-3.1-8B",
    prefill=wp.Pool("1xH100", replicas=1),
    decode=wp.Pool("1xH100", replicas=1),
    backend="sglang",
    kv_transfer="nixl",
    cloud="local",
)
engine.up()
print(engine.generate("hello"))
print(engine.status())
engine.down()
```

Inspect the compiled plan:
```python
print(engine.plan())
print(engine.export_yaml())
```

Use `WARPLY_SKYPILOT_DRY_RUN=1` to exercise the Lambda control path without GPUs, SkyPilot
credentials, or cloud spend:
```bash
WARPLY_SKYPILOT_DRY_RUN=1 python - <<'PY'
import warply as wp

engine = wp.DisaggEngine(
    model="meta-llama/Llama-3.1-8B",
    prefill=wp.Pool("1xH100", replicas=1),
    decode=wp.Pool("1xH100", replicas=2),
    cloud="lambda",
)
engine.up()
print(engine.status().endpoint)
engine.down()
PY
```

For live Lambda integration, install cloud extras and opt in explicitly:
```bash
pip install -e ".[cloud,dev]"
WARPLY_INTEGRATION=1 pytest tests/test_integration_lambda.py
```

Live integration may launch paid GPU instances.
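
A deployed router is described as OpenAI-compatible, so it should accept a plain HTTP request without the SDK. The sketch below builds such a request using only the standard library. It assumes the router serves the conventional `/v1/completions` route; the `build_completion_request` helper and the `127.0.0.1:8000` address are placeholders, not part of Warply.

```python
import json
from urllib.request import Request


def build_completion_request(endpoint: str, model: str, prompt: str) -> Request:
    """Build (but do not send) an OpenAI-style text completion request."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 32}
    return Request(
        url=endpoint.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder endpoint: substitute the router URL reported by the deployment.
request = build_completion_request(
    "http://127.0.0.1:8000", "meta-llama/Llama-3.1-8B", "hello"
)
# To send it: urllib.request.urlopen(request), then JSON-decode the body.
```

Building the request separately from sending it makes the payload easy to inspect before anything reaches a live, billable endpoint.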
- `cloud="local"` is a mock runtime; it does not start SGLang.
- Live cloud `scale()` is not implemented yet; relaunch with a new spec.
- Cloud disagg currently supports `prefill.replicas == 1` and `decode.replicas >= 1`.
- CUDA/SGLang/NIXL is the only live target under active validation.
- AMD Instinct specs such as `wp.Pool("1xMI300X")` compile to ROCm-aware plans, but live ROCm launch fails fast until a ROCm image and transfer backend such as MORI are validated.
- Speculative decoding modes compile/export, but backend launch flags are not enabled until exact SGLang/vLLM support is validated.
- KV-aware routing, stats, vLLM/TensorRT-LLM adapters, Dynamo runtime integration, and RL loops are roadmap items.
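
The replica and ROCm limits above amount to a pre-launch check. The following is a rough sketch of what such a fail-fast guard could look like. `check_cloud_disagg` is a hypothetical helper, and detecting AMD Instinct parts by an `MI` prefix is a simplifying assumption, not how Warply classifies hardware.

```python
def check_cloud_disagg(prefill_replicas: int, decode_replicas: int, accelerator: str) -> None:
    """Raise before any cloud resources are requested if the spec is unsupported."""
    if prefill_replicas != 1:
        raise ValueError("cloud disagg currently requires prefill.replicas == 1")
    if decode_replicas < 1:
        raise ValueError("cloud disagg currently requires decode.replicas >= 1")
    # "1xMI300X" -> "MI300X"; the split is on the lowercase count separator only.
    gpu = accelerator.split("x", 1)[-1].upper()
    if gpu.startswith("MI"):  # crude stand-in for "AMD Instinct"
        raise NotImplementedError("live ROCm launch is disabled until validated")


check_cloud_disagg(1, 2, "1xH100")  # passes silently for a supported spec
```

Rejecting an unsupported spec up front means the error surfaces before any provisioning call, which matters when a launch can incur GPU cost.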
```text
DisaggEngine spec
  -> compiler
  -> DeploymentPlan
  -> provider adapter    SkyPilot, local mock, future direct providers
  -> engine adapter      SGLang now; vLLM / TensorRT-LLM later
  -> KV adapter          NIXL now; MORI / Mooncake / LMCache candidates later
  -> router + client     OpenAI-compatible endpoint
```
Warply is intentionally Python-first. Hot-path serving remains inside engines and runtimes that already specialize in kernels, batching, scheduling, and transport.
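
To make the spec-to-plan flow concrete, here is a minimal sketch of a deterministic compile step. It is an illustration only: `PoolSpec`, `Spec`, `compile_plan`, the plan layout, and the `plan_id` digest are all hypothetical and do not reflect Warply's actual `DeploymentPlan` schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PoolSpec:
    accelerator: str
    replicas: int


@dataclass(frozen=True)
class Spec:
    model: str
    prefill: PoolSpec
    decode: PoolSpec
    backend: str = "sglang"
    kv_transfer: str = "nixl"
    cloud: str = "local"


def compile_plan(spec: Spec) -> dict:
    """Derive one plan section per adapter from a single declarative spec."""
    plan = {
        "provider": {
            "cloud": spec.cloud,
            "nodes": spec.prefill.replicas + spec.decode.replicas,
        },
        "engine": {
            "backend": spec.backend,
            "model": spec.model,
            "roles": {"prefill": asdict(spec.prefill), "decode": asdict(spec.decode)},
        },
        "kv": {"backend": spec.kv_transfer},
        "router": {"api": "openai-compatible"},
    }
    # Canonical JSON (sorted keys) makes the digest independent of dict ordering.
    canonical = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    plan["plan_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return plan


spec = Spec("meta-llama/Llama-3.1-8B", PoolSpec("1xH100", 1), PoolSpec("1xH100", 2))
plan = compile_plan(spec)
```

Because the function is pure, compiling the same spec twice yields the same plan and the same `plan_id`, which is what makes a plan safe to diff, export, and review before launch.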
| Phase | Focus |
|---|---|
| Phase 0 | Validate live SGLang/NIXL Lambda serving, add `engine.stats()`, improve 1:N P/D scaling |
| Phase 1 | vLLM adapter, speculative decoding launch support, KV-aware routing, AWS/CoreWeave polish |
| Phase 2 | RL rollout pools, eval/judge pools, self-improvement workflows, policy-driven scaling |
| Later | ROCm live launch, TensorRT-LLM adapter, richer observability, managed control plane |
Track planned work in GitHub issues.
```bash
pip install -e ".[dev]"
ruff check warply tests
pytest -q
```

CI runs the same checks on Python 3.10, 3.11, and 3.12. GPU/cloud tests are skipped unless
explicitly enabled with `WARPLY_INTEGRATION=1`.
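
One way to implement this kind of opt-in gate is an environment check feeding a skip decorator. The sketch below uses only the standard library, and pytest also collects and honors `unittest` skips. The `integration_enabled` helper and the test class are hypothetical; the repository's own tests may use pytest markers instead.

```python
import os
import unittest


def integration_enabled(env=None) -> bool:
    """Return True only when WARPLY_INTEGRATION is set to exactly "1"."""
    env = os.environ if env is None else env
    return env.get("WARPLY_INTEGRATION") == "1"


@unittest.skipUnless(integration_enabled(), "set WARPLY_INTEGRATION=1 to run live GPU tests")
class LiveLambdaSmokeTest(unittest.TestCase):
    def test_gate_is_open(self):
        # A real test would launch and tear down a deployment here.
        self.assertTrue(integration_enabled())
```

Requiring the exact value `1` keeps an accidental `WARPLY_INTEGRATION=0` or an empty variable from launching paid instances.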
- Website: warply.ai
- Provider status: docs/providers.md
- Issues: bugs, feature requests, and design discussions
- Contributing guide: CONTRIBUTING.md
- Security policy: SECURITY.md
- Code of conduct: CODE_OF_CONDUCT.md
Apache 2.0. See LICENSE.