Warply

Python control plane for inference and self-improving systems.

Warply turns serving intent into a runnable deployment plan: prefill/decode pools, SkyPilot provisioning, SGLang launch flags, NIXL KV transfer, router endpoints, and an OpenAI-compatible client. The goal is to make advanced LLM serving programmable from a single `import warply`, without asking every researcher or startup to own Kubernetes, CRDs, or per-cloud launch glue.

Learn more at warply.ai.

Status: Pre-alpha. Local mock lifecycle, compiler/export, SGLang/NIXL adapters, OpenAI-compatible HTTP client, and SkyPilot Lambda dry-run paths are implemented. Live GPU integration remains gated and experimental.

Why Warply?

Modern LLM serving stacks are powerful, but stitching together GPUs, clouds, engine flags, KV-transfer settings, routing, health checks, and client endpoints still takes too much bespoke infrastructure work.

Warply focuses on the user-facing control plane:

Launch and tear down model-serving systems from Python.
Compile one declarative spec into provisioning, engine, KV-transfer, and routing plans.
Scale prefill/decode pools independently as the workload changes.
Keep cloud provisioning, runtime selection, and client binding behind a small SDK.
Grow toward rollout, eval, and RL/self-improvement workflows without changing the user's entrypoint.

The intent is simple: a researcher or small team should be able to describe the serving system they want, launch it on their cloud, inspect what was deployed, and iterate without becoming a full-time inference infrastructure team.

What Works Today

| Area | Current support |
| --- | --- |
| SDK | `DisaggEngine`, `Pool`, `up()`, `down()`, `client()`, `generate()`, and `scale()` (local mock only) |
| Compiler | Deterministic `DeploymentPlan`, `engine.plan()`, `engine.export_yaml()` |
| Engine | SGLang adapter for prefill, decode, and router process configs |
| KV transfer | NIXL for CUDA plans; `kv_transfer="auto"` resolves to NIXL on known CUDA GPUs |
| Cloud | SkyPilot Lambda/CoreWeave provider skeleton; Lambda dry-run and task rendering |
| Placement | One prefill node plus N decode nodes in one SkyPilot multi-node cluster |
| Client | Mock local client plus OpenAI-compatible HTTP client for deployed routers |
| Hardware planning | CUDA and ROCm accelerator profiles; live ROCm launch intentionally disabled |
| Speculative decoding | Config and plan export scaffold for engine-native, MTP, EAGLE, DFlash, and draft-model modes |
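
To make the `kv_transfer="auto"` row concrete, here is a minimal sketch of how such a resolution step could work. It is not Warply's implementation: `Accelerator`, `parse_accelerator`, `resolve_kv_transfer`, and the `KNOWN_CUDA_GPUS` set are hypothetical names, and the only assumptions taken from this README are the `"<count>x<name>"` pool string and the rule that known CUDA GPUs resolve to NIXL.

```python
from dataclasses import dataclass

# Hypothetical lookup; Warply's real accelerator profiles are not shown here.
KNOWN_CUDA_GPUS = {"H100"}


@dataclass(frozen=True)
class Accelerator:
    count: int
    name: str


def parse_accelerator(spec: str) -> Accelerator:
    """Split a '<count>x<name>' string such as '1xH100'."""
    count, sep, name = spec.partition("x")
    if not sep or not count.isdigit() or int(count) < 1 or not name:
        raise ValueError(f"expected '<count>x<name>', got {spec!r}")
    return Accelerator(count=int(count), name=name)


def resolve_kv_transfer(kv_transfer: str, spec: str) -> str:
    """Return the KV-transfer backend, resolving 'auto' from the accelerator."""
    if kv_transfer != "auto":
        return kv_transfer  # an explicit choice is passed through unchanged
    accelerator = parse_accelerator(spec)
    if accelerator.name in KNOWN_CUDA_GPUS:
        return "nixl"
    raise ValueError(f"no validated KV-transfer backend for {accelerator.name}")


print(resolve_kv_transfer("auto", "1xH100"))
```

Failing loudly on an unknown accelerator, rather than guessing a backend, mirrors the fail-fast behaviour this README describes for ROCm.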

Quick Start

Install from source:

git clone https://github.com/afifi-yusuf/warply.git
cd warply
pip install -e ".[dev]"

Run the no-GPU local lifecycle:

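import warply as wp

# Describe the system: one prefill pool and one decode pool.
engine = wp.DisaggEngine(
    model="meta-llama/Llama-3.1-8B",
    prefill=wp.Pool("1xH100", replicas=1),
    decode=wp.Pool("1xH100", replicas=1),
    backend="sglang",
    kv_transfer="nixl",
    cloud="local",  # mock runtime; does not start SGLang
)

engine.up()                      # start the mock lifecycle
print(engine.generate("hello"))  # answered by the mock local client
print(engine.status())
engine.down()                    # tear the deployment down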

Inspect the compiled plan:

print(engine.plan())
print(engine.export_yaml())
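
One way to read "deterministic" here is that the same spec always exports byte-identical text, so plans can be diffed, cached, and hashed. The sketch below illustrates that property with the standard library's `json` module rather than Warply's YAML exporter; `export_plan`, `plan_fingerprint`, and the plan fields are hypothetical.

```python
import hashlib
import json


def export_plan(plan: dict) -> str:
    """Serialize with sorted keys so insertion order never changes the output."""
    return json.dumps(plan, sort_keys=True, indent=2)


def plan_fingerprint(plan: dict) -> str:
    """Short content hash of the exported text, handy for diffing two plans."""
    digest = hashlib.sha256(export_plan(plan).encode("utf-8")).hexdigest()
    return digest[:12]


# The same plan built in two different key orders (illustrative fields only).
plan_a = {"model": "meta-llama/Llama-3.1-8B", "prefill": {"replicas": 1}, "decode": {"replicas": 2}}
plan_b = {"decode": {"replicas": 2}, "prefill": {"replicas": 1}, "model": "meta-llama/Llama-3.1-8B"}

print(plan_fingerprint(plan_a) == plan_fingerprint(plan_b))
```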

Cloud Dry Run

Use `WARPLY_SKYPILOT_DRY_RUN=1` to exercise the Lambda control path without GPUs, SkyPilot credentials, or cloud spend:

WARPLY_SKYPILOT_DRY_RUN=1 python - <<'PY'
import warply as wp

engine = wp.DisaggEngine(
    model="meta-llama/Llama-3.1-8B",
    prefill=wp.Pool("1xH100", replicas=1),
    decode=wp.Pool("1xH100", replicas=2),
    cloud="lambda",
)
engine.up()
print(engine.status().endpoint)
engine.down()
PY
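
Because the deployed router is OpenAI-compatible, the printed endpoint can in principle be called with any HTTP client. The sketch below only builds the request and does not send it. It assumes that the endpoint is a base URL and that the router exposes the standard `/v1/completions` route; `build_completion_request` and the example address are hypothetical.

```python
import json
import urllib.request


def build_completion_request(endpoint: str, model: str, prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style completions request against a router base URL."""
    url = endpoint.rstrip("/") + "/v1/completions"
    payload = {"model": model, "prompt": prompt, "max_tokens": 32}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder address; a real one would come from engine.status().endpoint.
req = build_completion_request("http://127.0.0.1:8000", "meta-llama/Llama-3.1-8B", "hello")
print(req.get_method(), req.full_url)

# Sending the request needs a live router, so it is left commented out:
# with urllib.request.urlopen(req, timeout=30) as resp:
#     print(json.load(resp)["choices"][0]["text"])
```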

For live Lambda integration, install cloud extras and opt in explicitly:

pip install -e ".[cloud,dev]"
WARPLY_INTEGRATION=1 pytest tests/test_integration_lambda.py

Live integration may launch paid GPU instances.

Current Limits

`cloud="local"` is a mock runtime; it does not start SGLang.
Live cloud `scale()` is not implemented yet; relaunch with a new spec.
Cloud disaggregated serving currently supports only `prefill.replicas == 1` with `decode.replicas >= 1`.
CUDA/SGLang/NIXL is the only live target under active validation.
AMD Instinct specs such as `wp.Pool("1xMI300X")` compile to ROCm-aware plans, but live ROCm launch fails fast until a ROCm image and a transfer backend such as MORI are validated.
Speculative decoding modes compile and export, but backend launch flags stay disabled until exact SGLang/vLLM support is validated.
KV-aware routing, stats, vLLM/TensorRT-LLM adapters, Dynamo runtime integration, and RL loops are roadmap items.
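
The replica and ROCm limits above lend themselves to a pre-flight check that runs before anything is provisioned. The following is a rough sketch under stated assumptions, not Warply's validator: `PoolSpec`, `check_cloud_limits`, and `ROCM_ACCELERATORS` are hypothetical, and exempting the local mock from the replica rule is a guess based on the limits being described as cloud-only.

```python
from dataclasses import dataclass

# Hypothetical suffixes for accelerators that would need a ROCm image.
ROCM_ACCELERATORS = ("MI300X",)


@dataclass(frozen=True)
class PoolSpec:
    accelerator: str
    replicas: int = 1


def check_cloud_limits(prefill: PoolSpec, decode: PoolSpec, cloud: str) -> list[str]:
    """Return human-readable problems; an empty list means the spec may launch."""
    problems: list[str] = []
    if cloud == "local":
        return problems  # assumption: the mock runtime is not bound by cloud limits
    if prefill.replicas != 1:
        problems.append("cloud deployments need exactly one prefill replica")
    if decode.replicas < 1:
        problems.append("cloud deployments need at least one decode replica")
    if any(pool.accelerator.endswith(ROCM_ACCELERATORS) for pool in (prefill, decode)):
        problems.append("live ROCm launch is disabled")
    return problems


print(check_cloud_limits(PoolSpec("1xH100", 2), PoolSpec("1xH100", 0), "lambda"))
```

Collecting every problem instead of raising on the first one lets a user fix the whole spec in one pass.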

Architecture

DisaggEngine spec
  -> compiler
  -> DeploymentPlan
  -> provider adapter      SkyPilot, local mock, future direct providers
  -> engine adapter        SGLang now; vLLM / TensorRT-LLM later
  -> KV adapter            NIXL now; MORI / Mooncake / LMCache candidates later
  -> router + client       OpenAI-compatible endpoint

Warply is intentionally Python-first. Hot-path serving remains inside engines and runtimes that already specialize in kernels, batching, scheduling, and transport.
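
The layering above can be pictured as a compiler that asks each adapter for its own section of the plan. This is an illustrative sketch only: `Spec`, `Adapter`, the four adapter classes, and `compile_plan` are hypothetical names, and the rendered fields are placeholders rather than Warply's real plan schema.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class Spec:
    model: str
    prefill_replicas: int
    decode_replicas: int
    cloud: str = "local"
    backend: str = "sglang"
    kv_transfer: str = "nixl"


class Adapter(Protocol):
    section: str

    def render(self, spec: Spec) -> dict: ...


class ProviderAdapter:
    section = "provider"

    def render(self, spec: Spec) -> dict:
        # One node per replica, matching "one prefill node plus N decode nodes".
        return {"cloud": spec.cloud, "nodes": spec.prefill_replicas + spec.decode_replicas}


class EngineAdapter:
    section = "engine"

    def render(self, spec: Spec) -> dict:
        roles = ["prefill"] * spec.prefill_replicas + ["decode"] * spec.decode_replicas
        return {"backend": spec.backend, "roles": roles}


class KVAdapter:
    section = "kv_transfer"

    def render(self, spec: Spec) -> dict:
        return {"backend": spec.kv_transfer}


class RouterAdapter:
    section = "router"

    def render(self, spec: Spec) -> dict:
        return {"api": "openai-compatible"}


DEFAULT_ADAPTERS = (ProviderAdapter(), EngineAdapter(), KVAdapter(), RouterAdapter())


def compile_plan(spec: Spec, adapters: Sequence[Adapter]) -> dict:
    """Ask each adapter, in order, for its section of the plan."""
    plan: dict = {"model": spec.model}
    for adapter in adapters:
        plan[adapter.section] = adapter.render(spec)
    return plan


plan = compile_plan(Spec("meta-llama/Llama-3.1-8B", 1, 2, cloud="lambda"), DEFAULT_ADAPTERS)
print(plan["provider"], plan["engine"]["roles"])
```

Keeping each adapter behind the same small interface is what would let a future vLLM or MORI adapter slot in without touching the compiler loop.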

Roadmap

| Phase | Focus |
| --- | --- |
| Phase 0 | Validate live SGLang/NIXL Lambda serving, add `engine.stats()`, improve 1:N P/D scaling |
| Phase 1 | vLLM adapter, speculative decoding launch support, KV-aware routing, AWS/CoreWeave polish |
| Phase 2 | RL rollout pools, eval/judge pools, self-improvement workflows, policy-driven scaling |
| Later | ROCm live launch, TensorRT-LLM adapter, richer observability, managed control plane |

Track planned work in GitHub issues.

Development

pip install -e ".[dev]"
ruff check warply tests
pytest -q

CI runs the same checks on Python 3.10, 3.11, and 3.12. GPU/cloud tests are skipped unless explicitly enabled with `WARPLY_INTEGRATION=1`.
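
An opt-in gate like this is commonly a small environment check wired into a skip marker. The sketch below shows one possible shape; `integration_enabled` is a hypothetical helper, and treating only the exact value `"1"` as enabled is an assumption, not a documented rule.

```python
import os
from typing import Mapping


def integration_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """True only when WARPLY_INTEGRATION is set to exactly '1'."""
    return environ.get("WARPLY_INTEGRATION") == "1"


# In a test module, the helper could back a pytest skip marker:
#
#   import pytest
#
#   requires_integration = pytest.mark.skipif(
#       not integration_enabled(),
#       reason="set WARPLY_INTEGRATION=1 to run live GPU/cloud tests",
#   )

print(integration_enabled({"WARPLY_INTEGRATION": "1"}))
```

Passing the environment as a parameter keeps the gate itself testable without mutating `os.environ`.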

Community

Website: warply.ai
Provider status: docs/providers.md
Issues: bugs, feature requests, and design discussions
Contributing guide: CONTRIBUTING.md
Security policy: SECURITY.md
Code of conduct: CODE_OF_CONDUCT.md

License

Apache 2.0. See LICENSE.
