Global Contextual Memory Fabric for cross-datacenter LLM serving.
Membrane is a distributed, content-addressed, reconstruction-driven memory system built on the analytical throughput model from the "Prefill-as-a-Service" paper. It decouples the KV cache from GPU memory and distributes it across a cluster, with the goal of improving throughput and latency for LLM inference.
- Analytical Throughput Model — Verbatim reproduction of Equations (1)–(6) from the paper
- Throughput-Optimal Configuration — Grid search over routing threshold and PD split ratio
- Dual-Timescale Scheduling — Bandwidth-aware short-term routing with long-term reallocation
- Fragment Data Model — Immutable, content-addressed KV segments with structural signatures
- Four In-Memory Indices — Exact, Semantic, Positional, and Co-access lookup
- Reconstruction Engine — Context rebuilding from fragments with prefill fallback
- Multi-Node Networking — Gossip-based cluster management, consistent hashing, peer transfer
- Multi-Tenant Isolation — Canonical store with per-tenant policies and deduplication
- Pluggable Backends — CPU, GPU, Transformers, OpenAI, Anthropic, Ollama compute backends
- Multiple Transports — HTTP (stdlib + FastAPI) and gRPC server options
- Redis Persistence — LRU eviction and distributed storage
- CLI with TUI Dashboard — Live monitoring, cluster status, and interactive setup wizard
- 548+ Tests — Comprehensive test suite across Python 3.10–3.13
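Of the features above, consistent hashing for placing content-addressed fragments on nodes is the easiest to illustrate in isolation. The following is a minimal sketch, not Membrane's implementation: the `HashRing` class, its methods, and the virtual-node count are hypothetical names chosen for illustration, and only the Python standard library is used.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Map a string to a 64-bit integer position on the ring."""
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, illustrative only)."""

    def __init__(self, nodes=(), vnodes: int = 64):
        self._vnodes = vnodes
        self._ring = []  # sorted list of (position, node_id) pairs
        for node in nodes:
            self.add(node)

    def add(self, node_id: str) -> None:
        # Each node is placed at several positions to even out the load.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (ring_position(f"{node_id}#{i}"), node_id))

    def remove(self, node_id: str) -> None:
        self._ring = [entry for entry in self._ring if entry[1] != node_id]

    def owner(self, content_hash: str) -> str:
        if not self._ring:
            raise LookupError("ring is empty")
        # First virtual node at or after the key's position, wrapping around.
        idx = bisect.bisect_left(self._ring, (ring_position(content_hash), ""))
        return self._ring[idx % len(self._ring)][1]


# Usage: removing a node only relocates the fragments that node owned.
ring = HashRing(["n1", "n2", "n3"])
keys = [hashlib.sha256(f"frag-{i}".encode()).hexdigest() for i in range(1000)]
before = {k: ring.owner(k) for k in keys}
ring.remove("n2")
after = {k: ring.owner(k) for k in keys}
```

The property that matters for a distributed KV cache is stability: when a node leaves, fragments owned by the remaining nodes stay where they are, so only the departed node's share has to be transferred or rebuilt.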
git clone https://github.com/sachn-cs/membrane.git
cd membrane
python -m venv .venv
source .venv/bin/activate
# Core installation (typer + rich CLI)
pip install -e ".[dev]"
# Optional: Server dependencies (FastAPI, gRPC, Redis)
pip install -e ".[server]"
# Optional: GPU backend (PyTorch CUDA)
pip install -e ".[gpu]"
# Optional: Local LLM backend (HuggingFace Transformers)
pip install -e ".[local-llm]"

# Verify installation
python -c "import membrane; print(f'{len(membrane.__all__)} exports available')"
# Run paper reproduction demo
python scripts/demo.py
# Run full multi-phase demo
python scripts/demo_full.py
# Run multi-node simulation
python scripts/demo_membrane.py
# Start the server
membrane serve --node-id n1 --port 8080 --transport http --compute cpu

# Start a Membrane server
membrane serve --node-id n1 --port 8080 --transport http --compute cpu
# Open live TUI dashboard
membrane dashboard --host localhost --port 8080
# Show server status
membrane status
# Show current configuration
membrane config

import membrane
# Create a fragment store
from membrane.fragment_store import FragmentStore
store = FragmentStore()
# Create fragments
from membrane.fragment import Fragment
from membrane.structural_signature import StructuralSignature
sig = StructuralSignature(model="llama-3", layer=0, token_span=(0, 128))
frag = Fragment(content=b"kv-data", signature=sig)
# Store and retrieve
store.put(frag)
retrieved = store.get(frag.content_hash)

# Build and run
docker compose up --build
# Run tests
docker compose --profile test run --rm membrane-tests

Configuration is managed via environment variables. See .env.example for all options.

| Variable | Default | Description |
|---|---|---|
| MEMBRANE_LOG_LEVEL | INFO | Logging level |
| MEMBRANE_REDIS_URL | redis://localhost:6379/0 | Redis connection URL |
| MEMBRANE_NODE_ID | membrane-0 | Unique node identifier |
| MEMBRANE_TRANSPORT | http | Transport protocol (http or grpc) |
| MEMBRANE_COMPUTE | cpu | Compute backend |
| MEMBRANE_PORT | 8080 | Server listen port |
| MEMBRANE_HOST | 0.0.0.0 | Server bind host |
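
To make the defaults above concrete, here is a minimal sketch of reading these variables from the environment. It is an assumption about how such a loader could look, not Membrane's actual configuration code: `Settings` and `load_settings` are hypothetical names, and the transport validation is added only for illustration.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the variables in the table above."""
    log_level: str
    redis_url: str
    node_id: str
    transport: str
    compute: str
    port: int
    host: str


def load_settings(env=None) -> Settings:
    """Read MEMBRANE_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    transport = env.get("MEMBRANE_TRANSPORT", "http")
    if transport not in ("http", "grpc"):
        raise ValueError(f"unsupported transport: {transport!r}")
    return Settings(
        log_level=env.get("MEMBRANE_LOG_LEVEL", "INFO"),
        redis_url=env.get("MEMBRANE_REDIS_URL", "redis://localhost:6379/0"),
        node_id=env.get("MEMBRANE_NODE_ID", "membrane-0"),
        transport=transport,
        compute=env.get("MEMBRANE_COMPUTE", "cpu"),
        port=int(env.get("MEMBRANE_PORT", "8080")),  # env values are strings
        host=env.get("MEMBRANE_HOST", "0.0.0.0"),
    )


# Usage: override one variable and keep the documented defaults for the rest.
settings = load_settings({"MEMBRANE_PORT": "9000"})
```

Passing a plain dictionary instead of `os.environ` keeps the loader easy to test without mutating the process environment.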
membrane/
├── model/ # Analytical model (paper reproduction)
│ ├── throughput_model.py # Equations (1)–(6)
│ ├── optimizer.py # Grid search optimizer
│ ├── scheduler.py # Dual-timescale scheduler
│ ├── workload.py # Log-normal workload generator
│ └── simulator.py # End-to-end simulations
├── compute/ # Compute backends (CPU, GPU, API)
├── persistence/ # Storage backends (Memory, Redis)
├── transport/ # Network transports (HTTP, gRPC)
├── network/ # Cluster management and peer networking
├── fragment.py # Core fragment data model
├── indices.py # Four in-memory index types
├── reconstruction_engine.py # Context reconstruction from fragments
├── server.py # Unified production server
├── cli.py # CLI with TUI dashboard
└── ...
tests/ # 548+ tests across Python 3.10–3.13
scripts/ # Demo scripts
deployment/ # Systemd, nginx configs
docs/ # Documentation
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
# Run type checking
python -m mypy membrane/
# Run linting
ruff check membrane/ tests/
# Run format check
ruff format --check membrane/ tests/
# Auto-format code
ruff format membrane/ tests/
# Run with coverage
pytest tests/ --cov=membrane --cov-report=term-missing

# Full setup: install, type-check, and run tests
bash scripts/setup.sh
# Clean build artifacts and caches
bash scripts/cleanup.sh

| Component | Technology |
|---|---|
| Language | Python 3.10+ |
| CLI | Typer + Rich |
| HTTP Server | FastAPI / uvicorn / stdlib |
| gRPC | grpcio / grpcio-tools |
| Persistence | Redis |
| Compute | PyTorch, HuggingFace Transformers, OpenAI/Anthropic APIs |
| Testing | pytest, mypy |
| Containerization | Docker, Docker Compose |
| CI/CD | GitHub Actions |
| Load Balancer | nginx |
- Kubernetes operator for automatic scaling
- Prometheus/Grafana metrics exporter
- TLS/mTLS for transport encryption
- Rate limiting and API key authentication
- S3-compatible blob storage backend
- WebAssembly compute backend
- Web UI dashboard
- gRPC streaming for real-time updates
- Fragment compression and deduplication improvements
- Multi-region replication policies
See CONTRIBUTING.md for guidelines on how to contribute.
See CODE_OF_CONDUCT.md.
See SECURITY.md for reporting vulnerabilities.
This project is licensed under the MIT License — see LICENSE for details.