Squeezing text-diffusion models onto your laptop. ⚡
An open-source effort to make diffusion-based language and vision-language models run efficiently on consumer hardware through quantization, optimization, and memory-efficient inference — one model at a time.
Diffusion-based (V)LMs are built for high-end GPUs and servers. This project asks a simpler question: how small can we make them before they stop being useful? Every model we tackle gets the same treatment — measure its real footprint, quantize it, prove it still works, and document exactly what fits where.
- 4-bit and 8-bit quantization
- Memory-efficient diffusion inference
- Vision encoder compression
- ONNX, TensorRT, and OpenVINO optimization
- CPU and integrated GPU acceleration
- Apple Silicon support
- Low-RAM deployment techniques
- Benchmarking quality vs. performance tradeoffs
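To make the first item concrete, here is a minimal sketch of block-wise absmax 4-bit quantization in plain Python. It illustrates the general technique only and is not taken from this project's code: the function names, the block size of 4, and the signed range of -7 to 7 are all assumptions chosen for readability.

```python
from typing import List, Tuple


def quantize_4bit(weights: List[float], block_size: int = 4) -> Tuple[List[int], List[float]]:
    """Map floats to signed 4-bit codes in [-7, 7], with one scale per block.

    Hypothetical helper for illustration; real libraries pack two codes
    per byte and usually use much larger blocks.
    """
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        # The largest magnitude in the block maps to code 7 (or -7).
        scale = absmax / 7 if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-7, min(7, round(w / scale))) for w in block)
    return codes, scales


def dequantize_4bit(codes: List[int], scales: List[float], block_size: int = 4) -> List[float]:
    """Reconstruct approximate floats from the codes and per-block scales."""
    return [code * scales[i // block_size] for i, code in enumerate(codes)]


weights = [1.75, -0.5, 0.3, 0.0, 3.5, 1.0, -3.5, 0.5]
codes, scales = quantize_4bit(weights)
restored = dequantize_4bit(codes, scales)
```

Each weight is stored as a 4-bit code plus a share of its block's scale, which is where the size reduction comes from. The difference between `weights` and `restored` is the rounding error that the quality benchmarks are meant to catch.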
| Model | Status | Result |
|---|---|---|
| Nemotron-Labs-Diffusion-VLM-8B | ✅ 4-bit proven | 5.6 GiB checkpoint, runs in 8.3 GiB (fits a 16 GB laptop), 0-point accuracy drop on MMLU + ScienceQA |
More models coming. Each one follows the same workflow: footprint → quantize → verify → benchmark.
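The footprint step is mostly arithmetic: parameter count times bits per weight. The sketch below shows that projection under stated assumptions. The 8-billion parameter count is a round number taken from the model name, not a measured value, and `weight_footprint_gib` is a hypothetical helper rather than part of this repository.

```python
GIB = 1024 ** 3  # bytes per GiB


def weight_footprint_gib(num_params: int, bits_per_weight: float) -> float:
    """Projected size of the weights alone, in GiB, at a given precision."""
    return num_params * bits_per_weight / 8 / GIB


ASSUMED_PARAMS = 8_000_000_000  # assumption: nominal "8B", not a measured count

for name, bits in [("BF16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {weight_footprint_gib(ASSUMED_PARAMS, bits):.1f} GiB")
```

A real quantized checkpoint will generally be larger than the pure 4-bit projection, because quantization scales and any tensors kept at higher precision add to the total. Runtime memory is higher again, since activations and framework overhead sit on top of the weights.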
Push the boundaries of local AI by bringing state-of-the-art diffusion models to everyday laptops — and documenting every breakthrough along the way.
🚧 Experimental Research Project
Contributions, benchmarks, optimization ideas, and reproducible results are welcome.
- docs/WORKING_NOTES.md — environment setup, model quirks, how to run each phase, troubleshooting, and the roadmap.
- reports/weight_footprint.md — measured weight footprint and per-precision projections.
- reports/quantization_results.md — BF16 vs 4-bit results (3× smaller, ~2× less memory, no meaningful quality loss).
- reports/benchmark.md — speed/memory benchmark (`python -m text_diffusion_quantization.benchmark`).
- reports/eval.md — accuracy on MMLU + ScienceQA proving 4-bit ≈ BF16 (`python -m text_diffusion_quantization.evaluate`).
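One way to turn "4-bit ≈ BF16" into a pass/fail check is to compare accuracy in percentage points against a tolerance. The sketch below is an illustration, not the project's evaluation code: the helper names and the 1-point default tolerance are assumptions, and the answer lists are placeholder values rather than measured results.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)


def accuracy_drop_points(baseline_acc: float, quantized_acc: float) -> float:
    """Drop in percentage points; positive means the quantized model is worse."""
    return (baseline_acc - quantized_acc) * 100


def within_tolerance(baseline_acc: float, quantized_acc: float,
                     max_drop_points: float = 1.0) -> bool:
    """True if the quantized model loses at most max_drop_points (assumed default)."""
    return accuracy_drop_points(baseline_acc, quantized_acc) <= max_drop_points


# Placeholder data for illustration only, not benchmark results.
answers = ["A", "B", "C", "D"]
bf16_predictions = ["A", "B", "C", "A"]
quantized_predictions = ["A", "B", "C", "A"]

bf16_acc = accuracy(bf16_predictions, answers)
quantized_acc = accuracy(quantized_predictions, answers)
passed = within_tolerance(bf16_acc, quantized_acc)
```

Reporting the drop in points, rather than as a ratio, keeps the comparison readable across benchmarks with different baseline accuracies.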