diff --git a/README.md b/README.md index 785695284..9c66c78f1 100644 --- a/README.md +++ b/README.md @@ -28,6 +28,41 @@ opportunistic depending on what open weight models exist in a given moment. If a new model will be supported, the old one may be removed completely and no longer supported, unless there is some kind of overlap of abilities. +## Quickstart + +This gets you from a clean checkout to a running model. The example uses the +2-bit imatrix Flash GGUF, which fits machines with 96/128 GB of RAM. See +[Model Weights](#model-weights) for other quants and the larger PRO model, and +[Running models larger than RAM](#running-models-larger-than-ram) for SSD +streaming on smaller machines. + +```sh +# 1. Build for your platform. +make # macOS (Metal) — the primary target +# make cuda-spark # Linux CUDA, DGX Spark / GB10 +# make cuda-generic # Linux CUDA, other local CUDA GPUs +# make cpu # CPU-only diagnostics build (see warning below) + +# 2. Download a model. This pulls several tens of GB into ./gguf/ and points +# ./ds4flash.gguf at it. Downloads resume, so it is safe to re-run. +./download_model.sh q2-imatrix + +# 3. Run a one-shot prompt... +./ds4 -p "Explain Redis streams in one paragraph." --nothink + +# 4. ...or start the interactive chat (multi-turn, /help for commands). +./ds4 + +# 5. ...or start an OpenAI/Anthropic-compatible HTTP server on :8080. +./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192 +``` + +`./ds4flash.gguf` is the default model path for all binaries; pass `-m` to pick +another GGUF from `./gguf/`. The CLI defaults to thinking mode; use `--nothink` +for direct answers. Run `./ds4 --help` and `./ds4-server --help` for the full +flag list. On macOS, **do not use the `cpu` build for inference**: current macOS +versions have a virtual-memory bug that crashes the kernel on the CPU path. + ## Motivations * Very capable open weight models finally exist. DeepSeek v4 Flash feels quasi-frontier. The PRO is even better. Both resist 2 bit quantization very well.