Skip to content

Latest commit

 

History

History
39 lines (21 loc) · 2.35 KB

File metadata and controls

39 lines (21 loc) · 2.35 KB

A Tutorial on Running llama.cpp on Windows

The tutorial will start with a non MTP model then move on to running the model with MTP.

I.Getting llama.cpp

Head to https://github.com/ggml-org/llama.cpp and then go to releases and get the latest release for windows. Choose Windows x64 (CUDA 12) unless you are running RTX 5000 series or later.

Extract the zip file into any folder you would like, I personally used C:\llama to make things easier.

II.Choosing the Model

I will be using a MoE model rather than a dense model as the hardware I have is limited. The model I will be using is https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive/tree/main

III. MTP vs Non-MTP

MTP give me a boost that allows me to run a Q6 model on my 3080Ti + 32GB DDR5 6000. Without MTP I usually can run Q4 only.

IV. The Model

After you choose you preferred model download the GGUF I chose https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive/blob/main/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-Q6_K_P.gguf

From here you can either leave the model as is or convert the model into an MTP model using my tutorial https://github.com/triple-octopus/MTPConversion

Make sure to get both the model itself and mmproj https://huggingface.co/HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive/blob/main/mmproj-Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-f16.gguf

After you have your GGUF file both for mmproj and the model, I suggest placing them on an SSD, I placed my models in C:\Models

V.Running the Model

Open a terminal window in C:\llama (or whatever directory you put llama.cpp in) and type in the command (modify it to your liking)

llama-server --model C:\Models\Qwen3.6-35BA3B-MTP-UNCENSORED-Q6_K_P.gguf --ctx-size 131072 --jinja --flash-attn on --cache-type-k q4_0 --cache-type-v q4_0 --temperature 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --presence-penalty 0.0 --repeat-penalty 1.0 --reasoning-budget 4096 --reasoning-budget-message "...I think I've explored this enough, time to respond." --spec-type draft-mtp --spec-draft-n-max 2 --no-mmap --port 6942 --fit on --n-gpu-layers -1

You can now access your AI model on http://localhost:6942 and if you want to disable the webgui (use it only as an API in opencode for example) you can add --no-ui.

In my case I did not use mmproj but you can add it as an argument to your command.