Stable Audio 3 — Windows Setup Guide

Tested on Windows 11 Pro, RTX 4090, Python 3.10, May 2026

The official README assumes Linux. This guide documents every gotcha we hit getting Stable Audio 3 Medium running on Windows with CUDA.

Prerequisites

Tool	Why
Python 3.10	Required by project (`requires-python = ">=3.10"`)
uv	Package manager used by the project
Git	Cloning repos
git-xet	Required for cloning HF model repos with large files
NVIDIA GPU + Driver 550+	CUDA support for Medium model
Hugging Face account	Private repo access (collaborator required)

Install git-xet

winget install git-xet

Install HF CLI

powershell -ExecutionPolicy ByPass -c "irm https://hf.co/cli/install.ps1 | iex"

Login to Hugging Face

hf auth login
# Or, if you have logged in before, verify that a token already exists.
# The token is stored at: %USERPROFILE%\.cache\huggingface\token
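The token check can also be scripted. The following is a minimal sketch, assuming the token sits at the default location given above (it does not account for a relocated cache, for example via an environment variable). The helper names `default_token_path` and `has_token` are made up for this example.

```python
from pathlib import Path


def default_token_path(home: Path) -> Path:
    """Token location quoted in this guide, relative to the user profile."""
    return home / ".cache" / "huggingface" / "token"


def has_token(path: Path) -> bool:
    """True if the token file exists and is not empty."""
    return path.is_file() and path.read_text(encoding="utf-8").strip() != ""


if __name__ == "__main__":
    # Path.home() resolves to %USERPROFILE% on Windows.
    path = default_token_path(Path.home())
    status = "token found" if has_token(path) else "no token, run: hf auth login"
    print(path, "->", status)
```

The script only checks that a token file is present. It does not confirm that the token is valid or that your account has access to the private repo.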

Step 1: Clone the repo and sync dependencies

git clone https://github.com/Stability-AI/stable-audio-3.git
cd stable-audio-3
uv sync --group dev

Step 2: Fix PyTorch — install CUDA version

Problem: uv sync installs CPU-only PyTorch on Windows because pyproject.toml only maps the CUDA index for Linux.

Fix: Reinstall torch + torchaudio with CUDA 12.8:

uv pip install torch==2.7.1+cu128 torchaudio==2.7.1+cu128 --index-url https://download.pytorch.org/whl/cu128 --reinstall

Why cu128 and not cu126? The pre-built flash-attn Windows wheels we found (see Step 4) only exist for cu128. With cu126 you would have to build flash-attn from source on Windows, which requires Visual Studio Build Tools plus the CUDA toolkit and is painful.

Verify:

.\.venv\Scripts\python.exe -c "import torch; print(torch.__version__, '| CUDA:', torch.cuda.is_available())"
# Expected: 2.7.1+cu128 | CUDA: True
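The `+cu128` suffix in the version string is what separates a CUDA build from the CPU-only one. As a rough sketch (the helper names `split_build_tag` and `is_cuda_build` are hypothetical), the suffix can be inspected like this:

```python
def split_build_tag(version: str) -> tuple[str, str]:
    """Split '2.7.1+cu128' into ('2.7.1', 'cu128'); the tag is '' if absent."""
    release, _, tag = version.partition("+")
    return release, tag


def is_cuda_build(version: str) -> bool:
    """True if the local build tag starts with 'cu' (for example 'cu128')."""
    return split_build_tag(version)[1].startswith("cu")


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        release, tag = split_build_tag(torch.__version__)
        print(f"torch {release}, build tag: {tag or '(none)'}")
        if not is_cuda_build(torch.__version__):
            print("CPU-only build detected; rerun the reinstall command from this step")
```

A CUDA build tag alone does not prove the GPU is usable, so `torch.cuda.is_available()` from the verify command remains the real test.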

Step 3: Install soundfile (torchaudio backend)

Problem: torchaudio ships with no audio backends on Windows. Without one, generation completes but then crashes in torchaudio.save() with:

RuntimeError: Couldn't find appropriate backend to handle uri ... and format None.

Fix:

uv pip install soundfile

Verify:

.\.venv\Scripts\python.exe -c "import torchaudio; print(torchaudio.list_audio_backends())"
# Expected: ['soundfile']

Step 4: Install Flash Attention (Medium model only)

Problem: No official flash-attn wheels for Windows. Building from source requires MSVC + CUDA toolkit setup.

Solution: Use pre-built wheels from kingbri1/flash-attention.

Match your Python version (cp310 = Python 3.10, cp311 = 3.11, etc.):

# Python 3.10 + CUDA 12.8 + torch 2.7
uv pip install https://github.com/kingbri1/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu128torch2.7.0cxx11abiFALSE-cp310-cp310-win_amd64.whl
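To see which wheel name corresponds to your interpreter, the naming pattern of the file above can be reproduced in a few lines. This is only a sketch: the version constants are copied from the wheel name in this guide, the helper names are made up, and nothing here checks that a wheel with the generated name actually exists in the release.

```python
import sys

# Values copied from the wheel name used in this guide.
FLASH_ATTN_VERSION = "2.8.3"
CUDA_TAG = "cu128"
TORCH_TAG = "torch2.7.0"


def python_tag(major: int, minor: int) -> str:
    """CPython wheel tag, e.g. (3, 10) -> 'cp310'."""
    return f"cp{major}{minor}"


def wheel_name(major: int, minor: int) -> str:
    """Wheel file name following the pattern of the cp310 wheel above."""
    tag = python_tag(major, minor)
    return (
        f"flash_attn-{FLASH_ATTN_VERSION}+{CUDA_TAG}{TORCH_TAG}"
        f"cxx11abiFALSE-{tag}-{tag}-win_amd64.whl"
    )


if __name__ == "__main__":
    # Run with the venv interpreter so the tag matches the environment.
    print(wheel_name(sys.version_info.major, sys.version_info.minor))
```

Compare the printed name against the files listed on the release page before installing.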

Verify:

.\.venv\Scripts\python.exe -c "import flash_attn; from flash_attn import flash_attn_func; print('Version:', flash_attn.__version__, '| flash_attn_func:', flash_attn_func)"

Small model users: Flash Attention is optional. The Small model falls back to standard attention automatically.

Step 5: Download the model

hf download stabilityai/stable-audio-3-medium

Or via Python:

.\.venv\Scripts\python.exe -c "from huggingface_hub import snapshot_download; print(snapshot_download('stabilityai/stable-audio-3-medium'))"

The model is ~17 GB and downloads to %USERPROFILE%\.cache\huggingface\hub\.
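Before starting a download of this size, it can help to confirm there is enough free disk space and to know where the files will land. The sketch below assumes the default cache location and the `models--<org>--<name>` folder naming that the Hugging Face cache uses. The helper names are hypothetical, and the ~17 GB figure is treated as GiB for a small safety margin.

```python
import shutil
from pathlib import Path

MODEL_SIZE_GB = 17  # approximate size quoted in this guide


def repo_cache_folder(repo_id: str) -> str:
    """Folder name the Hugging Face cache uses for a model repo."""
    return "models--" + repo_id.replace("/", "--")


def has_room(free_bytes: int, needed_gb: float) -> bool:
    """True if free_bytes covers needed_gb (counted as GiB)."""
    return free_bytes >= needed_gb * 1024**3


if __name__ == "__main__":
    hub = Path.home() / ".cache" / "huggingface" / "hub"
    print("Model folder:", hub / repo_cache_folder("stabilityai/stable-audio-3-medium"))
    # Checks the drive holding the user profile, where the default cache lives.
    free = shutil.disk_usage(Path.home()).free
    print(f"Free space: {free / 1024**3:.1f} GiB")
    print(f"Enough for ~{MODEL_SIZE_GB} GB:", has_room(free, MODEL_SIZE_GB))
```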

Step 6: Run

uv run python run_gradio.py --model medium

Opens a Gradio UI with a shareable link. The Medium model uses ~18 GB VRAM on an RTX 4090.

Quick verification checklist

# Run all checks at once:
.\.venv\Scripts\python.exe -c "
import torch
print('torch', torch.__version__, '| CUDA:', torch.cuda.is_available())
import torchaudio
print('torchaudio backends:', torchaudio.list_audio_backends())
import flash_attn
print('flash_attn', flash_attn.__version__)
import stable_audio_3
print('stable_audio_3: OK')
"

Expected output:

torch 2.7.1+cu128 | CUDA: True
torchaudio backends: ['soundfile']
flash_attn 2.8.3
stable_audio_3: OK
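The one-liner above stops at the first failing import, so a single missing package hides the state of the rest. A small sketch that reports every missing package in one pass, without importing anything heavy, might look like this (the module list mirrors the packages used in this guide, and `find_missing` is a made-up helper name):

```python
import importlib.util

# Top-level modules this guide installs or relies on.
REQUIRED = ["torch", "torchaudio", "soundfile", "flash_attn", "stable_audio_3"]


def find_missing(modules: list[str]) -> list[str]:
    """Return the module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```

This only confirms the packages can be found. It does not check versions or CUDA availability, so it complements the checklist rather than replacing it.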

Known issues

Issue	Cause	Fix
`torch+cpu` installed by `uv sync`	pyproject.toml CUDA index only mapped for Linux	Reinstall with `--index-url .../cu128`
`torchaudio.save()` crashes with backend error	No audio backend on Windows	`uv pip install soundfile`
Flash Attention won't install	No official Windows wheels	Use kingbri1 pre-built wheels
`hf download` hangs with lock errors	Multiple download processes fighting	Kill all python processes, delete `.cache/huggingface/hub/.locks/...`, retry
Gradio CSS preload warning in browser	Gradio CDN issue, cosmetic	Ignore
Generation uses ~18 GB VRAM	Model + autoencoder + text encoder + activations	Normal for Medium in fp16

VRAM breakdown (Medium model, fp16)

Component	Approx Size
DiT (1.4B params)	~2.8 GB
SAME-Large autoencoder	~2.8 GB
T5Gemma text encoder	~1.5 GB
Activations / KV cache	~5-8 GB
CUDA context + overhead	~2-3 GB
Total	~14-18 GB (observed ~18 GB)

Minimum GPU: RTX 4090 (24 GB) or equivalent.
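The arithmetic behind the breakdown can be sketched as follows. fp16 stores 2 bytes per parameter, so 1.4B parameters come to about 2.8 GB, and summing the rows gives a range whose upper end matches the ~18 GB figure. The numbers are copied from the table above; the function names are made up for this example, and GB is taken as 10^9 bytes.

```python
def weights_gb(params: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB (10^9 bytes); fp16 uses 2 bytes per parameter."""
    return params * bytes_per_param / 1e9


# (low, high) estimates in GB, taken from the breakdown table.
COMPONENTS = {
    "DiT (1.4B params)": (weights_gb(1.4e9), weights_gb(1.4e9)),
    "SAME-Large autoencoder": (2.8, 2.8),
    "T5Gemma text encoder": (1.5, 1.5),
    "Activations / KV cache": (5.0, 8.0),
    "CUDA context + overhead": (2.0, 3.0),
}


def total_range(components: dict[str, tuple[float, float]]) -> tuple[float, float]:
    """Sum the low and high estimates, rounded to one decimal place."""
    low = sum(lo for lo, _ in components.values())
    high = sum(hi for _, hi in components.values())
    return round(low, 1), round(high, 1)


if __name__ == "__main__":
    low, high = total_range(COMPONENTS)
    print(f"Estimated VRAM: {low} to {high} GB")
```

The upper estimate leaves only a few GB of headroom on a 24 GB card, which is why the guide treats that as the minimum.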
