theDAW User Guide

by GANTASMO

This guide documents theDAW end to end, from generating a piece out of a prompt through arranging it, mixing it, performing it live, and exporting the finished master. The in-app Docs button renders this guide as an interactive modal with Markdown download, print, and PDF export.

Repository Anatomy
Architecture
Installation
Launching the Application
UI Shell
MAKE Tab: Audio Generation
EDIT Tab: Waveform Editor
MIX Tab: Effects and Mastering
DJ Tab: Performance Mixer
VJ Tab: Live Visuals
TRAIN Tab: LoRA and Autoencoder
LEARN Tab: Lineage and Genealogy
Library
Step Sequencer
Piano Roll
Bottom Panel Tabs
Player Footer
Processing Log
Backend API Reference
Python Pipeline Reference
Models
LoRA Adapter Types
Troubleshooting
Development Workflows
Feature Coverage and Screenshot Evidence

1. Repository Anatomy

Layer	Path	Description
ML pipeline	`stable_audio_3/`	Upstream Stability AI code: DiT, SAME autoencoder, all samplers, LoRA parametrization, distribution-shift schedules, T5Gemma conditioner.
FastAPI backend	`backend/server.py`	HTTP wrapper around the pipeline. Async job queue for generation; synchronous endpoints for studio effects and model introspection. Port 8600.
Backend modules	`backend/modules/`	Plugin system. Each subdirectory provides `module.json` (name, API prefix, enabled flag) and `router.py` (FastAPI APIRouter). The loader (`backend/modules/loader.py`) discovers and mounts every enabled module at startup; a failed module is logged and skipped without stopping the server. Ten modules ship in the repo: analysis, chimera, controllervision, effects (mounted at `/api/studio`), library, midi, settings, stems, vj, and ytimport.
React app	`frontend/`	Tailwind 4 + React 19 + Zustand 5 + Vite 6. Multi-tab DAW interface. Proxies `/api/*` to the backend. Port 5173 in development.

2. Architecture

Request flow:

The browser (:5173 in dev, static bundle in production) calls /api/....
The FastAPI backend (:8600) handles HTTP and dispatches generation and effects work.
StableAudioModel composes three parts:
- T5Gemma text encoder (models/conditioners.py)
- DiT diffusion transformer (models/dit.py to models/transformer.py)
- SAME autoencoder (models/autoencoders.py)

Two-stage generation

The DiT generates 256-dimensional latents at 1/4096 the source audio rate. The SAME autoencoder decodes those latents to 44.1 kHz stereo audio. Both stages share the same checkpoint for bundled models. Standalone SAME checkpoints (same-s, same-l) are interchangeable with the bundled versions and reuse the cached full checkpoint when available.

Checkpoint flavors

ARC: post-trained checkpoints (small, medium). Tuned for 8-step inference with cfg_scale=1. Recommended for all generation tasks.
RF: rectified-flow base checkpoints (small-rf, medium-rf). These require around 50 steps and cfg_scale=7 at inference, and serve as the starting point for LoRA training.

Web Audio engine

All in-browser audio (library playback, waveform editor preview, step sequencer, piano roll live notes) routes through a single shared AudioContext graph: source nodes feed a master gain node, which feeds an analyser node, which feeds the destination. The real-time spectral analyzer reads from the analyser node continuously. The player footer volume and mute controls drive the master gain.

3. Installation

Base Python environment

uv sync

Optional Linux CUDA wheels (Medium model)

uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu126

With dev dependencies

uv sync --group dev

Windows-specific requirements

pyproject.toml includes CUDA 12.8 wheel sources for torch, torchaudio, and Flash Attention under [tool.uv.sources]. Running uv sync on Windows installs all of them automatically. No additional flags or manual wheel downloads are required for Python 3.10 with CUDA 12.8.

On a different CUDA or Python version, install PyTorch manually:

uv pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu128

soundfile is included in the base dependencies. Flash Attention is conditionally installed (sys_platform == 'win32' and python_version < '3.11'); the wheel URL in pyproject.toml targets Python 3.10 with CUDA 12.8 and torch 2.7.0. For other combinations, download a matching wheel from kingbri1/flash-attention.

Full walkthrough: docs/windows/setup-guide.md.

Frontend dependencies

cd frontend
npm install

package-lock.json is committed. Use npm ci to reproduce the exact dependency tree; npm install is acceptable in development.

4. Launching the Application

One-shot launcher (Windows)

.\start-dev.bat

Sequence:

Kills any stale process on ports 5173 and 8600.
Starts the FastAPI backend (uvicorn backend.server:app --host 0.0.0.0 --port 8600 --reload) in a new terminal window.
Waits 3 seconds for the backend to bind.
Starts the Vite dev server in a second terminal window.
Opens http://localhost:5173 in the default browser.

The --reload flag on the backend triggers an automatic restart when backend/server.py is modified.

Manual launch

# Terminal 1: backend
uv run uvicorn backend.server:app --host 0.0.0.0 --port 8600 --reload

# Terminal 2: frontend
cd frontend && npm run dev

vite.config.ts proxies all /api/* requests to http://localhost:8600, so same-origin fetches require no CORS configuration in development.

5. UI Shell

The application window is divided into three persistent regions:

Left panel (resizable): brand, tab controls content, run CTA, processing log.
Center workspace: the center tab bar at the top, the active workspace below it, and a collapsible bottom multi-tab panel in the global footer.
Player footer: a fixed transport bar across the bottom of the app.

Full-width header: a fixed bar spanning the entire window width. It contains the left-panel collapse/reveal toggle (chevron), the theDAW logo dot, a global search input, and action buttons (Docs, mobile access QR/link, Settings, User avatar, AI Assistant orb).

Left panel resize: drag the vertical handle on the panel's right edge. Range: 300 px to 500 px.

Left panel collapse: click the chevron in the header. The workspace expands to full width. Click again to restore.

Center tab switching: the active workspace is controlled by the center tab bar in the locked order MAKE / EDIT / MIX / DJ / VJ / TRAIN / LEARN (CENTER_TABS). Each tab carries its own accent color. Legacy navigation targets such as create, advanced, edit, and train are translated into these center tabs, so assistant actions, library sends, and older shortcuts still route correctly.

DJ and VJ persistence: the DJ and VJ tabs host live performance state (a multi-deck mixer and an embedded WebGL VJ iframe). Once first visited, they stay mounted and only toggle CSS visibility on tab switch, so deck state and the warm VJ pipeline survive a switch. A backgrounded VJ tab is told to park its render loop, so it costs close to 0% GPU while hidden.

Right-side library rail: the Library is a collapsible right-side panel. Use the library button at the right edge of the center tab bar to expand or collapse it; the rail width persists between 280 px and 640 px. This lets MAKE, MIX, LEARN, VJ, and editor workflows stay visible while selecting or routing library material.

Docs modal: click the Docs button in the header to open this guide in-app. The modal supports anchor links, syntax-highlighted Markdown tables and code blocks, raw Markdown download, and browser print/PDF export. The assistant RAG index is built from the same guide, so doc updates improve in-app help and assistant context together.

Mobile access share: click the QR/link button in the header to copy or scan the current LAN or tunnel URL for mobile performance access. This is useful with the VJ tab and any browser-based controller or viewer.

AI Assistant panel: click the orb icon in the header to open the collapsible assistant panel. It streams chat from any configured LLM provider with RAG context sourced from this user guide, supports file attachments and voice input, and exposes provider and key-pool controls. See §19.11 for the API reference.

Viewport scaling: the UI applies a CSS zoom factor based on viewport width (0.85 below 1440 px; 0.95 at 1440 to 1919 px; 1.1 at 1920 px and above). Shell height calculations compensate so the layout tiles cleanly down to the footer.

6. MAKE Tab

Purpose

The MAKE tab (AdvancedView) submits audio generation jobs to the backend and displays their output. It supports all three inference modes: text-to-audio, audio-to-audio, and inpainting/continuation. It is also where the Chimera engine fuses several sources into one init signal.

6.1 Primary Synthesis / Prompt

Prompt (required). Plain-text description of the desired audio content, instrumentation, or sonic texture.
Negative prompt (optional). Aspects or characteristics to suppress.
Magic prompt button (sparkle icon, bottom-right of the prompt box). Inserts a sample prompt when the prompt field is empty. When text already exists, clicking the sparkles icon sends it to the AI Assistant to optimize the prompt for Stable Audio conditioning.

6.2 Generation Parameters

Six controls arranged in a 3-column grid:

Control	Type	Notes
Model	Dropdown	`small`, `medium`, `small-rf`, `medium-rf`. Selecting an `-rf` variant automatically sets Steps to 50 and CFG to 7.0.
Duration (s)	Integer	Total output length in seconds. Small model: max 120 s. Medium and Large: max 380 s.
Batch	Integer	Number of simultaneous variations. Each variation produces a distinct library entry with its own seed.
Steps	Integer	Sampler denoising steps. ARC default: 8. RF default: 50.
CFG	Float	Classifier-free guidance scale. ARC default: 1.0. RF default: 7.0. Higher values increase adherence to the prompt and can introduce artifacts.
Seed	Integer + reroll button	Use −1 for a random seed on each run. The reroll button generates and displays a new random seed without submitting a job.

6.3 Advanced Generation Panel

The Advanced Generation Panel provides a dense layout for deeper configuration.

Output Settings controls automatic playback and automatic downloading behavior.
Quick Actions route generated audio directly to the Waveform Editor, Init Audio, or Inpainting modules.
Templates Panel saves and restores full generation parameter sets.
Saved Prompts Dropdown maintains a history of frequently used user-defined prompts.
Spectrogram Viewer displays Mel, STFT, Chromagram, and CQT visualizations of generated audio.

Templates and saved prompts are stored in browser local storage for rapid iteration. The prompt sparkle action calls the assistant prompt enhancer (/api/assistant/chat) when text already exists, or inserts a sample prompt when the field is empty.

6.3.1 Chimera Fusion Stack

Chimera fuses two or more audio clips into a single init signal before generation. Source clips can arrive from the Library toolbar and context menu, the Media Bucket, the microphone recorder, or other send targets. This is the engine that turns a folder of loops, a hummed melody, and an imported break into one coherent starting point.

Control	Description
CHIMERA banner	Always-visible status strip at the top of MAKE. Shows the stack count and jumps to the Init Audio card.
Target BPM	`auto` uses the selected base clip BPM, the median detected BPM, or a 120 BPM fallback. A numeric value forces a target.
Base clip	Optional reference clip. When set, its BPM and duration can drive the fusion target.
Noise / influence	Per-clip slider mapped to the backend `weights`; higher noise means less influence on the fused output.
Align mode	`start` aligns clips from time zero. `downbeat` trims to detected first beats. `weave` (the interweave mode) schedules chunks of each clip into an arrangement arc, interleaving them bar by bar with controlled polyphony.
Weave controls	Chunk bars, total bars, and max polyphony shape the generated arrangement when `align_mode=weave`.

Rendering posts to POST /api/chimera/mashup, normalizes inputs to 44.1 kHz stereo, detects tempo and beats, time-stretches the clips, and returns a WAV file plus metadata (target_bpm_used, per-clip stretch ratios, warnings). The fused result becomes the Init Audio for the next generation.

6.4 Init Signal / Conditioning

Audio-to-audio mode. Upload a source file through the dropzone to condition the model on existing audio. Any audio works as an init signal: a generated clip, an imported track, a mic take, a stem, or a Chimera fusion.

Init Noise (0–1) controls the ratio of source signal to random noise injected at the start of the denoising trajectory. Lower values preserve more of the source character; higher values grant the model more generative freedom.
Type is Audio (standard) or RF-Inv (RF-Inversion, meaningful only with -rf model variants).

Removing the source file returns the form to text-to-audio mode.

6.4.1 Microphone Recorder

The browser-side Mic Recorder uses navigator.mediaDevices.getUserMedia plus MediaRecorder to capture live audio without leaving the app. Supported browser encodings are selected in order: WebM/Opus, OGG/Opus, MP4/AAC, then WAV fallback.

After recording, the review card can:

play and pause the take inline;
send it to a new editor track or the first editor track tail;
send it to Init Audio or Inpaint Audio;
import it into the disk-backed Library through POST /api/library/import, so stems, MIDI conversion, lineage, and bundle downloads can run against it later.

6.5 Inpainting / Regen Region

Replaces a defined time window inside a source audio clip while preserving everything outside that window.

Enable toggle (section header) must be active for the inpainting payload to be submitted. It activates automatically when a source file is loaded.
Source dropzone accepts drag-and-drop or click-to-upload.
Waveform preview is rendered by WaveSurfer once a file is loaded.
Region selection: drag horizontally across the waveform to define the regeneration window. The Start, End, and Region Duration readouts update in real time.
Continuation: to extend audio beyond its current end, drag the region to the end of the waveform and set Duration to a value larger than the source length. The model fills the extension conditioned on the existing audio tail.

The form fields mask_start and mask_end are submitted in seconds relative to the start of the inpaint audio file.

6.6 LoRA / Adaptive Layers

Stack one or more LoRA adapters for the next generation. Each adapter row shows its name, a weight slider (0 to 1), and a remove button.

Current status: UI scaffolding. The /api/generate-jobs endpoint does not yet forward LoRA references to the pipeline. LoRA at inference is supported directly through the Python API (see §20.5).

6.7 Output Status Monitor

Appears below the accordion after a job is submitted or completed.

Binary progress bar (queued to running to completed).
Engine information chip (model, steps).
Inline audio player for the completed result.
Download and clear buttons.
Every completed result is auto-saved to the Library.

6.8 Run Generation

A sticky bar fixed at the bottom of the left panel submits the generation job to POST /api/generate-jobs. While a job is active, the button changes to a red ABORT button showing an estimated percentage. Clicking Abort cancels the polling loop; the backend job keeps running, but its result is discarded by the UI.

7. EDIT Tab

Purpose

The EDIT tab (WaveformEditor) is the multi-track audio composition surface. Clips are slices of source audio Blobs placed on a pixel-per-second timeline. Editing is non-destructive at the Blob level; rendering happens on demand through OfflineAudioContext.

7.1 Toolbar

Control	Description
ADD TRACK	Appends an empty track. Track names auto-inherit from the first clip placed on them and stay editable.
Move tool	Default tool. Drag clips horizontally to reposition in time; drag vertically to move between tracks.
Cut tool	Click inside a clip to split it at that position. The right half becomes a new clip referencing the same source Blob with an adjusted `offsetIntoSource`, preserving source alignment.
Snap	Off / 1/4 / 1/8 / 1/16. Quantizes drag and resize operations to note-grid intervals relative to the editor BPM.
Zoom in / out	Adjusts timeline resolution in pixels-per-second. Range: 5 to 400 px/s.
Delete	Removes the selected clip. The Delete and Backspace keys do the same when a clip is selected.
Mixdown name	Text input. Sets the output filename on COMMIT EDIT. If empty, the committed file defaults to `mixdown_<id>.wav`. The `.wav` extension is appended automatically when omitted.
COMMIT EDIT	Renders the composition to a WAV file. See §7.6.

7.2 Per-track Controls

Each track has a fixed-width header to the left of the timeline lanes:

Control	Description
Name	Editable text. Click to edit in place. Color matches the track's assigned hue.
M (Mute)	Silences the track in both preview and final render.
S (Solo)	Exclusive across all tracks: soloing a track mutes all others. Soloing a second track adds it to the active set.
× (Remove)	Deletes the track and all clips assigned to it.
Volume slider	Per-track gain, 0 to 1. Applied in both preview and offline render.
Pan slider	Per-track stereo position, −1 (full left) to +1 (full right). Applied through a `StereoPannerNode` in the offline render.

7.3 Per-clip Display and Handles

Each clip renders a downsampled waveform (240 bins, normalized) as a visual background, with a header label and a duration readout.

Resize handles are thin strips at the left and right edges. Dragging the left handle adjusts both startSec (timeline position) and offsetIntoSource (in-point into the source audio) together, so the audio content at the playhead position stays constant. Dragging the right handle adjusts only durationSec (out-point).

Fade handles are vertical semi-transparent lines with a circular grab point near the bottom edge. The fade-in handle sits inset from the left edge; dragging it right extends the fade-in duration. The fade-out handle sits inset from the right edge; dragging it left extends the fade-out duration. Both fades apply as linear gain ramps in the offline render and as gain.linearRampToValueAtTime automation during preview.

7.4 Inpainting from the Editor

Regenerate a sub-region of any clip with the model while keeping the surrounding audio intact.

Select a clip.
Switch to the Paintbrush tool (or right-drag inside a clip while in Move mode) to draw an inpaint selection region. The selection shows as a highlighted band across the clip.
Click INPAINT REGION to open the inpaint panel.
Set the prompt, number of steps, and seed.
Click Generate to submit an async job.

The submission process:

The visible clip region (from offsetIntoSource to offsetIntoSource + durationSec) is extracted from the source Blob as a cropped WAV through cropAudioBlob. This keeps behavior correct for trimmed or split clips regardless of their in-point.
Mask coordinates are computed relative to the start of the cropped audio: maskStart = selection.startSec − clip.startSec, maskEnd = selection.endSec − clip.startSec.
The cropped WAV and mask are sent to POST /api/generate-jobs.

On completion, the result enters a Review phase:

An inline audio player auditions the regenerated region.
Accept replaces the clip's source Blob with the result and clears the inpaint selection.
Discard dismisses the panel without modifying the clip.

7.5 Timeline Interaction

Playhead: the red vertical line. Click any empty area in the timeline to position it. The footer transport's skip-to-start button resets it to 0.
Deselect: click empty space in a track lane to deselect the active clip.
Scroll: horizontal scroll on the timeline area moves the view.
Status bar: shows live timecode as MM:SS.cs / MM:SS.cs (current / total), clip count, track count, and the selected clip's start-to-end range.

7.6 COMMIT EDIT

Renders the complete composition to a single 44.1 kHz stereo WAV.

Render process:

All non-muted (or exclusively soloed) tracks are collected.
Clip source Blobs are decoded in a temporary AudioContext (one decode per unique Blob, cached for clips sharing a source). A 15-second timeout guards against stalled decode operations.
An OfflineAudioContext is initialized at 44.1 kHz stereo with a duration equal to the total timeline length.
Each clip is scheduled through BufferSourceNode.start(clipStartSec, offsetIntoSource, durationSec). Per-track volume and stereo pan are applied. Fade-in and fade-out gain envelopes are automated with linearRampToValueAtTime.
OfflineAudioContext.startRendering() produces the final AudioBuffer.
A WAV Blob is encoded as 16-bit PCM (encodeWav).
The Blob is saved to the Library (source: 'editor-mixdown').
A browser file download is triggered automatically.

During the render, the COMMIT EDIT button shows an animated spinner and is disabled.

8. MIX Tab

Purpose

The MIX tab (MixPanel) is the single-screen effects and mastering workspace. It applies one or more of 24 FFmpeg-backed processors to a source file as an ordered chain. Processing is synchronous; the result is returned as binary audio and made available for inline playback, download, and routing back into the app.

8.1 Layout

MIX fills the viewport with no page scroll, in three bands:

Top: the source field. Click or drop audio anywhere in the waveform field to load or replace it. The header strip shows the filename plus peak, RMS, sample-rate, and duration stats.
Middle: three columns. The left rail holds effect categories and the Quick Master controls. The center column is the effect library. The right column is the active chain.
Bottom: the output waveform, the effects visualization region, and the scope and cymatics visualizer reacting to the shared player master gain.

The player footer is the process-chain transport.

8.2 Quick Master

Quick Master exposes the four parameters of a mastering_chain entry as live knobs (low boost, high boost, limiter ceiling, target LUFS). Applying it inserts a single mastering_chain entry into the chain once, then live-updates that entry's parameters as the knobs move.

8.3 Effect Catalog and Chain

The effect library is organized into categories (mastering, dynamics, EQ, tempo, cleanup, export) with list and tile view modes and color-coded groups. Add effects to the chain, toggle individual rows active or inactive, reorder rows, and clear the chain in one action. Each chain entry exposes bounded sliders and numeric fields sourced from PARAM_BOUNDS and EFFECT_CATALOG.

Effect key	Description	Macro-derived parameters
`mastering_chain`	EQ, limiting, and LUFS normalization	lowBoost, highBoost, limiterCeiling, targetLUFS
`compression`	Dynamic range compression	attack, decay
`highpass`	High-pass filter	cutoff frequency
`lowpass`	Low-pass filter	cutoff frequency
`volume`	Gain adjustment	output level
`tempo`	Time-stretch and playback rate	rate
`vocal_processing`	High-pass, presence boost, and LUFS normalization	highpassFreq, presenceBoost, targetLUFS
`lofi_vinyl`	Bit degradation and low-pass	degradation, lowpassFreq
`stereo_widener`	Haas-effect widening	delayMs
`reverb_delay`	Combined reverb and delay	delayMs, decay, reverbDecay
`sub_exciter`	Sub-bass and treble harmonic excitation	subBoost, trebleBoost
`phase_isolation`	Mid-side phase cancellation	cancelAmount
`eq_mid`	Single-band parametric EQ	frequency, width (Q), gain
`loudnorm`	ITU-R BS.1770 loudness normalization	targetLUFS, truePeak
`pitch_shift`	Semitone pitch transposition	shift (cents)
`delay`	Stereo delay	leftMs, rightMs
`echo`	Echo with feedback	delayMs, decay
`fade`	Linear fade-in and fade-out	fadeInDuration, fadeOutDuration
`denoise`	Noise floor reduction	noiseReduction
`declick`	Impulse noise removal	windowSize
`silence_remove`	Leading and trailing silence trimming	threshold
`export_flac`	Lossless FLAC export	compressionLevel
`export_mp3`	Lossy MP3 export	bitrate
`export_aac`	AAC export	bitrate
`export_opus`	Opus export	bitrate

All effects are dispatched to ffmpeg through subprocess.run. Server-side bounds checks apply to every parameter; out-of-range values return HTTP 400.

8.4 Source, Output, and Routing

Source loads through the top field by click, drop, or a Library drag using the application/x-stabledaw-library-id transfer protocol, so a persisted track becomes the source without a manual download.
Output format selector: wav, flac, ogg, mp3, aac, or opus.
Process submits to POST /api/studio/process. The binary audio response is wrapped in a Blob URL for inline playback.
The output result can be downloaded, sent to the EDIT timeline, or sent to Inpaint Audio in MAKE.

8.5 Process History

The last processing invocations are retained in the store. Any history item can be selected and promoted back to the current source.

9. DJ Tab

Purpose

The DJ tab (DJView) is a live two-deck performance mixer built on the SLIDE control surface. It loads tracks from the Library or any source, beatmatches and key-locks them, mixes through a full center console, and hands sets off to the VJ for synchronized visuals. The Web Audio engine (djEngine.ts) runs the signal path per deck: source, optional time-stretch, delay compensation, trim, three-band EQ, filter, gain and crossfade, the DJ master bus, then the shared player master.

9.1 Waveform Hero and Decks

The two deck waveforms span the full width at the top, each with a beatgrid, hotcue markers, and a moving playhead. A thin per-deck header shows the title, detected BPM, musical key and Camelot value, and the SYNC control. Below the hero, Deck A and Deck B flank the center mixer.

9.2 Per-deck Transport and Cueing

Each deck provides play and cue, four hotcues, loops with loop-roll and slip mode, beat-jump, and a key-lock toggle. SYNC engages a phase-locked beatmatch against the other deck and can hold the lock as either deck drifts. Key-lock holds musical pitch constant while tempo changes, so a pitch-shifted track stays in key. Quantize snaps cue and loop actions to the beatgrid.

9.3 Center Mixer

The mixer is the console between the decks. Each channel has a gain knob, a three-band EQ (high, mid, low), a single-knob filter that sweeps low-pass below center and high-pass above center, and a volume fader. The two pitch faders sit as the outer columns of the mixer. A crossfader runs along the bottom, and the toggle row (Quantize, Auto-gain, Limiter, MIDI) sits above the gain knobs. A master limiter protects the DJ master bus.

9.4 Live Stems and FX

Each deck has an FX and stems rack. Live stem faders ride the separated drums, bass, other, and vocals derived from Demucs, so a part can be pulled out or pushed up during playback. The FX rack adds per-deck effects on top of the EQ and filter.

9.5 Cue / Headphone Output

Per-deck pre-listen routes a deck to a separate output device through setSinkId, so a deck can be auditioned in headphones while the other plays to the main output.

9.6 MIDI Learn

DJ MIDI-learn binds a hardware controller to deck, mixer, and hotcue actions. It reuses the controller mapping store, so a learned layout persists. Controller recognition runs in tiers: a profile library with scored auto-detect, learn-by-capture for any rig, and a planned photo-layout inference path.

9.7 Automix, Sampler, and Side List

Automix sequences a setlist hands-free, beatmatching each transition.
Sampler bank: drag a clip onto a pad to load a one-shot, then trigger pads during a set.
Side List: a play-next staging lane above the browser. Stage upcoming tracks, reorder them, and pull them onto a deck when ready.

9.8 Browser and Source Tree

The browser loads tracks onto a deck by drag or click. The source tree exposes real filtered views (Library, Favorites, Generated, Imports) with live counts, plus an Online Download source and saved Sets. Send to VJ pushes the current set or a single track to the VJ archive.

9.9 Design Mode

A floating Edit Layout control turns on Design Mode. In Design Mode, panels and the mixer control groups can be dragged to reorder and dragged at their borders to resize, snapped to a grid. The layout persists across sessions, a Reset restores the default, and a Copy action exports the layout as JSON to bake in as the new default.

10. VJ Tab

Purpose

The VJ tab (VJView) embeds the GANTASMO-LIVE-VJ visual engine as a live, audio-reactive instrument. The backend vj module spawns the engine's dev server lazily, and the tab fetches its live URL from /api/vj/url, so the port is never hardcoded. First launch runs npm install in the VJ project and can take a minute; later launches are fast.

10.1 Inputs

A toolbar row toggles which signals feed the visuals. At least one input stays active at all times.

Mic: the VJ engine captures microphone input directly and requests browser permission on first use.
Audio: an audio bridge reads theDAW master analyser every animation frame, derives bass, mid, high, and volume buckets, and posts them to the engine at around 30 fps. The chip shows the live bridge frame rate.
MIDI: controller events from the global MIDI bus are forwarded into the engine.
Camera: a toggle switches the visual source between the live webcam and the clip or memory buffer. The button reflects the real source through a state echo and shows any camera error.

A master MIDI gate turns Web MIDI on or off for the whole app. When off, the app never requests Web MIDI access, so no browser permission prompt appears.

10.2 Pop-out and Mobile

Pop out opens the visuals in a separate window. Drag it onto a second monitor for live performance while theDAW keeps running on the main display. Pop back in returns it to the tab. Closing the window manually snaps back automatically.
Mobile: when the machine has a LAN address, the Mobile button shows a copyable URL and a QR code so a phone or tablet on the same Wi-Fi can open the visual output.

10.3 Bridges

The tab wires several bridges to the engine through postMessage:

Playback: the player footer Play and Pause buttons drive the engine's video element, and the engine echoes its state back so the footer icon stays in sync.
SET hand-off: a set or single track pushed from the DJ tab arrives in the engine's archive bucket. The toolbar shows a pending state, then confirms once the engine acknowledges the set.
Control sync: the engine publishes its control manifest, which appears as glass-capsule faders in the SLIDE bottom-panel tab. Moving a SLIDE fader updates the engine, and a move inside the engine updates SLIDE, in both directions.
Track metadata: the current track title, model, source, duration, and play state are posted so the engine can sync its own readouts.
Visibility: when the tab is hidden, the engine parks its render loop, so a warm but backgrounded VJ tab costs close to 0% GPU.

10.4 Export

The visual engine records performances and the backend transcodes the recording to the chosen codec. The saved path is reported back and surfaced in the Processing Log. Exports are written to the local exports/ folder and are never committed to the repository.

11. TRAIN Tab

Purpose

The TRAIN tab (TrainingView) is the interface for LoRA fine-tuning configuration and autoencoder round-trip validation. Some endpoints are fully implemented in the backend; others are stubs in this fork pending integration.

11.1 Target Architecture

Field	Description
Module name	Output checkpoint filename label.
Target module	Which submodule to attach LoRA to (`attn_kv` is the default and most-tested).
Epochs / Steps	Training step budget.
Rank	LoRA rank. Controls the number of trainable parameters per layer.
Alpha	Scaling factor. Effective update scale is alpha divided by rank. Setting alpha equal to rank gives a scale of 1.0.
Dataset path	Server-side filesystem path containing audio files and paired text prompts.

11.2 Pre-encode Workflow

Pre-encoding a dataset to latents before training accelerates iteration. Submit a dataset path and output path; the backend job runner calls pre_encode.py against the dataset.

Backend status: stub, returns HTTP 501. Use python -m stable_audio_3.scripts.pre_encode directly until this is wired.

11.3 Autoencoder Round-trip

Upload an audio file; the backend encodes it to base64-serialized latents, then decodes them back to audio for reconstruction quality verification.

Backend status: /api/autoencoder/info returns an empty list, so the TRAIN tab displays "no autoencoders available." Encode and decode endpoints return HTTP 501. The frontend intercepts these responses and shows specific messages in the UI instead of generic network errors.

11.4 Job Polling

Long-running training jobs are tracked through GET /api/jobs/{id}, polled at 1-second intervals. The logs field of the response is rendered as streaming console output.

12. LEARN Tab

Purpose

The LEARN tab (LineageView) visualizes the genealogy of every asset in the Library. The backend tracks parent and child relationships across generated, imported, stem, MIDI, and Chimera-derived assets, and LEARN renders that data as interactive graphs. This is the data-visualization view onto how a piece came to be, from its source clips through every transformation.

12.1 Views

LEARN has three views, selected from the header:

Track: the lineage tree rooted at one selected entry, showing its direct ancestry and descendants.
Genealogy: the full library-wide family tree in 2D.
3D graph: the complete genealogy as a force-directed WebGL graph, including virtual nodes for stems, MIDI files, and external source labels.

12.2 3D Graph Controls

The 3D graph has an Appearance panel with coordinated visual presets (for example particle cloud and galaxy), each swapping a bundle of node, edge, and force settings. A fullscreen toggle expands the graph to the full viewport, and a footer legend maps edge colors to relationship kinds (generated, imported, stem, MIDI, Chimera). The graph data comes from GET /api/library/_graph/all, and a single-entry tree comes from GET /api/library/{entry_id}/lineage.

13. Library

Purpose

The Library is persistent storage for generated, imported, processed, recorded, and fused audio. It is a collapsible right-side rail available across the workspace. The default provider is backend-local: metadata is mirrored through /api/library/*, audio is stored on disk under data/generations/, and playback uses range-streamed file responses so large tracks scrub without loading the full file into memory. The frontend keeps transient Blob caches for efficient repeated use within a session.

13.1 Automatic Entry Creation

Every successful generation in the MAKE tab produces one or more library entries. Batch jobs produce N entries with IDs ${jobId}_0, ${jobId}_1, and so on.

Each entry stores:

{
  id: string,
  title: string,
  prompt: string,
  negativePrompt: string,
  model: string,
  duration: number,       // seconds
  steps: number,
  cfg: number,
  seed: number,
  audioBlob: Blob,
  mimeType: string,
  timestamp: string,      // ISO 8601
  favorite: boolean,
  rating: number | null,
  tags: string[],
  notes: string,
  source: 'generate' | 'editor-mixdown' | 'bucket'
}

Waveform editor mixdowns, MIX outputs, mic recordings, imports, and Chimera renders can also be saved here. Server records expose audio_url, audio_filename, file_size_bytes, mutable favorite, rating, tags, notes, and optional chimera_sources fields. User edits are persisted with PATCH /api/library/entries/{id}.

13.2 List and Grid Views

Toggle between a dense List view (one row per entry) and a Grid view (tile cards) through the icons in the section header. List view shows title, prompt preview, model chip, duration, date, file size, and a per-entry action cluster.

13.3 Search, Filter, Sort

Search filters across title, prompt, model, tags, and notes at the same time.
FAVS toggle restricts the view to favorited entries.
Sort by Newest (timestamp descending), Duration (longest first), or Title (alphabetical).

13.4 Per-entry Controls

Control	Description
Play / Pause	Loads and plays the entry through `playerStore`. Pausing one entry while another is active stops playback globally.
Favorite star	Toggles the favorite flag; persisted to the backend immediately.
Download	Triggers a browser file download of the audio.
Delete	Removes the entry from the backend store and the in-memory store.
Scissors icon	Decodes the audio, computes 240-bin waveform peaks, and appends the clip to the first available waveform editor track. If no tracks exist, a new track is created.
Row click	Selects the entry; the Details panel in the bottom tab bar reflects the selection.

Right-clicking an entry opens the per-row context menu. Depending on installed modules and selected entry state, it can send to Init, send to Inpaint, run analysis, separate stems, convert to MIDI, download a bundle, show lineage, or delete the entry.

13.5 Bundle Downloads and Lineage

The Library exports a self-contained ZIP bundle from GET /api/library/{entry_id}/bundle. Bundles include the audio file, metadata, analysis rows, stem rows, MIDI rows, and lineage edges when available.

Lineage endpoints build parent and child graphs for generated, imported, stem, MIDI, and Chimera-derived assets:

GET /api/library/{entry_id}/lineage?depth=4 returns nodes and edges around one entry.
GET /api/library/_graph/all returns the complete local genealogy graph, including virtual nodes for stems, MIDI files, and external source labels.

The LEARN tab visualizes this graph in 2D and 3D, with appearance options, fit and reset controls, and node details.

13.6 Stem Separation

The stems module mounts at /api/stems and runs a sidecar-backed separation pipeline. Library entries can split into 2, 4, 6, or 12 stems; progress is persisted in the library database, and stem audio is served through GET /api/library/stems/{stem_id}/audio for editor, init, and inpaint routing.

Important endpoints:

GET /api/stems/probe checks tool availability without spawning the sidecar.
POST /api/stems/install installs integration dependencies into the configured Python environment.
POST /api/stems/start and POST /api/stems/stop manually control the sidecar.
POST /api/stems/{entry_id}/run separates one library entry.
GET /api/stems/{entry_id}/progress polls phase, message, progress, and task_id.
POST /api/stems/{entry_id}/abort requests cancellation at the next poll tick.

13.7 MIDI Conversion

The MIDI module mounts at /api/midi and converts full tracks or separated stems to Standard MIDI Files. It supports installable engines such as basic_pitch and piano_transcription_inference. A converted file can flow straight into the Piano Roll or Step Sequencer, so a generated or imported track becomes editable notes.

Important endpoints:

GET /api/midi reports capabilities and available engines.
POST /api/midi/install?engine=basic_pitch installs a conversion engine.
POST /api/midi/{entry_id}/run?from_stems=true runs conversion for one entry.
GET /api/midi/{entry_id} lists MIDI rows for an entry.
GET /api/midi/file/{midi_id} streams .mid bytes for import into Piano Roll or Step Sequencer.

13.8 Library Analysis

A stats footer shows the total entry count, the favorites count, cumulative storage size, and cumulative playback duration.

13.9 Empty State

Shown until the first generation. It contains a Go generate something button that switches the active workspace to MAKE.

14. Step Sequencer

Purpose

A 16-step drum machine driven by a BPM clock, with five synthesized voice types. All audio routes through the shared Web Audio engine. It is available as the Sequence tab in the bottom panel.

14.1 Transport

Control	Description
Tempo (BPM)	Clock rate, 40 to 240 BPM. Each step represents a 16th note.
Play / Stop	Starts and stops the step clock. The first press unlocks the `AudioContext` (browser autoplay policy).
Random Fill	Randomizes the step pattern for every track at once.
Clear	Sets all step buttons to off for all tracks.
Add Track (+)	Appends a new track with a synthesized voice.

14.2 Voice Synthesis

Each track's voice is synthesized on the fly through the Web Audio API:

Voice	Synthesis method
kick	Pitched sine oscillator with a rapid frequency sweep (pitch drop) and an exponential gain decay.
snare	Bandpass-filtered noise burst combined with a short tonal sine body.
hat	White noise through a high-frequency bandpass filter with a short gain decay.
tone	Sawtooth oscillator through a low-pass filter with an ADSR envelope. Frequency set by the track's `freq` parameter.
noise	White noise through a frequency-dependent low-pass filter with a short decay.

All voices share the same triggerVoice function, which accepts a BaseAudioContext and a destination node. The same synthesis code drives both live playback and offline rendering.

14.3 Per-track Controls

Control	Description
Name	Editable text label.
Voice chip	Displays the current voice type. Click to cycle through `kick`, `snare`, `hat`, `tone`, `noise`.
Volume slider	Per-track gain, 0 to 1.
Step buttons (16)	Toggle individual steps on and off. Buttons on beats 1, 5, 9, 13 carry a distinct visual emphasis.
Preview (target icon)	Hover-revealed. Triggers the voice once without starting the clock.
Remove (trash icon)	Hover-revealed. Deletes the track.

14.4 Send to Editor

Renders the current pattern offline to a WAV Blob through OfflineAudioContext, then appends it to the waveform editor as a new clip on a new track. The clip's source kind is set to 'audio'.

14.5 MIDI Export

The sequencer exports Standard MIDI Files:

Single mixed track writes all active voices into one MIDI track.
One track per voice writes each sequencer lane as a separate MIDI track.
Bars to render controls the offline audio render length when sending to the editor.

These exports use the same PPQ timing constants as live playback (PPQ = 480, one 16th note equals PPQ / 4), so MIDI files line up with the sequencer grid.

15. Piano Roll

Purpose

A MIDI-style note editor for melodic and harmonic content. Notes are placed on a chromatic grid, rendered to audio through a sawtooth-and-filter synthesizer, and exported as MIDI or sent to the waveform editor. It is available as the Piano tab in the bottom panel.

15.1 Grid and Keyboard

A vertical chromatic keyboard (MIDI notes 0 to 127, configurable visible range) occupies the left axis. The horizontal grid represents time in 16th-note steps at the configured BPM. Black and white keys are visually distinguished; note labels appear on C notes.

15.2 Note Editing

Action	Result
Click an empty grid cell	Places a new note at that pitch and step.
Drag a placed note horizontally	Repositions the note in time.
Drag the right edge of a note	Resizes the note's duration.
Click a placed note	Selects it.
Delete / Backspace	Removes the selected note.

15.3 Playback

The Play button starts a step-based clock that advances currentStep and triggers each note whose step index matches, routed through the shared engine AudioContext and master gain node. Output is audible in the spectral analyzer. Stop halts the clock.

15.4 BPM and Grid Length

BPM sets the tempo for playback and offline render. Range: 40 to 240.
Total Steps defines the loop length in 16th-note steps. Longer values extend the grid horizontally.

15.5 MIDI Import and Export

Import MIDI parses a .mid file through parseMidi and replaces the current note list with the imported content. Tempo and note mappings are preserved.
Export MIDI serializes the current note list to a standard MIDI file through downloadMidi and triggers a browser download.

15.6 Send to Editor

Renders the current note pattern to a 44.1 kHz stereo WAV through OfflineAudioContext. The rendered clip is appended to the waveform editor on a new track. The clip's metadata records sourceKind: 'piano-roll', the note list, BPM, and grid length, so it can be re-opened for editing.

15.7 Edit in Piano Roll

Clips in the waveform editor whose sourceKind is 'piano-roll' display an Edit in Piano Roll action. Triggering it loads the clip's stored note list and BPM into the piano roll store and switches the bottom panel to the Piano tab.

16. Bottom Panel Tabs

The bottom panel is collapsible and vertically resizable (drag the grip handle above it), and a maximize toggle expands any tab to fill the window. Six tabs are available.

16.1 Visualize: Real-time Spectral Analyzer

Live visualization of the shared Web Audio engine's output, reading from the engine analyser node continuously through requestAnimationFrame.

Display modes (O / S / R buttons, vertical column, top-left of the canvas):

Mode	Description
Oscilloscope (O)	Time-domain waveform drawn with a purple glow. Amplitude on the vertical axis; sample index on the horizontal.
Spectrum (S)	Frequency-domain bar chart. Bins are log-scaled; bar height represents magnitude. Gradient from deep purple at the base to lavender at the peak. Bars are constrained below the status overlay to prevent visual overlap.
Radial (R)	Frequency data mapped to a polar shape, drawn as a closed path centered in the canvas.

Status overlay (bottom of the canvas, gradient backdrop):

Sample rate (kHz) and FFT size.
RMS level in dBFS (sampled every 5 animation frames), marked with a yellow lightning icon.
Peak level in dBFS, marked with a green target icon.
LIVE or SILENT state (signal threshold −60 dBFS). LIVE animates with a purple pulse.
Settings and fullscreen toggle buttons.

Text in the overlay uses textShadow for legibility against any visualization behind it.

Canvas scaling: the canvas is sized to its container in physical pixels (device pixel ratio capped at 2×) through a ResizeObserver. The style dimensions are set in CSS pixels, so the canvas stays crisp on high-DPI displays.

16.2 Piano

Full piano roll interface embedded in the bottom panel. See §15.

16.3 Sequence

Full step sequencer embedded in the bottom panel. See §14.

16.4 Details

Displays full metadata for the currently selected library entry.

Field	Displayed value
Title	Entry title
Prompt / Negative prompt	Generation text
Model	Model key
Duration	MM:SS.ms
Steps / CFG / Seed	Generation parameters
Timestamp	Locale date and time
File size	Bytes / KB / MB
Tags / Notes	Editable per-entry
Source	`generate`, `editor-mixdown`, or `bucket`

Actions:

Audition in engine loads the entry into playerStore and begins playback.
Send to editor decodes waveform peaks and appends the entry to the waveform editor as a new clip.
Download triggers a browser file download of the audio.

16.5 Media

A session-scoped file holding area for arbitrary audio files. Contents are cleared on page reload.

Dropzone accepts drag-and-drop or click-to-upload. Supported formats: WAV, MP3, FLAC, OGG, AAC, M4A, Opus. Library entries can be dragged in directly using the application/x-stabledaw-library-id transfer protocol to locate the source file from the backend store.
Per-item display: filename, MIME type, file size.
Send to Editor decodes peaks and appends the item to the waveform editor as a new clip on a new track. Non-audio files are rejected with a log entry.
Send to Library decodes the audio, measures its duration, and creates a persistent entry with source: 'bucket'.
Remove removes the item from the bucket. The library and editor are unaffected.
Clear all removes every item at once.

16.6 SLIDE

The SLIDE tab is the glass-capsule control surface that mirrors the VJ engine's control manifest as faders. Moving a SLIDE fader updates the matching control in the VJ, and moving a control inside the VJ updates SLIDE. A content toggle switches which control set is shown, a detach button pops SLIDE out into its own window for a second monitor, and the shared maximize toggle expands it to fill the panel.

17. Player Footer

A fixed bar at the bottom of the viewport (z-index 50), visible and functional across all tabs and workspace modes.

17.1 Track Information (left region)

Thumbnail: an animated music-note icon with a purple pulse during playback.
Title: the current playerStore.currentLabel, truncated to fit the column.
Model chip: derived from useGenerateStore.lastModelName. Shows LIBRARY for entries loaded from the library and IDLE when nothing has been generated or loaded.
Duration readout: MM:SS // 48kHz, read from playerStore.duration.

17.2 Transport (center region)

Control	Description
Loop	Toggles looped playback in `playerStore`. Active state shown in purple.
Skip to start	Calls `playerStore.seekByFraction(0)`.
Play / Pause	Primary playback toggle. In EDIT workspace mode with no editor audio loaded, the first press triggers an offline render of the waveform editor timeline, loads the result into `playerStore`, and begins playback. Subsequent presses toggle playback natively.
Skip to end	Calls `playerStore.seekByFraction(1)`.
Fullscreen	Toggles browser fullscreen on `document.documentElement`.

Progress bar: a horizontal track showing playback position. Click anywhere to seek (playerStore.seekByFraction). On hover, a circular scrubber handle appears at the current position. Time labels at left and right show current time and total duration. The footer synchronizes its playhead with the Waveform Editor; scrubbing the progress bar updates the editor timeline position when the editor timeline is the active audio source.

17.3 Utilities (right region)

Mute toggle switches the playbackStore mute flag. The volume icon changes to a red VolumeX when muted.
Volume slider: an overlay <input type="range"> drives playbackStore.volume (0 to 100). The visual fill scales proportionally. The combined volume × !muted value is forwarded to playerStore.setMasterGain, which drives the shared Web Audio master gain node.
Download retrieves the library entry whose id matches playerStore.currentEntryId and triggers a browser file download.
More is decorative.

18. Processing Log

A collapsible message log pinned in the global footer alongside the bottom panel.

Producers

Source	Events
`system`	Application startup, model load status.
`health`	Backend connectivity check results.
`generate`	Job submission, progress, completion, errors.
`training`	LoRA job updates (when wired).
`studio`	MIX processing results and errors.
`sequencer`	Sequencer start, stop, and render events.
`library`	Backend save and load events.
`vj`	VJ export results and bridge errors.

Log Levels

Each entry carries a severity level shown as a colored left-border indicator:

Level	Color	Usage
`info`	Purple	Normal operational events.
`warn`	Amber	Non-fatal anomalies.
`error`	Red	Failures that prevented an operation from completing.
`debug`	Gray	Verbose internal state for development.

Controls

Ring buffer capacity: 500 entries. The oldest entries are discarded when the cap is reached.
Auto-scroll: the log panel scrolls to the most recent entry automatically.
Download: exports the full buffer as stabledaw-log-YYYYMMDD-HHMMSS.txt. Each line contains an ISO timestamp, level, source, and message.
Clear: wipes the ring buffer.
Collapse/expand: click anywhere on the header bar. The collapsed state shows the entry count and a "click to expand" hint.

19. Backend API Reference

All endpoints are served under /api/* on port 8600. Error responses use {"detail": string} (FastAPI default) or {"error": string}. The SA3_DEBUG_ERRORS environment variable (set to "1") enables additional detail fields in error responses.

19.1 Health

GET /api/health

Response when healthy:

{ "status": "ok", "model_loaded": true }

Response when degraded (HTTP 503):

{ "status": "degraded", "model_loaded": false, "error": "MODEL_LOAD_FAILED" }

19.2 System Statistics

GET /api/system-stats

Returns a JSON object describing current hardware utilization. This endpoint invokes nvidia-smi and polls psutil to stream GPU VRAM usage, GPU temperature, GPU utilization, CPU usage, and RAM usage.

19.3 Model Info

GET /api/model-info

{
  "active_model": "medium",
  "available_models": ["medium"],
  "sample_rate": 44100,
  "diffusion_objective": "rectified-flow",
  "has_cuda": true,
  "device": "cuda:0",
  "vram_used_gb": 4.2,
  "vram_total_gb": 6.0
}

19.4 Generation, Async (theDAW UI)

Submit:

POST /api/generate-jobs   multipart/form-data

Field	Type	Default	Description
`prompt`	string	required	Generation text prompt.
`negative_prompt`	string	`""`	Aspects to suppress.
`model_name`	string	`"medium"`	Model key.
`duration`	float	`30.0`	Output length in seconds.
`steps`	int	`8`	Denoising steps.
`cfg_scale`	float	`1.0`	Classifier-free guidance scale.
`seed`	int	`-1`	−1 for random.
`batch_size`	int	`1`	Number of simultaneous variations.
`init_noise_level`	float	`1.0`	Audio-to-audio conditioning strength.
`init_audio_type`	string	`"Audio"`	`"Audio"` or `"RF-Inv"`.
`file_format`	string	`"wav"`	Output file format.
`mask_start`	float	`0.0`	Inpaint region start in seconds.
`mask_end`	float	`0.0`	Inpaint region end in seconds.
`init_audio`	file	optional	Source audio for audio-to-audio mode.
`inpaint_audio`	file	optional	Source audio for inpainting.

Response:

{ "job": { "id": "<uuid>" } }

Poll:

GET /api/jobs/{job_id}

{
  "id": "...",
  "kind": "generate",
  "status": "queued | running | completed | failed",
  "progress": { "step": 4, "steps": 8 },
  "result": {
    "batch": false,
    "item": { "audio_base64": "...", "mime_type": "audio/wav", "filename": "..." }
  },
  "error": "..."
}

Batch results use "batch": true with an "items" array instead of a single "item".

19.5 Spectrogram Generation

GET /api/spectrogram/{job_id}
GET /api/spectrogram/{job_id}/{index}

Retrieves cached spectrogram output to avoid redundant generation overhead.

19.6 Generation, Synchronous

POST /api/generate   multipart/form-data

The same surface as /api/generate-jobs plus advanced parameters: sampler_type, apg_scale, sigma_max, cfg_rescale, cfg_norm_threshold, cfg_interval_min, cfg_interval_max, all distribution-shift parameters (dist_shift_type, logsnr_*, flux_*, full_*), and RF-Inversion parameters (inversion_steps, inversion_gamma, inversion_unconditional).

Response: a raw binary audio body (audio/wav by default). This endpoint runs synchronously and exposes the full advanced parameter set.

19.7 Studio Processing

POST /api/studio/process   multipart/form-data

Field	Type	Description
`audio`	file	Input audio. Required.
`effect`	string	One of the effect keys from §8.3.
`params`	string	JSON-serialized `Record<string, number>`.
`output_format`	string	`wav`, `flac`, `ogg`, `mp3`, `aac`, or `opus`.

Response: a binary audio body with the matching Content-Type header.

19.8 Jobs List

GET /api/jobs
-> Array of all job objects currently in the in-memory store.

The job store is in-memory only, so all jobs are lost on backend restart.

19.9 Training and Autoencoder (Stub Endpoints)

Endpoint	Status
`GET /api/autoencoder/info`	Returns empty arrays.
`POST /api/jobs/train-lora`	HTTP 501.
`POST /api/jobs/pre-encode`	HTTP 501.
`POST /api/autoencoder/encode`	HTTP 501.
`POST /api/autoencoder/decode`	HTTP 501.

19.10 Presets

GET /api/presets   -> []
POST /api/presets  -> { "id": "<uuid>", "saved": true }

Reserved for future use.

19.11 Assistant

All routes under /api/assistant are provided by backend/assistant_routes.py.

Provider catalog

GET /api/assistant/providers
→ { "providers": [ { "id", "label", "default_model", "has_key", "is_local" }, ... ] }

Returns one entry per configured provider. claude (Claude Code CLI) is always present with has_key: true. Remote providers with no API key in the environment are still listed, and requests to them fail until a key is added.

Model discovery

GET /api/assistant/models/{provider_id}
→ { "models": [...], "model_ids": [...], "error": null | string }

For openrouter and openrouter-free, this fetches the model list from OpenRouter with a free or paid filter. For ollama, it queries the local /api/tags endpoint. For gemini, it queries the Google model list. All others use the standard /v1/models endpoint.

Chat (streaming)

POST /api/assistant/chat   application/json

Request body:

{
  "provider": "gemini",
  "model": "gemini-2.0-flash",
  "messages": [ { "role": "user", "content": "..." } ],
  "attachments": [],
  "claudeMode": "interactive"
}

Response: server-sent events (text/event-stream). Each event is a JSON object:

{ "type": "delta", "content": "..." }
{ "type": "done" }
{ "type": "error", "message": "..." }

For the claude provider, claudeMode controls process lifecycle: interactive and persistent keep a warm Claude Code stream-json process across messages; oneshot and resume spawn per message.

Key pool management

POST   /api/assistant/keys/{provider_id}/ingest      → add keys (body: newline-separated or JSON array)
DELETE /api/assistant/keys/{provider_id}/{key_hash}  → remove one key
DELETE /api/assistant/keys/{provider_id}             → clear all keys for provider
GET    /api/assistant/keys                            → status for all providers
GET    /api/assistant/keys/{provider_id}             → status for one provider
GET    /api/assistant/keys/{provider_id}/raw         → key hashes with full status

Keys are stored in memory and lost on restart. The pool round-robins across available keys per provider, tracking last-used timestamps and failure counts.

RAG reindex

GET /api/assistant/reindex
→ { "status": "ok", "chunks_indexed": N }

Forces a full re-parse and re-embedding of USER_GUIDE.md into the ChromaDB vector store. Called automatically at startup.

19.12 Module Loader

GET /api/modules
-> [ { "name", "label", "enabled", "api_prefix", ... }, ... ]

Returns the manifests (module.json contents) for every module that loaded successfully at startup. A module that failed to load does not appear in this list.

Settings and module-management endpoints used by the Settings modal:

GET   /api/settings
PATCH /api/settings
GET   /api/modules/all
PATCH /api/modules/{dirName}/enabled
POST  /api/admin/restart
POST  /api/admin/shutdown

PATCH /api/settings accepts a partial nested object and silently drops unknown keys, so newer and older frontends stay compatible. Module enablement is persisted in module manifests and settings and takes effect according to each module's loader behavior; a backend restart may be required for import-time changes.

19.13 Disk-backed Library

The default local storage provider uses /api/library.

GET    /api/library/entries
GET    /api/library/entries/{entry_id}
GET    /api/library/audio/{entry_id}
PATCH  /api/library/entries/{entry_id}
DELETE /api/library/entries/{entry_id}
POST   /api/library/import
GET    /api/library/{entry_id}/bundle
GET    /api/library/{entry_id}/lineage?depth=3
GET    /api/library/_graph/all
GET    /api/library/_all/stems
GET    /api/library/_all/midi
GET    /api/library/stems/{stem_id}/audio

Audio responses use FileResponse and support browser range requests. Imports accept multipart audio plus JSON metadata. Bundle responses are ZIP files containing audio, metadata, analysis, stems, MIDI, and lineage edges when available.

19.14 Chimera

GET  /api/chimera/probe
POST /api/chimera/probe/refresh
POST /api/chimera/mashup

POST /api/chimera/mashup accepts multiple uploaded files plus target_bpm, optional base_index, JSON weights, align_mode (start, downbeat, weave), out_sr, and weave-specific bar and polyphony controls. The response contains base64 WAV audio plus sample_rate, duration_sec, target_bpm_used, target_bpm_source, align_mode_used, per-clip metadata, and warnings.

19.15 Stems

GET  /api/stems/probe
GET  /api/stems/status
POST /api/stems/install
POST /api/stems/start
POST /api/stems/stop
POST /api/stems/{entry_id}/run?stems=4&device=cuda&quality=high
GET  /api/stems/{entry_id}/progress
POST /api/stems/{entry_id}/abort
GET  /api/stems/{entry_id}

Stem runs operate on disk-backed library entries, hold the backend idle gate while running, and persist rows in the library database so stems appear in details, bundles, lineage, and send-target workflows.

19.16 MIDI

GET  /api/midi
POST /api/midi/install?engine=basic_pitch
POST /api/midi/{entry_id}/run?from_stems=true
GET  /api/midi/{entry_id}
GET  /api/midi/file/{midi_id}

MIDI conversion runs on full entries and, when available, separated stems. The streamed .mid files import into the Piano Roll or attach to bundles as downloadable artifacts.

19.17 VJ

The VJ module powers the VJ center tab and its embedded iframe. The frontend fetches the engine URL from GET /api/vj/url, which also returns a mobile_url for LAN access when the machine has a non-loopback address. The backend spawns the visual engine's dev server lazily; the port and project path can be overridden through the theDAW_VJ_PORT and theDAW_VJ_PROJECT environment variables. Use the Settings modal module list and the VJ tab loading state to confirm availability.

20. Python Pipeline Reference

20.1 Text-to-audio

from stable_audio_3 import StableAudioModel

pipe = StableAudioModel.from_pretrained("medium")
audio = pipe.generate(
    prompt="Lo-fi boom bap meets orchestral strings, 84 BPM",
    duration=180,
)

20.2 Audio-to-audio

import torchaudio

init_audio = torchaudio.load("/path/to/audio.wav")
audio = pipe.generate(
    init_audio=init_audio,
    init_noise_level=0.9,
    prompt="bossa nova bassline",
    duration=30,
)

20.3 Inpainting

inpaint_audio = torchaudio.load("/path/to/audio.wav")
audio = pipe.generate(
    inpaint_audio=inpaint_audio,
    inpaint_mask_start_seconds=4.0,
    inpaint_mask_end_seconds=8.0,
    prompt="punchy kick drum fill",
    duration=30,
)

Continuation: set inpaint_mask_start_seconds equal to the source clip's total length and duration to the desired output length. The model fills the extension conditioned on the tail of the source.

20.4 Autoencoder

from stable_audio_3 import AutoencoderModel

ae = AutoencoderModel.from_pretrained("same-l")
waveform, sr = torchaudio.load("audio.wav")
latents = ae.encode(waveform, sr)
audio_out = ae.decode(latents)

Batch encoding, chunked processing, and dataset pre-encoding for LoRA training: see docs/workflows/autoencoder.md.

20.5 LoRA at Inference

pipe = StableAudioModel.from_pretrained("medium")
pipe.load_lora("style.safetensors", weight=0.8)
audio = pipe.generate(prompt="...", duration=30)

Multiple LoRAs stack additively. Strength is adjustable at runtime:

from stable_audio_3.models.lora import set_lora_strength

set_lora_strength(pipe.model, 0.5)                  # all LoRAs
set_lora_strength(pipe.model, 1.0, lora_index=0)    # first LoRA only

20.6 Advanced Generation Parameters

audio = pipe.generate(
    prompt="...",
    duration=30,
    sampler_type="dpmpp_2m_sde",    # euler | rk4 | dpmpp_2m_sde | ping_pong
    sigma_max=1.0,                   # max noise level
    apg_scale=1.0,                   # Adaptive Projected Guidance scale
    cfg_interval=(0.0, 1.0),         # sigma range over which CFG is applied
)

Full parameter reference: stable_audio_3/model.py:StableAudioModel.generate.

21. Models

Key	Flavor	Params	Autoencoder	Hardware requirement	Max duration
`small`	ARC	433 M	SAME-S (266 M)	CPU	120 s
`medium`	ARC	1.4 B	SAME-L (1.7 B)	GPU (CUDA)	380 s
`small-rf`	RF	433 M	SAME-S	CPU	120 s
`medium-rf`	RF	1.4 B	SAME-L	GPU (CUDA)	380 s
`same-s`	Autoencoder	266 M	n/a	CPU	n/a
`same-l`	Autoencoder	1.7 B	n/a	GPU	n/a

ARC checkpoints bundle the autoencoder. Standalone SAME checkpoints share weights with the bundled version and reuse the cached full checkpoint when both are available. The Small model runs on modest GPUs; the Medium model needs around 8 GB of VRAM.

22. LoRA Adapter Types

Eight adapter types are available, trading parameter count against expressiveness:

Type	Trainable params per layer	Description
`lora`	`rank × (fan_in + fan_out)`	Standard LoRA. Two low-rank matrices A and B; update is `(alpha/rank) × B @ A`.
`dora-rows`	`rank × (fan_in + fan_out) + fan_out`	DoRA with per-row magnitude. The weight update is decomposed into direction and per-output-neuron magnitude. Default adapter type.
`dora-cols`	`rank × (fan_in + fan_out) + fan_in`	DoRA with per-column (per-input-feature) magnitude.
`bora`	`rank × (fan_in + fan_out) + fan_in + fan_out`	Bi-dimensional DoRA. Independent row and column magnitude scaling.
`lora-xs`	`rank²`	Maximum parameter efficiency. Only a `(rank, rank)` core matrix is trained; U and V bases are frozen SVD factors of the original weight.
`dora-rows-xs`	`rank² + fan_out`	DoRA-rows combined with the LoRA-XS frozen SVD bases.
`dora-cols-xs`	`rank² + fan_in`	DoRA-cols combined with LoRA-XS bases.
`bora-xs`	`rank² + fan_in + fan_out`	BoRA combined with LoRA-XS bases.

Training configuration

Argument	Default	Description
`--rank`	16	LoRA rank. Lower means fewer parameters; higher means more capacity.
`--lora_alpha`	equal to `--rank`	Scaling factor. Effective scale is alpha divided by rank.
`--adapter_type`	`dora-rows`	Adapter type from the table above.
`--dropout`	0.0	Dropout on LoRA inputs during training.
`--include`	all layers	Restrict LoRA to layers whose name contains one of these substrings. Bracket ranges are supported: `layers[0-11]`.
`--exclude`	none	Skip layers matching any of these substrings, even if they match `--include`.
`--svd_bases_path`	none	Pre-computed SVD bases `.pt` file. Eliminates per-layer SVD at startup for `-XS` adapters.
`--base_precision`	none	Cast frozen base weights to `bf16` after applying LoRA. Reduces VRAM usage; LoRA parameters stay in fp32.
`--lora_checkpoint`	none	Existing checkpoint to resume from. Loaded with `strict=False`.

Full training walkthrough: docs/workflows/lora.md.

23. Troubleshooting

Generation produces a static glitch (Medium model)

Flash Attention is not loaded correctly. Verify:

uv run python -c "import flash_attn; from flash_attn import flash_attn_func; print('Version:', flash_attn.__version__)"

Any import error means the wheel does not match the installed Python, PyTorch, and CUDA combination. Reinstall from kingbri1/flash-attention.

"API UNREACHABLE" banner in the header

The backend is not responding on port 8600. Test directly:

curl http://localhost:8600/api/health

If the request fails, restart the backend. On Windows, .\start-dev.bat kills stale processes automatically. Manually: taskkill /F /IM uvicorn.exe.

COMMIT EDIT hangs indefinitely

One or more clip source Blobs are failing to decode. The decode step has a 15-second timeout per Blob; if all clips time out, the render never completes. Check the Processing Log for decodeAudioData timeout entries. Likely causes: a corrupted Blob in the library, or an unusually large audio file. Remove the suspect clip and retry.

TRAIN tab displays "TRAINING METADATA FAILED"

/api/autoencoder/info returned a non-OK status. In this fork, the endpoint is a stub that returns an empty list, which is the expected behavior. The TRAIN tab degrades gracefully; LoRA training is available through the command line.

Vite dev server cannot reach `/api`

The proxy block in vite.config.ts is missing, or localhost:8600 is not listening. Verify the backend is running (curl http://localhost:8600/api/health) and that vite.config.ts contains:

server: { proxy: { '/api': 'http://localhost:8600' } }

Audio plays at the wrong speed or pitch

A sample-rate mismatch exists somewhere in the chain. The full pipeline is 44.1 kHz stereo end to end. Verify pipeline.sample_rate and that the uploaded audio file matches.

Out-of-memory on Medium model

The Medium model needs around 8 GB VRAM. Workarounds:

Use the small model.
Reduce duration; shorter sequences consume less peak memory.
Confirm no other CUDA processes are active on the same device.
The Small model runs on modest GPUs; the Medium model needs ~8 GB of VRAM.

Library storage fills the disk

The disk-backed library stores audio under data/generations/. Delete old entries with the trash icon per row to reclaim space. The metadata database lives at data/library.db; both paths are gitignored.

24. Development Workflows

Lint

uv run ruff check
uv run ruff format --check

Ruff excludes stable_audio_3/models, inference, interface, and data. Only top-level files (pipeline.py, model.py, model_configs.py, loading_utils.py, verbose.py) are checked. Run both commands from the repo root.

Tests

uv run pytest                        # Full suite (Medium tests skip on non-CUDA hosts)
uv run pytest tests/test_inference.py
uv run pytest --save-audio           # Write outputs to test_audio_outputs/ for inspection

Session-scoped fixtures avoid reloading models between tests. Medium tests are skipped automatically on hosts without a CUDA GPU.

Frontend build

cd frontend
npm run build      # Outputs to frontend/dist/
npm run preview    # Serves the built bundle locally

Adding a new FFmpeg effect

Add an entry to EFFECT_PARAM_BOUNDS in backend/server.py with the allowed parameter ranges.
Extend _build_filter() (or its equivalent) with the FFmpeg filter graph command for the new effect.
Add the effect to the catalog in frontend/src/lib/effectCatalog.ts and surface it in the MIX tab (frontend/src/views/MixPanel.tsx) with a display label and color class.

Adding a backend module

Create a new directory under backend/modules/ with two required files.

module.json:

{
  "name": "my-module",
  "label": "My Module",
  "enabled": true,
  "api_prefix": "/api/my-module",
  "description": "What this module does"
}

router.py:

from fastapi import APIRouter

router = APIRouter()

@router.get("/status")
async def status():
    return {"ok": True}

The loader (backend/modules/loader.py) discovers the directory on the next server start and mounts the router at api_prefix. Set "enabled": false in module.json to disable without deleting. Any import error is logged and the module is skipped; other modules and the main app continue loading normally.

Backend job persistence

The current async job store (the JOBS dict in backend/server.py) is in-memory, so all jobs are lost on backend restart. For production deployment, swap this for SQLite or Redis; the job object shape is well-defined and the swap is a single-layer change.

Zustand store architecture

Store	State owned
`useEditorStore`	Waveform editor tracks, clips, playhead, tool mode, snap, zoom, inpaint selection.
`usePlayerStore`	HTMLAudioElement-based playback engine (load, play, pause, seek, loop). Shared by all playback sources.
`usePlaybackStore`	Master volume and mute. Read by `playerStore`, the library player, and the sequencer master gain.
`useLibraryStore`	Disk-backed provider facade plus session Blob cache, search/filter/sort state, selected entry.
`useGenerateStore`	Last generation metadata (filename, model, duration) for footer display.
`useGenerateParamsStore`	All MAKE tab form field values.
`useStudioStore`	MIX tab source, output, process history.
`useEffectChainStore`	MIX effect chain entries, order, parameters, enabled flags.
`useEditorPlaybackBridge`	Module-level callback registration (not a Zustand store) decoupling the footer from the editor to avoid circular dependencies.
`usePianoRollStore`	Piano roll note list, BPM, grid length, playback state.
`useMediaBucketStore`	Session-scoped file bucket items.
`useBottomPanelStore`	Bottom panel open state, height, active tab.
`useLogStore`	Processing log ring buffer.
`useAppUiStore`	Center-tab shell state (`make`, `edit`, `mix`, `dj`, `vj`, `train`, `learn`) and the right library rail state.
`useDjLayout`	DJ tab Design Mode layout (panel order, sizes, design-mode flag).

25. Feature Coverage and Screenshot Evidence

theDAW documentation has a repeatable audit loop connecting implemented features to guide coverage and visual proof.

Tip

Treat this section as the docs control room: feature descriptors define what exists, screenshot specs prove it visually, and the coverage report confirms the guide explains it.

25.1 Coverage artifacts

Artifact	Purpose
`scripts/screenshots/specs.ts`	Canonical feature descriptors plus screenshot and crop mapping.
`scripts/screenshots/featureCoverage.ts`	Generates the feature-vs-doc coverage report from `docs/USER_GUIDE.md`.
`docs/reports/feature-doc-coverage-report.md`	Human-readable coverage matrix and screenshot map.
`docs/reports/feature-doc-coverage.json`	Machine-readable report for automation.

The coverage pass records whether repomix-output.md was present and confirms it is not tracked. Repomix is local analysis context only.

25.2 Screenshot generation

The primary screenshot runner is scripts/screenshots/capture.ts:

npm --prefix frontend run screenshots

The runner drives the live app through real DAW interactions, writes full-scene images to docs/screenshots/, and emits crop assets using this naming convention:

<scene-id>.png
<scene-id>__<crop-id>.png

For example, 03-library-actions-toolbar.png documents the Library toolbar as a whole, while 03-library-actions-toolbar__library-toolbar.png focuses tightly on the action cluster. One full screenshot or crop can support multiple feature IDs.

25.3 Current feature-to-screenshot map

Feature ID	Recommended evidence
`shell-center-tabs-right-library`	`01-shell-make.png`, `02-library-with-showcase-selected__library-details.png`
`docs-modal-download-print-rag`	`01-shell-make__header-actions.png`, `docs/UI/screenshots/05-docs-modal.png`
`assistant-orb-providers-keys-attachments`	`01-shell-make__header-actions.png`
`make-advanced-generation-templates-prompts-spectrograms`	`01-shell-make__make-controls.png`
`make-chimera-fusion-stack`	`09-chimera-cohort-multi-select.png`, `01-shell-make__make-controls.png`
`make-mic-recorder-send-targets`	`01-shell-make__make-controls.png`
`mix-effects-chain-quick-master`	`docs/UI/screenshots/02-edit-tab-overview.png` plus future MIX crop
`library-backend-local-storage`	`02-library-with-showcase-selected.png`, `02-library-with-showcase-selected__library-details.png`
`library-bundle-download-lineage-export`	`04-library-download-submenu__download-submenu.png`, `06-learn-tab-3d-graph__lineage-graph.png`
`library-stems-sidecar`	`05-library-entry-right-click__entry-context-menu.png`
`library-midi-conversion`	`04-library-download-submenu__download-submenu.png`, `05-library-entry-right-click__entry-context-menu.png`
`settings-feature-toggles-modules-admin`	`07-settings-modal-with-shutdown__settings-toggles.png`
`sequencer-midi-export-render`	`docs/UI/screenshots/06-step-sequencer.png` plus future sequencer-toolbar crop
`vj-sidecar-tab-mobile-share`	`08-vj-tab-loading__vj-panel.png`, `01-shell-make__header-actions.png`

25.4 Documentation maintenance rule

When adding or changing a feature:

Add or update its FeatureDescriptor in scripts/screenshots/specs.ts.
Map the feature to a full screenshot and, if needed, a crop region.
Update docs/USER_GUIDE.md first.
Run the coverage script and the screenshot runner.
Sync docs/USER_GUIDE.md to frontend/public/USER_GUIDE.md before validating the Docs modal.

Maintained by the theDAW development team.

FilesExpand file tree

USER_GUIDE.md

Latest commit

History

USER_GUIDE.md

File metadata and controls

theDAW User Guide

Table of Contents

1. Repository Anatomy

2. Architecture

Two-stage generation

Checkpoint flavors

Web Audio engine

3. Installation

Base Python environment

Optional Linux CUDA wheels (Medium model)

With dev dependencies

Windows-specific requirements

Frontend dependencies

4. Launching the Application

One-shot launcher (Windows)

Manual launch

5. UI Shell

6. MAKE Tab

Purpose

6.1 Primary Synthesis / Prompt

6.2 Generation Parameters

6.3 Advanced Generation Panel

6.3.1 Chimera Fusion Stack

6.4 Init Signal / Conditioning

6.4.1 Microphone Recorder

6.5 Inpainting / Regen Region

6.6 LoRA / Adaptive Layers

6.7 Output Status Monitor

6.8 Run Generation

7. EDIT Tab

Purpose

7.1 Toolbar

7.2 Per-track Controls

7.3 Per-clip Display and Handles

7.4 Inpainting from the Editor

7.5 Timeline Interaction

7.6 COMMIT EDIT

8. MIX Tab

Purpose

8.1 Layout

8.2 Quick Master

8.3 Effect Catalog and Chain

8.4 Source, Output, and Routing

8.5 Process History

9. DJ Tab

Purpose

9.1 Waveform Hero and Decks

9.2 Per-deck Transport and Cueing

9.3 Center Mixer

9.4 Live Stems and FX

9.5 Cue / Headphone Output

9.6 MIDI Learn

9.7 Automix, Sampler, and Side List

9.8 Browser and Source Tree

9.9 Design Mode

10. VJ Tab

Purpose

10.1 Inputs

10.2 Pop-out and Mobile

10.3 Bridges

10.4 Export

11. TRAIN Tab

Purpose

11.1 Target Architecture

11.2 Pre-encode Workflow

11.3 Autoencoder Round-trip

11.4 Job Polling

12. LEARN Tab

Purpose

12.1 Views

12.2 3D Graph Controls

13. Library

Purpose

13.1 Automatic Entry Creation