This is the public configuration index for VoScript v0.8.4. It covers the
environment variables that the current code reads, the per-request override
semantics of POST /api/transcribe, and internal defaults that are documented
for operators but are not stable public knobs yet. Do not assume a Whisper,
diarization, or AS-norm env var exists unless it is listed here.

| Layer | Example | Precedence |
|---|---|---|
| API request field | denoise_model=deepfilternet, snr_threshold=8 | Per-job only, wins over service env |
| Container environment | .env injected through docker-compose.yml | Service-level default |
| Code default | app/config.py | Fallback when env is empty or invalid |

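
The three-layer precedence above can be sketched as a small resolver. This is an illustration only: `resolve_snr_threshold` and `CODE_DEFAULT_SNR_THRESHOLD` are hypothetical names, not the functions in app/config.py, and the sketch assumes an unparsable env value is treated the same as an empty one.

```python
import os
from typing import Mapping, Optional

# Documented code default for DENOISE_SNR_THRESHOLD (hypothetical constant name).
CODE_DEFAULT_SNR_THRESHOLD = 10.0


def resolve_snr_threshold(
    request_value: Optional[float],
    env: Mapping[str, str] = os.environ,
) -> float:
    """Resolve the SNR gate: request field, then container env, then code default."""
    if request_value is not None:
        return float(request_value)  # per-job override wins over service env
    raw = env.get("DENOISE_SNR_THRESHOLD", "").strip()
    if raw:
        try:
            return float(raw)  # service-level default from the container env
        except ValueError:
            pass  # invalid env value falls through to the code default
    return CODE_DEFAULT_SNR_THRESHOLD
```

With `DENOISE_SNR_THRESHOLD=12` in the environment, a request carrying `snr_threshold=8` resolves to 8 for that job only, while a request that omits the field resolves to 12.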
POST /api/transcribe currently exposes only language, min_speakers,
max_speakers, denoise_model, snr_threshold, and no_repeat_ngram_size.
Other pipeline settings may have internal defaults, but they are not public API
parameters yet.
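
A client can guard against sending unsupported settings by checking field names before building the form. The helper below is a hypothetical client-side sketch; it only covers the six documented fields and does not assume how the audio file itself is attached to the request.

```python
from typing import Any, Dict

# The only per-request fields this document lists for POST /api/transcribe.
PUBLIC_FIELDS = {
    "language",
    "min_speakers",
    "max_speakers",
    "denoise_model",
    "snr_threshold",
    "no_repeat_ngram_size",
}


def build_transcribe_form(**fields: Any) -> Dict[str, str]:
    """Return form fields for a transcription request, rejecting unknown names."""
    unknown = set(fields) - PUBLIC_FIELDS
    if unknown:
        raise ValueError(f"not a public /api/transcribe field: {sorted(unknown)}")
    # Omitted (None) fields are dropped so the service default applies.
    return {name: str(value) for name, value in fields.items() if value is not None}
```

For example, `build_transcribe_form(language="zh", snr_threshold=8)` yields two string fields, while `build_transcribe_form(beam_size=5)` raises because beam size is an internal default, not a request field.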

| Variable | Default | Effect |
|---|---|---|
| API_KEY | empty | When set, all endpoints except /, /healthz, /docs, /redoc, /openapi.json, and /static/* require Authorization: Bearer <key> or X-API-Key: <key>. |
| ALLOW_NO_AUTH | 0 | Used only when API_KEY is empty. 1 acknowledges unauthenticated mode and suppresses the startup warning; it does not add protection. |
| CORS_ALLOW_ORIGINS | * | Comma-separated CORS origins. Narrow this before exposing the service outside a trusted network. |
| HOST_PORT | 8780 | Host port published by compose; not an app runtime env var. |
| MAX_UPLOAD_BYTES | 2147483648 | Per-upload byte cap. Larger uploads return 413 and the partial file is removed. |
| DATA_DIR | /data | In-container data root for transcriptions, uploads, and voiceprints. Compose mounts host ./data to /data by default. |
| MODEL_CACHE_DIR | ./models | Host model-cache directory used by compose, mounted to /cache and read-only /models. |
| APP_UID / APP_GID | 1000 / 1000 | Container runtime user. Host DATA_DIR and MODEL_CACHE_DIR must be writable by this uid/gid. |
| DEVICE | cuda | Pipeline inference device. Use cpu for CPU-only, macOS, or non-NVIDIA hosts. |
| CUDA_VISIBLE_DEVICES | unset | Optional NVIDIA visibility limit. By default this variable is not injected and compose requests every Docker-exposed GPU. Add it only through docker-compose.override.yml or another explicit operator env override when you need to restrict the visible GPU set; inside the container, cuda:0 is the first visible GPU and may not be physical host GPU0. For CPU-only mode, set DEVICE=cpu. |
| FFMPEG_TIMEOUT_SEC | 1800 | ffmpeg conversion timeout in seconds; timeout returns 504. |
| JOBS_MAX_CACHE | 200 | In-memory job LRU limit. Evicted completed jobs remain queryable from disk status.json / result.json. |
| MODEL_IDLE_TIMEOUT_SEC | 180 | GPU model idle-unload timeout, defaulting to 180 seconds (3 minutes). Set 0 to disable idle unload and keep models resident. When enabled, loaded models are released only after the serialized GPU runtime has been idle for this many seconds; on the next reload, ASR, diarization, and embedding each choose the visible CUDA device with the most free memory during their own lazy load. |
| RUST_KERNEL_MODE | off | Optional Rust-backed provider/kernel mode. off keeps Python implementations; required makes selected Rust-backed paths import and run successfully or fail closed. The current selected paths are voiceprint scoring, result post-processing, and artifact manifest helper contracts; CI / Docker packaging still validates the Rust extension directly when the runtime default is off. |

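
The API_KEY rule in the table above can be read as a small predicate. The following is a minimal sketch, not the service's middleware: `is_exempt` and `is_authorized` are hypothetical names, and header lookup is assumed to be an exact-name mapping.

```python
import hmac
from typing import Mapping

# Paths the table lists as reachable without credentials.
EXEMPT_EXACT = {"/", "/healthz", "/docs", "/redoc", "/openapi.json"}


def is_exempt(path: str) -> bool:
    """True for the documented unauthenticated paths, including /static/*."""
    return path in EXEMPT_EXACT or path.startswith("/static/")


def is_authorized(path: str, headers: Mapping[str, str], api_key: str) -> bool:
    """Accept either 'Authorization: Bearer <key>' or 'X-API-Key: <key>'."""
    if not api_key or is_exempt(path):
        return True  # empty API_KEY means unauthenticated mode
    candidates = []
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        candidates.append(auth[len("Bearer "):])
    if "X-API-Key" in headers:
        candidates.append(headers["X-API-Key"])
    # Constant-time comparison avoids leaking key prefixes through timing.
    return any(hmac.compare_digest(c.encode(), api_key.encode()) for c in candidates)
```

Note that an empty key returns True for every path, which matches the table: ALLOW_NO_AUTH only silences the warning and adds no protection.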
MODELS_DIR and LANGUAGE are defined in the config module, but v0.8.4's main
HTTP transcription path does not use them as stable public tuning knobs:
Whisper local checkpoint lookup still expects /models/faster-whisper-<WHISPER_MODEL>,
and default language should be controlled with the request language field or
left empty for auto-detection.
Idle unload is a memory-pressure feature, not a throughput feature. The unload daemon shares the same serialized GPU semaphore as transcription work and rechecks the idle timestamp after acquiring it, so a queued or freshly completed job cannot be unloaded based on a stale pre-wait observation. CUDA cache release is best-effort and is skipped safely on CPU-only hosts.
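
The recheck-after-acquire behavior described above can be sketched as follows. `IdleUnloader` is a hypothetical class for illustration; the real daemon, its model objects, and its CUDA cache release are not shown.

```python
import threading
import time
from typing import Callable


class IdleUnloader:
    """Sketch of idle unload that shares one GPU semaphore with job execution."""

    def __init__(self, idle_timeout_sec: float, clock: Callable[[], float] = time.monotonic):
        self.idle_timeout_sec = idle_timeout_sec
        self.clock = clock
        self.gpu_lock = threading.Semaphore(1)  # serializes jobs and unloads
        self.last_used = clock()
        self.loaded = True

    def run_job(self, work: Callable[[], object]) -> object:
        with self.gpu_lock:
            self.loaded = True  # a lazy (re)load would happen here
            try:
                return work()
            finally:
                self.last_used = self.clock()  # refresh the idle timestamp

    def maybe_unload(self) -> bool:
        if self.idle_timeout_sec <= 0:
            return False  # 0 disables idle unload
        with self.gpu_lock:  # may wait behind queued jobs
            # Recheck after acquiring: a pre-wait observation may be stale.
            idle_for = self.clock() - self.last_used
            if self.loaded and idle_for >= self.idle_timeout_sec:
                self.loaded = False  # release models here in a real service
                return True
            return False
```

Because the idle time is computed inside the lock, a job that finished while the daemon was waiting resets the timer and the unload is skipped.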
Docker Compose requests all available NVIDIA GPUs with count: all by default
and does not set CUDA_VISIBLE_DEVICES, so the container can see every
Docker-exposed GPU. DEVICE=cuda lets each model choose the best visible GPU
when it lazy-loads; DEVICE=cuda:0 or another indexed value pins to that
in-container visible index and will not auto-move to another GPU.
To restrict visibility, create a local, uncommitted docker-compose.override.yml:

```yaml
services:
  voscript:
    environment:
      - CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}
```

Then set CUDA_VISIBLE_DEVICES=1,3 in your local .env or launch environment.
After that, in-container cuda:0 maps to the first device in the visible set,
not necessarily physical host GPU0.
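
The difference between DEVICE=cuda and a pinned index can be sketched as a pure function. `pick_device` is a hypothetical helper; it takes the free-memory figures as input rather than querying the GPU, and it assumes that DEVICE=cuda with no visible GPU is an error.

```python
from typing import Sequence


def pick_device(device_setting: str, free_bytes: Sequence[int]) -> str:
    """Map a DEVICE setting to a concrete device string.

    free_bytes[i] is the free memory of in-container visible GPU i. In a real
    service this could come from a call such as torch.cuda.mem_get_info(i).
    """
    if device_setting != "cuda":
        return device_setting  # "cpu" or a pinned "cuda:<index>" is used as-is
    if not free_bytes:
        raise RuntimeError("DEVICE=cuda but no visible CUDA device")
    best = max(range(len(free_bytes)), key=lambda i: free_bytes[i])
    return f"cuda:{best}"
```

With CUDA_VISIBLE_DEVICES=1,3, `pick_device("cuda", [4, 10])` returns cuda:1, which is host GPU 3, while `pick_device("cuda:0", [4, 10])` stays on cuda:0 (host GPU 1) regardless of free memory.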

| Variable | Default | Effect |
|---|---|---|
| HF_TOKEN | empty | Token for pyannote / WeSpeaker gated models. Accept the relevant Hugging Face model terms first. |
| HF_ENDPOINT | https://huggingface.co | Hugging Face Hub endpoint; use a trusted mirror in restricted networks. |
| HF_HUB_DISABLE_XET | 1 | Bypasses hf-xet/CAS downloads by default. Set 0 only when your environment supports hf-xet reliably. |
| HF_HUB_ETAG_TIMEOUT | 3 | Hub metadata timeout in seconds, so slow networks fall back to local cache quickly. |
| HF_HOME / HUGGINGFACE_HUB_CACHE / TORCH_HOME / XDG_CACHE_HOME | /cache-based paths | Dockerfile cache defaults. Usually configure the host mount through MODEL_CACHE_DIR instead of overriding these one by one. |

faster-whisper first looks for /models/faster-whisper-<WHISPER_MODEL>; if it
does not exist, it loads by model name. pyannote and WeSpeaker first try a
complete local Hugging Face snapshot and fall back to Hub loading only when the
cache is incomplete.
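
The faster-whisper lookup order can be sketched as below. `resolve_whisper_source` is a hypothetical name, and the sketch assumes a plain directory-existence check stands in for whatever validation the service performs.

```python
from pathlib import Path


def resolve_whisper_source(model_name: str, models_root: str = "/models") -> str:
    """Prefer a local checkpoint directory, else fall back to the model name."""
    local = Path(models_root) / f"faster-whisper-{model_name}"
    # A local directory is used directly; otherwise the name is passed through
    # so the library can resolve it from its cache or the Hub.
    return str(local) if local.is_dir() else model_name
```

With WHISPER_MODEL=large-v3 and no /models/faster-whisper-large-v3 directory, the function returns the bare name large-v3.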

| Setting | Default | Supported Today |
|---|---|---|
| WHISPER_MODEL | large-v3 | Service env. Supports tiny, base, small, medium, large-v3, and other faster-whisper model names. |
| DEVICE | cuda | Service env. cuda / cuda:<index> uses float16; cpu uses int8. Compute type is not separately configurable yet. |
| API language | auto-detect | Per-request field. Empty means auto-detect and use the Mandarin-oriented initial prompt. |
| API no_repeat_ngram_size | 0 | Per-request field. Values >=3 are passed to faster-whisper to suppress n-gram repetition; non-integers return 422. |

Current internal ASR defaults are beam_size=5, vad_filter=True,
vad_parameters.min_silence_duration_ms=500, and condition_on_previous_text=False.
These do not have env or API fields in v0.8.4. Do not configure nonexistent
variables such as WHISPER_BEAM_SIZE, WHISPER_COMPUTE_TYPE, or WHISPER_VAD_*.
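
Put together, the ASR settings above amount to a fixed set of transcription arguments plus two per-request values. The sketch below uses hypothetical helper names and assumes that no_repeat_ngram_size values below 3 are simply left at the library default; the keyword names match the faster-whisper transcribe parameters of the same names.

```python
from typing import Any, Dict, Optional


def compute_type_for(device: str) -> str:
    """cuda / cuda:<index> uses float16; cpu uses int8."""
    return "int8" if device == "cpu" else "float16"


def transcribe_kwargs(language: Optional[str], no_repeat_ngram_size: int = 0) -> Dict[str, Any]:
    """Build transcription arguments from internal defaults and request fields."""
    kwargs: Dict[str, Any] = {
        "beam_size": 5,
        "vad_filter": True,
        "vad_parameters": {"min_silence_duration_ms": 500},
        "condition_on_previous_text": False,
        "language": language or None,  # empty means auto-detect
    }
    if no_repeat_ngram_size >= 3:
        kwargs["no_repeat_ngram_size"] = no_repeat_ngram_size
    return kwargs
```

The only values a caller can change here are `language` and `no_repeat_ngram_size`; everything else is fixed in this version.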

| Setting | Default | Effect |
|---|---|---|
| DENOISE_MODEL | none | Service default backend: none, deepfilternet, or noisereduce. Unknown values log a warning and skip denoising. |
| DENOISE_SNR_THRESHOLD | 10.0 | DeepFilterNet SNR gate in dB. When deepfilternet is selected, audio estimated at or above this value is skipped to avoid degrading clean recordings; noisereduce does not use this gate. |
| API denoise_model | omitted | Omitted means inherit DENOISE_MODEL; explicit none disables denoising for this job only. |
| API snr_threshold | omitted | Omitted means inherit DENOISE_SNR_THRESHOLD; explicit values override the DeepFilterNet SNR gate for this job only. |

v0.8.4 defaults to DENOISE_MODEL=none for clean meeting-recorder audio. Enable
deepfilternet or noisereduce only for noisy environments, either per job or
as a service default. If you need clean recordings to be skipped automatically,
use deepfilternet; noisereduce runs whenever it is selected.
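
The denoise decision described above reduces to a short predicate. `should_denoise` is a hypothetical name, and the SNR estimate is taken as an input because the estimator itself is not documented here.

```python
def should_denoise(model: str, estimated_snr_db: float, snr_threshold_db: float = 10.0) -> bool:
    """Decide whether the selected backend runs on this audio."""
    if model == "noisereduce":
        return True  # runs whenever selected; the SNR gate does not apply
    if model == "deepfilternet":
        # Audio at or above the threshold is treated as clean and skipped.
        return estimated_snr_db < snr_threshold_db
    return False  # "none" and unknown values skip denoising
```

For a recording estimated at 12 dB, deepfilternet is skipped at the default threshold of 10.0 but would run if the job sets snr_threshold=15.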

| Setting | Default | Effect |
|---|---|---|
| API min_speakers / max_speakers | 0 | Per-request speaker-count bounds. 0 means auto and is not passed to pyannote. |
| PYANNOTE_MIN_DURATION_OFF | 0.5 | pyannote _binarize.min_duration_off, used to merge short pauses and reduce over-segmentation. If the pyannote object does not support it, the service logs a warning and continues. |
| WHISPERX_ALIGN_DISABLED_LANGUAGES | empty | Comma-separated languages that skip forced alignment when no model override is present. Use only as a temporary operational fallback. |
| WHISPERX_ALIGN_DEVICE | cpu | Runtime device for WhisperX forced alignment. CPU is the default to isolate wav2vec2 alignment from GPU ASR / speaker-embedding runtimes; set to pipeline / asr / cuda / cuda:0 only after validating CUDA alignment stability. |
| WHISPERX_ALIGN_MODEL_MAP | empty | Comma-separated lang=model overrides, for example zh=org/model. |
| WHISPERX_ALIGN_MODEL_DIR | empty | Optional alignment model directory; passed through only when the installed WhisperX supports that parameter. |
| WHISPERX_ALIGN_CACHE_ONLY | 0 | When 1, requests cache-only alignment model loading, only when supported by the installed WhisperX. |

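
The WHISPERX_ALIGN_MODEL_MAP format (comma-separated lang=model pairs) can be parsed as sketched below. `parse_align_model_map` is a hypothetical helper, and silently skipping malformed entries is an assumption, not documented service behavior.

```python
from typing import Dict


def parse_align_model_map(raw: str) -> Dict[str, str]:
    """Parse 'zh=org/model,en=org/other' into {'zh': 'org/model', ...}."""
    mapping: Dict[str, str] = {}
    for item in raw.split(","):
        lang, sep, model = item.strip().partition("=")
        lang, model = lang.strip(), model.strip()
        if sep and lang and model:  # skip empty or malformed entries
            mapping[lang] = model
    return mapping
```

An empty variable yields an empty mapping, in which case no per-language override applies.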
Alignment is optional metadata. On success, results may include
alignment.status=succeeded and segments[].words. If disabled or failed, the
job still completes; words may be absent and alignment records skipped or
failed using sanitized metadata. Clients must treat both fields as optional.

| Variable | Default | Effect |
|---|---|---|
| EMBEDDING_DIM | 256 | Voiceprint vector dimension used for DB and AS-norm cohort shape checks. Do not mix existing stores across dimensions. |
| MIN_EMBED_DURATION | 1.5 | Diarization turns shorter than this are ignored for speaker embedding extraction. |
| MAX_EMBED_DURATION | 10.0 | Longer turns are clipped to this window before embedding extraction. |

Each speaker cluster uses up to the 10 longest usable chunks to produce an averaged embedding. Very short, fragmented, or noisy turns reduce enrollment and matching quality.
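
The chunk selection rule can be sketched in plain Python. The helper names are hypothetical, and two details are assumptions: long turns are clipped from their start, and the "longest" ranking uses the clipped duration.

```python
from typing import List, Sequence, Tuple

Turn = Tuple[float, float]  # (start_sec, end_sec) of one diarization turn


def select_embedding_chunks(
    turns: Sequence[Turn],
    min_dur: float = 1.5,   # MIN_EMBED_DURATION
    max_dur: float = 10.0,  # MAX_EMBED_DURATION
    max_chunks: int = 10,
) -> List[Turn]:
    """Drop short turns, clip long ones, and keep the longest few."""
    usable = []
    for start, end in turns:
        if end - start < min_dur:
            continue  # too short to embed reliably
        usable.append((start, min(end, start + max_dur)))
    usable.sort(key=lambda c: c[1] - c[0], reverse=True)
    return usable[:max_chunks]


def mean_embedding(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-chunk embeddings into one speaker embedding."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]
```

A cluster made only of sub-1.5-second turns produces no usable chunks, which is one way fragmented speech ends up without a voiceprint match.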

| Item | Default | Notes |
|---|---|---|
| VOICEPRINT_THRESHOLD | 0.75 | Base threshold for raw cosine mode. The effective threshold adapts by sample count and sample_spread. |
| Raw single-sample relaxation | 0.05 | One-sample speakers default to an effective threshold around 0.70. Internal default, not env. |
| Raw spread relaxation | 3.0 * sample_spread, capped at 0.10 | Multi-sample speakers with larger sample spread get a moderate relaxation. Internal default. |
| Raw absolute floor | 0.60 | Raw cosine auto-naming never accepts below this value. Internal default. |
| AS-norm activation | 10 cohort embeddings | When cohort size is below 10, ASNormScorer.score() falls back to raw cosine. Internal default. |
| AS-norm base | 0.5 | Z-score-like operating point once the cohort is large enough; not raw cosine. Internal default. |
| AS-norm top-1/top-2 margin | 0.05 | If the best normalized candidate is too close to the second candidate, the speaker remains unnamed. Internal default. |
| AS-norm cohort top_n | 200 | Number of nearest cohort impostors used for AS-norm statistics, capped by cohort size. Internal default. |

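
One plausible reading of the raw cosine rows above is sketched below. How the service actually combines the relaxations is not specified here, so this assumes they are subtracted from the base, do not stack, and are clamped at the absolute floor; `effective_raw_threshold` is a hypothetical name.

```python
def effective_raw_threshold(sample_count: int, sample_spread: float, base: float = 0.75) -> float:
    """Effective raw-cosine threshold under the assumptions stated above."""
    if sample_count <= 1:
        relax = 0.05  # single-sample relaxation
    else:
        relax = min(3.0 * sample_spread, 0.10)  # spread relaxation, capped
    return max(base - relax, 0.60)  # never accept below the absolute floor
```

Under these assumptions a one-sample speaker is matched at about 0.70, a multi-sample speaker with sample_spread 0.02 at about 0.69, and any larger spread bottoms out at 0.65 for the default base.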
similarity depends on cohort state:

- Cohort < 10 or AS-norm unavailable: similarity is raw cosine, usually in [-1, 1].
- Cohort >= 10: similarity is an AS-norm normalized score and may exceed 1 or be negative.
- Only speaker_id != null means the candidate passed the effective threshold for the current mode; do not display similarity as a percentage.

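
To show why the normalized score can leave [-1, 1], here is a textbook adaptive symmetric normalization (AS-norm) sketch with the documented fallback to raw cosine below 10 cohort embeddings. It is not the service's ASNormScorer; the function names and the exact statistics (population standard deviation, a small epsilon) are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import Sequence, Tuple

Vec = Sequence[float]


def cosine(a: Vec, b: Vec) -> float:
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0  # degenerate vector: treat as no similarity
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def _top_stats(vec: Vec, cohort: Sequence[Vec], top_n: int) -> Tuple[float, float]:
    """Mean and std of the top_n highest cohort scores for one vector."""
    scores = sorted((cosine(vec, c) for c in cohort), reverse=True)[:top_n]
    return mean(scores), max(pstdev(scores), 1e-6)


def asnorm_score(probe: Vec, enrolled: Vec, cohort: Sequence[Vec],
                 top_n: int = 200, min_cohort: int = 10) -> float:
    raw = cosine(probe, enrolled)
    if len(cohort) < min_cohort:
        return raw  # fallback: raw cosine
    mu_e, sd_e = _top_stats(enrolled, cohort, top_n)
    mu_p, sd_p = _top_stats(probe, cohort, top_n)
    # Average of the two z-scores: unbounded, so it may exceed 1 or be negative.
    return 0.5 * ((raw - mu_e) / sd_e + (raw - mu_p) / sd_p)
```

Because the output measures how far the raw score sits above the impostor distribution in standard deviations, rendering it as a percentage would be misleading.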
Cohort lifecycle:

- On startup, an existing data/transcriptions/asnorm_cohort.npy is loaded directly.
- Otherwise, the service scans persisted transcription results and emb_*.npy files to build and save a cohort.
- After each enroll / update, the background cohort-rebuild thread wakes every 60 seconds and rebuilds after the latest enrollment is at least 30 seconds old.
- v0.8.4 protects larger loaded or persisted cohorts during automatic rebuilds: clearing transcription results, having only a few embeddings, or having fewer source embeddings than the current cohort will not shrink the cohort automatically.
- POST /api/voiceprints/rebuild-cohort is an explicit manual rebuild and uses the currently available embeddings immediately.

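
The rebuild timing and shrink protection in the list above can be sketched as two predicates. Both names are hypothetical, and treating an equal-sized candidate as acceptable is an assumption.

```python
from typing import Optional


def should_rebuild(now: float, last_enroll_at: Optional[float], min_quiet_sec: float = 30.0) -> bool:
    """Checked on each 60-second wake-up of the background thread."""
    # Rebuild only once the latest enrollment is at least 30 seconds old.
    return last_enroll_at is not None and now - last_enroll_at >= min_quiet_sec


def accept_rebuilt_cohort(current_size: int, candidate_size: int, manual: bool = False) -> bool:
    """Automatic rebuilds never shrink the cohort; manual rebuilds always apply."""
    if manual:
        return True  # explicit POST /api/voiceprints/rebuild-cohort
    return candidate_size >= current_size
```

After transcription results are cleared, an automatic rebuild that finds only 3 source embeddings leaves a 50-embedding cohort untouched, while the manual endpoint would replace it.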
Stable anchors in completed transcription results:

- status: persisted result status is completed; the job endpoint can also report queued, converting, denoising, transcribing, identifying, or failed.
- segments[].speaker_label: raw pyannote cluster label, the stable key for enrollment and later correction.
- segments[].speaker_name: display name; falls back to speaker_label when unmatched, and is disambiguated when multiple clusters hit the same enrolled name.
- segments[].speaker_id: matched voiceprint ID, or null.
- segments[].similarity: speaker-level match score; raw cosine or AS-norm z-score depending on cohort state.
- segments[].words: optional word-level alignment.
- Top-level alignment: optional forced-alignment metadata, sanitized.
- Top-level params: effective per-job processing settings, including request overrides and service defaults used for this result.
- Top-level artifacts: optional artifact manifest listing stable / optional / experimental artifact filenames, roles, categories, media types, and speaker_label values; it never exposes local paths, hosts, tokens, or debug data.
- speaker_map: diarization cluster to voiceprint match map; manual segment corrections do not rewrite it.
- unique_speakers: deduplicated current segment display names.

New fields are added under the optional-field principle. Clients should ignore
unknown fields and tolerate missing words, alignment, artifacts, and
warning.
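
A tolerant client reads only the fields it needs and treats everything optional as possibly absent. The sketch below uses hypothetical helper names and an invented result shape for illustration; only the field names come from this document.

```python
from typing import Any, Dict, List, Optional


def summarize_segments(result: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize segments without assuming optional fields exist."""
    rows = []
    for seg in result.get("segments", []):
        rows.append({
            # speaker_label is the stable key; speaker_name is for display.
            "label": seg.get("speaker_label"),
            "display": seg.get("speaker_name") or seg.get("speaker_label"),
            "matched": seg.get("speaker_id") is not None,
            "word_count": len(seg.get("words") or []),  # words may be absent
        })
    return rows


def alignment_status(result: Dict[str, Any]) -> Optional[str]:
    """Return alignment.status if present, else None."""
    return (result.get("alignment") or {}).get("status")
```

Unknown keys are never inspected, so a result that gains new fields in a later release passes through unchanged.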
v0.8.4 has internal live validation covering the optional Rust kernel foundation, selected voiceprint scoring, result post-processing, artifact/status helper contracts, Docker packaging smoke, and public release-scan gates. Public documentation records only these behavior categories, not real task names, sample names, job IDs, speaker IDs, hosts, logs, or paths.
v0.7.6 has internal live validation covering /healthz availability during GPU
cleanup, WhisperX forced-alignment runtime isolation and model reuse, short
single-segment stock outro hallucination filtering, and the embedding path that
loads the normalized WAV once and slices it by diarization turns. Public
documentation records only these behavior categories, not real task names,
sample names, job IDs, speaker IDs, hosts, logs, or paths.
v0.7.4 has internal live validation covering transcription cleanup while retaining voiceprints: as long as the voiceprint DB and a loaded or persisted AS-norm cohort remain, automatic background rebuilds do not shrink a larger cohort to an empty or undersized one. New-voice enroll, cohort rebuild, probe hit, and cleanup entrypoints were also covered. The current public validation does not have trustworthy >=10 cohort evidence, so it only proves the voiceprint API, cohort refresh entrypoint, and raw-cosine fallback are usable; it must not claim the probe exercised the full AS-norm scoring path. Full AS-norm validation requires cohort size >=10. Public documentation records only the behavioral conclusion, not real task names, sample names, job IDs, speaker IDs, hosts, or paths.