Skip to content

Latest commit

 

History

History
245 lines (199 loc) · 16 KB

File metadata and controls

245 lines (199 loc) · 16 KB

Full Configuration and Tuning Reference

简体中文 | English

This is the public configuration index for VoScript v0.8.4. It covers the environment variables that the current code reads, the per-request override semantics of POST /api/transcribe, and internal defaults that are documented for operators but are not stable public knobs yet. Do not assume a Whisper, diarization, or AS-norm env var exists unless it is listed here.

Configuration Sources and Precedence

Layer Example Precedence
API request field denoise_model=deepfilternet, snr_threshold=8 Per-job only, wins over service env
Container environment .env injected through docker-compose.yml Service-level default
Code default app/config.py Fallback when env is empty or invalid

POST /api/transcribe currently exposes only language, min_speakers, max_speakers, denoise_model, snr_threshold, and no_repeat_ngram_size. Other pipeline settings may have internal defaults, but they are not public API parameters yet.

Service Basics

Variable Default Effect
API_KEY empty When set, all endpoints except /, /healthz, /docs, /redoc, /openapi.json, and /static/* require Authorization: Bearer <key> or X-API-Key: <key>.
ALLOW_NO_AUTH 0 Used only when API_KEY is empty. 1 acknowledges unauthenticated mode and suppresses the startup warning; it does not add protection.
CORS_ALLOW_ORIGINS * Comma-separated CORS origins. Narrow this before exposing the service outside a trusted network.
HOST_PORT 8780 Host port published by compose; not an app runtime env var.
MAX_UPLOAD_BYTES 2147483648 Per-upload byte cap. Larger uploads return 413 and the partial file is removed.
DATA_DIR /data In-container data root for transcriptions, uploads, and voiceprints. Compose mounts host ./data to /data by default.
MODEL_CACHE_DIR ./models Host model-cache directory used by compose, mounted to /cache and read-only /models.
APP_UID / APP_GID 1000 / 1000 Container runtime user. Host DATA_DIR and MODEL_CACHE_DIR must be writable by this uid/gid.
DEVICE cuda Pipeline inference device. Use cpu for CPU-only, macOS, or non-NVIDIA hosts.
CUDA_VISIBLE_DEVICES unset Optional NVIDIA visibility limit. By default this variable is not injected and compose requests every Docker-exposed GPU. Add it only through docker-compose.override.yml or another explicit operator env override when you need to restrict the visible GPU set; inside the container, cuda:0 is the first visible GPU and may not be physical host GPU0. For CPU-only mode, set DEVICE=cpu.
FFMPEG_TIMEOUT_SEC 1800 ffmpeg conversion timeout in seconds; timeout returns 504.
JOBS_MAX_CACHE 200 In-memory job LRU limit. Evicted completed jobs remain queryable from disk status.json / result.json.
MODEL_IDLE_TIMEOUT_SEC 180 GPU model idle-unload timeout, defaulting to 180 seconds (3 minutes). Set 0 to disable idle unload and keep models resident. When enabled, loaded models are released only after the serialized GPU runtime has been idle for this many seconds; on the next reload, ASR, diarization, and embedding each choose the visible CUDA device with the most free memory during their own lazy load.
RUST_KERNEL_MODE off Optional Rust-backed provider/kernel mode. off keeps Python implementations; required makes selected Rust-backed paths import and run successfully or fail closed. The current selected paths are voiceprint scoring, result post-processing, and artifact manifest helper contracts; CI / Docker packaging still validates the Rust extension directly when the runtime default is off.

MODELS_DIR and LANGUAGE are defined in the config module, but v0.8.4's main HTTP transcription path does not use them as stable public tuning knobs: Whisper local checkpoint lookup still expects /models/faster-whisper-<WHISPER_MODEL>, and default language should be controlled with the request language field or left empty for auto-detection.

Idle unload is a memory-pressure feature, not a throughput feature. The unload daemon shares the same serialized GPU semaphore as transcription work and rechecks the idle timestamp after acquiring it, so a queued or freshly completed job cannot be unloaded based on a stale pre-wait observation. CUDA cache release is best-effort and is skipped safely on CPU-only hosts.

Docker Compose requests all available NVIDIA GPUs with count: all by default and does not set CUDA_VISIBLE_DEVICES, so the container can see every Docker-exposed GPU. DEVICE=cuda lets each model choose the best visible GPU when it lazy-loads; DEVICE=cuda:0 or another indexed value pins to that in-container visible index and will not auto-move to another GPU.

To restrict visibility, create a local, uncommitted docker-compose.override.yml:

services:
  voscript:
    environment:
      - CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}

Then set CUDA_VISIBLE_DEVICES=1,3 in your local .env or launch environment. After that, in-container cuda:0 maps to the first device in the visible set, not necessarily physical host GPU0.

Hugging Face and Model Cache

Variable Default Effect
HF_TOKEN empty Token for pyannote / WeSpeaker gated models. Accept the relevant Hugging Face model terms first.
HF_ENDPOINT https://huggingface.co Hugging Face Hub endpoint; use a trusted mirror in restricted networks.
HF_HUB_DISABLE_XET 1 Bypasses hf-xet/CAS downloads by default. Set 0 only when your environment supports hf-xet reliably.
HF_HUB_ETAG_TIMEOUT 3 Hub metadata timeout in seconds, so slow networks fall back to local cache quickly.
HF_HOME / HUGGINGFACE_HUB_CACHE / TORCH_HOME / XDG_CACHE_HOME /cache-based paths Dockerfile cache defaults. Usually configure the host mount through MODEL_CACHE_DIR instead of overriding these one by one.

faster-whisper first looks for /models/faster-whisper-<WHISPER_MODEL>; if it does not exist, it loads by model name. pyannote and WeSpeaker first try a complete local Hugging Face snapshot and fall back to Hub loading only when the cache is incomplete.

Whisper / ASR

Setting Default Supported Today
WHISPER_MODEL large-v3 Service env. Supports tiny, base, small, medium, large-v3, and other faster-whisper model names.
DEVICE cuda Service env. cuda / cuda:<index> uses float16; cpu uses int8. Compute type is not separately configurable yet.
API language auto-detect Per-request field. Empty means auto-detect and use the Mandarin-oriented initial prompt.
API no_repeat_ngram_size 0 Per-request field. Values >=3 are passed to faster-whisper to suppress n-gram repetition; non-integers return 422.

Current internal ASR defaults are beam_size=5, vad_filter=True, vad_parameters.min_silence_duration_ms=500, and condition_on_previous_text=False. These do not have env or API fields in v0.8.4. Do not configure nonexistent variables such as WHISPER_BEAM_SIZE, WHISPER_COMPUTE_TYPE, or WHISPER_VAD_*.

Denoising

Setting Default Effect
DENOISE_MODEL none Service default backend: none, deepfilternet, or noisereduce. Unknown values log a warning and skip denoising.
DENOISE_SNR_THRESHOLD 10.0 DeepFilterNet SNR gate in dB. When deepfilternet is selected, audio estimated at or above this value is skipped to avoid degrading clean recordings; noisereduce does not use this gate.
API denoise_model omitted Omitted means inherit DENOISE_MODEL; explicit none disables denoising for this job only.
API snr_threshold omitted Omitted means inherit DENOISE_SNR_THRESHOLD; explicit values override the DeepFilterNet SNR gate for this job only.

v0.8.4 defaults to DENOISE_MODEL=none for clean meeting-recorder audio. Enable deepfilternet or noisereduce only for noisy environments, either per job or as a service default. If you need clean recordings to be skipped automatically, use deepfilternet; noisereduce runs whenever it is selected.

Diarization and Alignment

Setting Default Effect
API min_speakers / max_speakers 0 Per-request speaker-count bounds. 0 means auto and is not passed to pyannote.
PYANNOTE_MIN_DURATION_OFF 0.5 pyannote _binarize.min_duration_off, used to merge short pauses and reduce over-segmentation. If the pyannote object does not support it, the service logs a warning and continues.
WHISPERX_ALIGN_DISABLED_LANGUAGES empty Comma-separated languages that skip forced alignment when no model override is present. Use only as a temporary operational fallback.
WHISPERX_ALIGN_DEVICE cpu Runtime device for WhisperX forced alignment. CPU is the default to isolate wav2vec2 alignment from GPU ASR / speaker-embedding runtimes; set to pipeline / asr / cuda / cuda:0 only after validating CUDA alignment stability.
WHISPERX_ALIGN_MODEL_MAP empty Comma-separated lang=model overrides, for example zh=org/model.
WHISPERX_ALIGN_MODEL_DIR empty Optional alignment model directory; passed through only when the installed WhisperX supports that parameter.
WHISPERX_ALIGN_CACHE_ONLY 0 When 1, requests cache-only alignment model loading, only when supported by the installed WhisperX.

Alignment is optional metadata. On success, results may include alignment.status=succeeded and segments[].words. If disabled or failed, the job still completes; words may be absent and alignment records skipped or failed using sanitized metadata. Clients must treat both fields as optional.

Embedding

Variable Default Effect
EMBEDDING_DIM 256 Voiceprint vector dimension used for DB and AS-norm cohort shape checks. Do not mix existing stores across dimensions.
MIN_EMBED_DURATION 1.5 Diarization turns shorter than this are ignored for speaker embedding extraction.
MAX_EMBED_DURATION 10.0 Longer turns are clipped to this window before embedding extraction.

Each speaker cluster uses up to the 10 longest usable chunks to produce an averaged embedding. Very short, fragmented, or noisy turns reduce enrollment and matching quality.

Voiceprints and AS-norm

Item Default Notes
VOICEPRINT_THRESHOLD 0.75 Base threshold for raw cosine mode. The effective threshold adapts by sample count and sample_spread.
Raw single-sample relaxation 0.05 One-sample speakers default to an effective threshold around 0.70. Internal default, not env.
Raw spread relaxation 3.0 * sample_spread, capped at 0.10 Multi-sample speakers with larger sample spread get a moderate relaxation. Internal default.
Raw absolute floor 0.60 Raw cosine auto-naming never accepts below this value. Internal default.
AS-norm activation 10 cohort embeddings When cohort size is below 10, ASNormScorer.score() falls back to raw cosine. Internal default.
AS-norm base 0.5 Z-score-like operating point once the cohort is large enough; not raw cosine. Internal default.
AS-norm top-1/top-2 margin 0.05 If the best normalized candidate is too close to the second candidate, the speaker remains unnamed. Internal default.
AS-norm cohort top_n 200 Number of nearest cohort impostors used for AS-norm statistics, capped by cohort size. Internal default.

similarity depends on cohort state:

  • Cohort < 10 or AS-norm unavailable: similarity is raw cosine, usually in [-1, 1].
  • Cohort >= 10: similarity is an AS-norm normalized score and may exceed 1 or be negative.
  • Only speaker_id != null means the candidate passed the effective threshold for the current mode; do not display similarity as a percentage.

Cohort lifecycle:

  • On startup, an existing data/transcriptions/asnorm_cohort.npy is loaded directly.
  • Otherwise, the service scans persisted transcription results and emb_*.npy files to build and save a cohort.
  • After each enroll / update, the background cohort-rebuild thread wakes every 60 seconds and rebuilds after the latest enrollment is at least 30 seconds old.
  • v0.8.4 protects larger loaded or persisted cohorts during automatic rebuilds: clearing transcription results, having only a few embeddings, or having fewer source embeddings than the current cohort will not shrink the cohort automatically.
  • POST /api/voiceprints/rebuild-cohort is an explicit manual rebuild and uses the currently available embeddings immediately.

Result Contract

Stable anchors in completed transcription results:

  • status: persisted result status is completed; the job endpoint can also report queued, converting, denoising, transcribing, identifying, or failed.
  • segments[].speaker_label: raw pyannote cluster label, the stable key for enrollment and later correction.
  • segments[].speaker_name: display name; falls back to speaker_label when unmatched, and is disambiguated when multiple clusters hit the same enrolled name.
  • segments[].speaker_id: matched voiceprint ID, or null.
  • segments[].similarity: speaker-level match score; raw cosine or AS-norm z-score depending on cohort state.
  • segments[].words: optional word-level alignment.
  • Top-level alignment: optional forced-alignment metadata, sanitized.
  • Top-level params: effective per-job processing settings, including request overrides and service defaults used for this result.
  • Top-level artifacts: optional artifact manifest listing stable / optional / experimental artifact filenames, roles, categories, media types, and speaker_label values; it never exposes local paths, hosts, tokens, or debug data.
  • speaker_map: diarization cluster to voiceprint match map; manual segment corrections do not rewrite it.
  • unique_speakers: deduplicated current segment display names.

New fields are added under the optional-field principle. Clients should ignore unknown fields and tolerate missing words, alignment, artifacts, and warning.

v0.8.4 Validation Wording

v0.8.4 has internal live validation covering the optional Rust kernel foundation, selected voiceprint scoring, result post-processing, artifact/status helper contracts, Docker packaging smoke, and public release-scan gates. Public documentation records only these behavior categories, not real task names, sample names, job IDs, speaker IDs, hosts, logs, or paths.

v0.7.6 Validation Wording

v0.7.6 has internal live validation covering /healthz availability during GPU cleanup, WhisperX forced-alignment runtime isolation and model reuse, short single-segment stock outro hallucination filtering, and the embedding path that loads the normalized WAV once and slices it by diarization turns. Public documentation records only these behavior categories, not real task names, sample names, job IDs, speaker IDs, hosts, logs, or paths.

v0.7.4 Validation Wording

v0.7.4 has internal live validation covering transcription cleanup while retaining voiceprints: as long as the voiceprint DB and a loaded or persisted AS-norm cohort remain, automatic background rebuilds do not shrink a larger cohort to an empty or undersized one. New-voice enroll, cohort rebuild, probe hit, and cleanup entrypoints were also covered. The current public validation does not have trustworthy >=10 cohort evidence, so it only proves the voiceprint API, cohort refresh entrypoint, and raw-cosine fallback are usable; it must not claim the probe exercised the full AS-norm scoring path. Full AS-norm validation requires cohort size >=10. Public documentation records only the behavioral conclusion, not real task names, sample names, job IDs, speaker IDs, hosts, or paths.

Related Docs