An MLX audio experimentation hub for Apple Silicon — covering LoRA/QLoRA finetuning of TTS and speech-to-speech models, all runnable locally on M-series Macs.
Finetuning: Qwen3-TTS (0.6B / 1.7B), PersonaPlex 7B, CSM/Sesame, LFM 2.5 Audio 1.5B (ASR → function calling).
Inference: MiniMind-O (speech-to-speech, 118M), Indic Parler-TTS (Hindi TTS).
Use this when: you want the model to speak a new language or accent (e.g. Hindi). Voice identity still comes from a reference audio clip at inference — you are teaching the language, not baking in a specific voice.
python scripts/train.py --config configs/qwen3_tts_hindi.yaml

How it works:

- Applies LoRA adapters to all attention + MLP layers inside the `talker` transformer
- Loss: `main_codec_loss + 0.3 × sub_talker_loss` (teacher-forced codec prediction)
- Does not use `ref_audio` during training; speaker identity is irrelevant
- After training: load the adapter at inference with any ref audio to clone any voice in the new language
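The weighted loss described above can be sketched in plain Python. This is a minimal illustration, not the repo's implementation: `cross_entropy`, `mean_cross_entropy` and `combined_loss` are hypothetical helper names, logits are plain lists rather than MLX arrays, and only the 0.3 weight comes from the description.

```python
import math

SUB_TALKER_WEIGHT = 0.3  # weight stated in the loss description


def cross_entropy(logits, target):
    """Cross-entropy for one position: -log softmax(logits)[target]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def mean_cross_entropy(logits_seq, targets):
    """Average teacher-forced cross-entropy over a sequence of positions."""
    losses = [cross_entropy(l, t) for l, t in zip(logits_seq, targets)]
    return sum(losses) / len(losses)


def combined_loss(main_logits, main_targets, sub_logits, sub_targets):
    """main_codec_loss + 0.3 * sub_talker_loss (hypothetical helper)."""
    main = mean_cross_entropy(main_logits, main_targets)
    sub = mean_cross_entropy(sub_logits, sub_targets)
    return main + SUB_TALKER_WEIGHT * sub


# Toy example: 2 positions, vocabulary of 3 codec tokens.
main_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
sub_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
loss = combined_loss(main_logits, [0, 1], sub_logits, [2, 2])
print(f"combined loss: {loss:.4f}")
```

The sub-talker term acts as an auxiliary signal: with a weight below 1 it shapes the gradients without dominating the main codec objective.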
Good for: Hindi, regional languages, new accents, domain-specific speech styles
Use this when: you want to permanently bake a specific person's voice into the model so it speaks as that person without needing a reference clip at inference.
# Step 1 — train
python scripts/train.py --config configs/qwen3_tts_speaker.yaml
# Step 2 — bake the speaker identity into codec_embedding[3000]
python scripts/bake_speaker_embedding.py \
--config configs/qwen3_tts_speaker.yaml \
--checkpoint checkpoints/qwen3-speaker/checkpoint-final \
--output checkpoints/qwen3-speaker/custom_voice_model

How it works (mirrors official `sft_12hz.py`):

- Every training sample must have a `ref_audio` field pointing to a clip of the target speaker
- During each forward pass, `model.speaker_encoder` extracts a speaker embedding from the ref audio mel spectrogram
- The speaker embedding is injected as a single positional token into the codec prefix (between the think/lang section and `[pad, bos]`), matching the official `sft_12hz.py` positional injection approach
- LoRA adapters learn to generate codec tokens conditioned on that speaker identity
- After training, `bake_speaker_embedding.py` averages the speaker embeddings across the training set and writes the mean into `talker.model.codec_embedding.weight[3000]` (the reserved custom-voice slot), then patches `config.json` with `tts_model_type: custom_voice`
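The baking step amounts to a mean over per-sample embeddings followed by a row write. The sketch below shows that idea with plain lists. `mean_embedding` and `bake_speaker` are hypothetical names, the table is a toy stand-in for `codec_embedding.weight`, and the real script may differ in detail.

```python
CUSTOM_VOICE_SLOT = 3000  # reserved row named in the description


def mean_embedding(embeddings):
    """Element-wise mean of equally sized speaker embedding vectors."""
    if not embeddings:
        raise ValueError("need at least one speaker embedding")
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def bake_speaker(codec_embedding, embeddings, slot=CUSTOM_VOICE_SLOT):
    """Write the mean speaker embedding into one row of the embedding table."""
    mean = mean_embedding(embeddings)
    if len(mean) != len(codec_embedding[slot]):
        raise ValueError("speaker embedding and table row differ in size")
    codec_embedding[slot] = mean
    return codec_embedding


# Toy table: 3001 rows of width 4 (a real table is much wider).
table = [[0.0] * 4 for _ in range(3001)]
per_sample = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]
bake_speaker(table, per_sample)
config_patch = {"tts_model_type": "custom_voice"}  # key/value quoted in the text
print(table[CUSTOM_VOICE_SLOT])
```

Averaging across the training set smooths out clip-to-clip variation, so the baked row represents the speaker rather than any single recording.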
Good for: cloning a single specific voice, podcast/audiobook voice replication, personal TTS
| | Language Adaptation | Speaker Voice Cloning |
|---|---|---|
| Config | `qwen3_tts_hindi.yaml` | `qwen3_tts_speaker.yaml` |
| model_type | `qwen3_tts` | `qwen3_tts_speaker` |
| ref_audio in data | Not required | Required for every sample |
| Epochs | 10–15 | 3 (more risks overfitting) |
| LoRA rank | 8 | 16 |
| Effective batch | 32 | 16 |
| Post-training step | None | bake_speaker_embedding.py |
| At inference | Needs ref audio to pick a voice | Works without ref audio |
You can combine both: train with the language config first, then run the speaker config on top using the language-adapted checkpoint as the starting point.
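Because the speaker pipeline requires `ref_audio` on every sample, it can be worth checking a JSONL file before preprocessing. The following is a hedged sketch: `validate_speaker_jsonl` is a hypothetical helper that is not part of this repo, and the field names `audio`, `text` and `ref_audio` are taken from this README's JSONL examples.

```python
import json

REQUIRED_KEYS = ("audio", "text", "ref_audio")  # fields used by the speaker pipeline


def validate_speaker_jsonl(lines):
    """Return a list of (line_number, problem) tuples; an empty list means OK."""
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            record = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append((lineno, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((lineno, "expected a JSON object"))
            continue
        for key in REQUIRED_KEYS:
            if not record.get(key):
                problems.append((lineno, f"missing '{key}'"))
    return problems


sample = [
    '{"audio": "data/speaker/clip1.wav", "text": "Hello world", "ref_audio": "data/speaker/ref.wav"}',
    '{"audio": "data/speaker/clip2.wav", "text": "How are you"}',
]
for lineno, problem in validate_speaker_jsonl(sample):
    print(f"line {lineno}: {problem}")
```

For a file on disk, pass the open file object directly, since it iterates line by line.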
# Install dependencies
pip install mlx-audio soundfile scipy datasets transformers gradio pyyaml safetensors
# Verify pipeline works (no data needed)
python scripts/train.py --config configs/qwen3_tts_hindi.yaml --smoke-test

# 1. Download Hindi dataset
python scripts/prepare_hindi_dataset.py --source hf --output data/hindi
# 2. Pre-tokenize audio to codec IDs (run once — saves .codec.npy files)
python scripts/preprocess_dataset.py --input data/hindi/train.jsonl data/hindi/val.jsonl
# 3. Train
python scripts/train.py --config configs/qwen3_tts_hindi.yaml
# 4. Demo — compare before/after with a reference voice
python scripts/demo.py --adapter checkpoints/qwen3-hindi/checkpoint-best

Your JSONL must have `ref_audio` on every line, all pointing to the same target speaker:
{"audio": "data/speaker/clip1.wav", "text": "Hello world", "ref_audio": "data/speaker/ref.wav"}
{"audio": "data/speaker/clip2.wav", "text": "How are you", "ref_audio": "data/speaker/ref.wav"}

# 1. Pre-tokenize audio
python scripts/preprocess_dataset.py \
--input data/speaker/train.jsonl data/speaker/val.jsonl
# 2. Train (3 epochs recommended)
python scripts/train.py --config configs/qwen3_tts_speaker.yaml
# 3. Bake speaker embedding into the model
python scripts/bake_speaker_embedding.py \
--config configs/qwen3_tts_speaker.yaml \
--checkpoint checkpoints/qwen3-speaker/checkpoint-final \
--output checkpoints/qwen3-speaker/custom_voice_model
# 4. Use the baked model at inference (no ref_audio needed)
# The output dir contains adapters.safetensors + speaker_embedding.npy

mlx-audio-train/
│
├── scripts/
│ ├── train.py # Main finetuning entry point
│ ├── bake_speaker_embedding.py # Post-training: write speaker → codec_embedding[3000]
│ ├── preprocess_dataset.py # Pre-tokenize audio → .codec.npy files
│ ├── prepare_hindi_dataset.py # Download & format Hindi TTS data
│ └── demo.py # Gradio voice-cloning demo
│
├── configs/
│ ├── qwen3_tts_hindi.yaml # Pipeline 1: language adaptation
│ ├── qwen3_tts_speaker.yaml # Pipeline 2: speaker voice cloning
│ └── qwen3_tts_multilingual.yaml # Pipeline 3: multilingual (hi/ta/te/kn/mr)
│
├── train/
│ ├── lora.py # LoRA / QLoRA layer implementations
│ ├── trainer.py # Training loop (AdamW, grad accum, checkpointing)
│ └── losses/
│ └── codec_loss.py # qwen3_tts_loss, qwen3_tts_speaker_loss, csm_loss
│
├── data/
│ ├── audio_utils.py # Audio I/O, resampling, mel spectrogram
│ ├── base_dataset.py # JSONL dataset + BatchIterator (prefetch, length-sort)
│ └── processors/
│ ├── qwen3_tts.py # audio→codec tokens, text→IDs, ref_mel extraction
│ └── csm.py # CSM: Mimi RVQ codes + LLaMA tokenizer
│
└── checkpoints/ # Saved LoRA adapters
python scripts/train.py --config CONFIG [OPTIONS]
Options:
--smoke-test Run 5 steps with dummy data (no dataset needed)
--resume PATH Resume from a checkpoint directory
--lora-rank N Override LoRA rank
--lr FLOAT Override learning rate
--epochs N Override num_epochs
--max-steps N Stop after N optimizer steps

python scripts/bake_speaker_embedding.py \
--config CONFIG_YAML \
--checkpoint CKPT_DIR \
--output OUTPUT_DIR \
[--slot N] \
[--fuse-lora]
# --slot N: codec_embedding row to use (default 3000)
# --fuse-lora: merge LoRA into base weights

Pre-tokenizes audio to `.codec.npy` files once, so training skips the speech tokenizer entirely (major speedup + avoids OOM).

python scripts/preprocess_dataset.py --input data/hindi/train.jsonl [data/hindi/val.jsonl ...]

- `LoRALinear`: wraps `nn.Linear` (full-precision base)
- `QLoRALinear`: wraps `nn.QuantizedLinear` (quantized base + bf16 LoRA delta)
- `apply_lora(model, config)`: recursive in-place patching; scoped to `talker` for Qwen3-TTS
- `get_trainable_params(model)`: returns only `lora_a` / `lora_b` tensors (avoids 40+ GB gradient memory)
- `save_adapters` / `load_adapters`: tiny safetensors checkpoint (23 MB for rank-8, 45 MB for rank-16)
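The LoRA pieces listed above follow a common pattern, sketched here in dependency-free Python. `ToyLoRALinear` and `trainable_params` are illustrative stand-ins, not the classes in `train/lora.py`. The `alpha / rank` scaling, the zero-initialised `lora_b` and the matrix shapes are the usual LoRA conventions and are assumptions about this repo.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyLoRALinear:
    """y = W x + (alpha / rank) * B (A x); W stays frozen, A and B train."""

    def __init__(self, weight, rank, alpha):
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight                                   # frozen base weight
        self.lora_a = [[0.01] * in_dim for _ in range(rank)]   # shape (rank, in)
        self.lora_b = [[0.0] * rank for _ in range(out_dim)]   # shape (out, rank), zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.lora_b, matvec(self.lora_a, x))
        return [b + self.scale * d for b, d in zip(base, delta)]

    def parameters(self):
        return {"weight": self.weight, "lora_a": self.lora_a, "lora_b": self.lora_b}


def trainable_params(tree, prefix=""):
    """Flatten a nested dict and keep only the lora_a / lora_b leaves."""
    found = {}
    for name, value in tree.items():
        path = f"{prefix}.{name}" if prefix else name
        if isinstance(value, dict):
            found.update(trainable_params(value, path))
        elif name in ("lora_a", "lora_b"):
            found[path] = value
    return found


layer = ToyLoRALinear(weight=[[1.0, 0.0], [0.0, 1.0]], rank=1, alpha=2.0)
out = layer([3.0, 4.0])  # lora_b is zero, so the output equals the base output
params = trainable_params({"talker": {"q_proj": layer.parameters()}})
print(out, sorted(params))
```

Only the filtered tensors need gradients and optimizer state, which is why the base weights contribute no gradient memory.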
- AdamW + gradient accumulation + gradient norm clipping
- Cosine / linear / constant LR schedule with warmup
- Per-step and per-epoch checkpointing (LoRA adapters only)
- Batched gradient norm computation (single GPU sync per optimizer step)
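Two of the trainer features above, the warmup schedule and gradient norm clipping, can be sketched in a few lines. This assumes linear warmup followed by cosine decay to zero and clipping by the global L2 norm; the function names and the warmup length are hypothetical, and `trainer.py` may differ in detail.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps):
    """Linear warmup to base_lr, then cosine decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm):
    """Scale all gradients so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [[g * scale for g in grad] for grad in grads], total_norm


# Learning rate 2e-5 as in the configs; warmup of 10 steps is an assumed value.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at_step(step, base_lr=2e-5, warmup_steps=10, total_steps=100))

clipped, norm = clip_by_global_norm([[3.0], [4.0]], max_norm=1.0)
print(norm, clipped)
```

Computing one norm over all gradients, rather than one per tensor, is what allows a single synchronisation per optimizer step.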
| Function | Pipeline | Description |
|---|---|---|
| `qwen3_tts_loss` | Language Adaptation | `main_loss + 0.3 × sub_talker_loss`; no speaker conditioning |
| `qwen3_tts_speaker_loss` | Speaker Cloning | Same loss + speaker embedding injected from `ref_mel`; all 16 codec levels as input context |
| `csm_loss` | CSM | Cross-entropy on first codebook head |
No PyTorch dependency — uses soundfile + scipy.signal.
- `load_audio(path, target_sr)`: load + resample any audio file
- `normalize_loudness(audio)`: RMS normalization to −23 dBFS
- `trim_silence(audio)`: energy-based silence trimming
- `mel_spectrogram(audio, sr)`: log-mel spectrogram matching Qwen3-TTS speaker encoder params (`n_fft=1024`, `n_mels=128`, `hop=256`)
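RMS normalization to a target level reduces to a single gain factor. The sketch below shows the arithmetic on a plain list of samples. It assumes the convention that an RMS of 1.0 corresponds to 0 dBFS, and it is an illustration rather than the code in `audio_utils.py`.

```python
import math

TARGET_DBFS = -23.0  # target level stated for normalize_loudness


def rms(samples):
    """Root-mean-square amplitude of a non-empty sample list."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def dbfs(samples):
    """RMS level in dBFS, assuming an RMS of 1.0 equals 0 dBFS."""
    return 20.0 * math.log10(rms(samples))


def normalize_loudness(samples, target_dbfs=TARGET_DBFS):
    """Scale samples so their RMS level equals target_dbfs."""
    if not samples:
        return []
    current = rms(samples)
    if current == 0.0:
        return list(samples)  # pure silence: nothing to scale
    target_rms = 10.0 ** (target_dbfs / 20.0)
    gain = target_rms / current
    return [s * gain for s in samples]


# One second of a 440 Hz tone at 16 kHz with peak amplitude 0.5.
tone = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
normalized = normalize_loudness(tone)
print(f"{dbfs(tone):.2f} dBFS -> {dbfs(normalized):.2f} dBFS")
```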
- `sort_by_length=True`: groups similar-length sequences to minimise padding waste
- `prefetch=2`: loads the next N batches in a background thread, overlapping CPU I/O with GPU compute
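Both options can be illustrated with the standard library. The sketch below is a simplified stand-in for the `BatchIterator` in `base_dataset.py`, and all function names in it are hypothetical. It measures how much padding length-sorting saves on a toy set of sequence lengths and shows a bounded-queue prefetcher.

```python
import queue
import threading


def length_sorted_batches(lengths, batch_size):
    """Group sample indices into batches of similar length."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def padding_waste(lengths, batches):
    """Count padded positions when each batch is padded to its longest item."""
    return sum(
        max(lengths[i] for i in b) * len(b) - sum(lengths[i] for i in b)
        for b in batches
    )


def prefetch(iterable, depth=2):
    """Yield items while a background thread loads up to `depth` items ahead."""
    sentinel = object()
    buffer = queue.Queue(maxsize=depth)

    def worker():
        for item in iterable:
            buffer.put(item)  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item


lengths = [10, 95, 12, 90, 11, 100]
naive = [[0, 1], [2, 3], [4, 5]]
sorted_batches = length_sorted_batches(lengths, batch_size=2)
print(padding_waste(lengths, naive), padding_waste(lengths, sorted_batches))
for batch in prefetch(sorted_batches, depth=2):
    pass  # a training step would consume the batch here
```

The bounded queue caps memory use: the loader thread pauses once `depth` batches are waiting. This toy version does not forward exceptions raised in the worker thread.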
| Key | Value | Notes |
|---|---|---|
| `model.model_type` | `qwen3_tts` | |
| `lora.rank` | 8 | |
| `trainer.num_epochs` | 15 | |
| `trainer.batch_size` | 2 | M-series Metal pressure |
| `trainer.grad_accumulation` | 16 | effective batch = 32 |
| `trainer.learning_rate` | 2e-5 | |
| `trainer.label_smoothing` | 0.1 | helps with character diversity |
| `processor.include_ref_mel` | false | not needed for language training |
| Key | Value | Notes |
|---|---|---|
| `model.model_type` | `qwen3_tts_speaker` | enables speaker loss |
| `lora.rank` | 16 | more capacity for speaker detail |
| `trainer.num_epochs` | 3 | more epochs → overfitting risk |
| `trainer.grad_accumulation` | 8 | effective batch = 16 |
| `trainer.learning_rate` | 2e-5 | matches official |
| `trainer.label_smoothing` | 0.0 | sharp speaker predictions |
| `processor.include_ref_mel` | true | enables mel → speaker_encoder path |
| `processor.speaker_name` | `custom_speaker` | written into `config.json` by bake script |
These models are trainable via scripts/train.py with a YAML config:
| Model | HF ID | model_type | Status |
|---|---|---|---|
| Qwen3-TTS 0.6B | `mlx-community/Qwen3-TTS-12Hz-0.6B-Base-8bit` | `qwen3_tts` / `qwen3_tts_speaker` | Working |
| Qwen3-TTS 1.7B | `mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit` | `qwen3_tts` / `qwen3_tts_speaker` | Working (same pipeline, change `model_id`) |
| PersonaPlex 7B | `mlx-community/personaplex` | `personaplex` | Working: full-duplex, Hindi LoRA adaptation |
| CSM / Sesame | `mlx-community/csm-1b` | `csm` | Processor + loss implemented |
| Kokoro | — | — | Planned |
| Chatterbox | — | — | Planned |
These live under models/ and have their own scripts. They are not trained via scripts/train.py.
Compact 118M speech-to-speech model ported from jingyaogong/minimind-o. Thinker (LM) + Talker (acoustic head) + SenseVoice + Mimi. Optionally handles image input via SigLIP2.
| Input | Output | Notes |
|---|---|---|
| Speech (mic) | Speech | Push-to-talk demo, SenseVoice → Mimi |
| Text | Text + Speech | Thinker generates text; Talker generates Mimi codes simultaneously |
| Image + Text | Text + Speech | SigLIP2-p32-256 (64 patch tokens), optional |
# First run: downloads + converts weights automatically
python scripts/minimind_o_test_text.py
# Web demo (chat UI with streaming audio + sample images for vision)
pip install flask flask-cors flask-sock
python scripts/minimind_o_web_demo.py # open http://localhost:7860
# Push-to-talk mic demo (terminal)
python scripts/minimind_o_mic_demo.py
# Verify weight conversion is correct
python scripts/minimind_o_verify_alignment.py
# Batch inference (text / audio / voice-clone modes)
python scripts/minimind_o_eval.py --model_dir out/ --mode 0

See `models/minimind_o/README.md` for full documentation and the multilingual → full-duplex development roadmap.
MLX port of Indic Parler-TTS optimised for Apple Silicon. Fixes sampling bugs from the upstream port and applies @mx.compile-decorated top-k/top-p kernels from mlx_lm for faster generation.
| Input | Output | Notes |
|---|---|---|
| Text (Hindi + description prompt) | Speech | Parler-style description-conditioned TTS |
Performance after optimisation:
- Short sentence: 2.5s audio in 4.1s (was 0.3s audio in 7.7s — broken loop)
- Long sentence: 5.1s audio in 3.9s — generation time now scales with content length
from models.indic_parler_tts import load_model, generate
model, tokenizer = load_model("ai4bharat/indic-parler-tts")
audio = generate(model, tokenizer, text="आज कैसे हो?", description="A female speaker...")

PersonaPlex is a 7B full-duplex speech model (simultaneous ASR + TTS). This repo supports LoRA finetuning it for Hindi language adaptation.
# 1. Prepare paired Hindi data (manifest.json + tokens/*.npz)
python scripts/prepare_personaplex_dataset.py
# 2. Train
python scripts/train.py --config configs/personaplex_hindi.yaml
# 3. Run full-duplex inference (in the personaplex-mlx repo)
python -m personaplex_mlx.local_web \
--adapter-weights /path/to/checkpoints/personaplex-hindi/checkpoint-best/adapters.npz

| Component | Trainable | Size | Notes |
|---|---|---|---|
| Transformer attention (`in_proj`, `out_proj`) | LoRA A/B only | ~29 MB | rank-16 |
| Depformer bridge (`linear_in`) | LoRA A/B only | small | bridges transformer→depformer; must match training at inference |
| Audio embeddings (`audio_embs`) | Full weights | ~268 MB | 16 codebooks × 2049 tokens |
| Vocab projection (`text_linear`) | Frozen | 0 MB | see note below |
| Output norm (`out_norm`) | Full weights | tiny | |
The text_linear layer is a (32000 × 4096) = 131M parameter vocabulary projection. Training it on a small dataset (~1K samples) causes severe overfitting — the model memorises training text patterns and produces repetitive garbage (commas, byte tokens) at inference.
Always set freeze_text_linear: true in the config unless you have 10K+ samples:
lora:
  rank: 16
  alpha: 16.0
  dropout: 0.0
  train_depformer: false
  freeze_text_linear: true # freeze 131M vocab projection; prevents overfitting on small datasets

With `freeze_text_linear: true`, adapter size drops from ~560 MB to ~297 MB and inference quality is significantly better on small datasets.
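The sizes quoted in this section can be cross-checked with simple arithmetic. The sketch assumes 2 bytes per parameter (bf16 or fp16 storage) and assumes the audio embeddings share the 4096 width of `text_linear`; neither is stated explicitly here.

```python
BYTES_PER_PARAM = 2  # assumption: weights saved as bf16/fp16


def size_mb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Storage size in megabytes (10**6 bytes)."""
    return num_params * bytes_per_param / 1e6


text_linear_params = 32000 * 4096      # shape quoted for text_linear
audio_embs_params = 16 * 2049 * 4096   # 16 codebooks x 2049 tokens; width 4096 is assumed

print(f"text_linear: {text_linear_params / 1e6:.0f}M params, "
      f"{size_mb(text_linear_params):.0f} MB")
print(f"audio_embs:  {audio_embs_params / 1e6:.0f}M params, "
      f"{size_mb(audio_embs_params):.0f} MB")
```

Under these assumptions `text_linear` accounts for roughly 262 MB, which lines up with the drop from ~560 MB to ~297 MB when it is frozen.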
| Samples | Expected quality |
|---|---|
| ~1K | Overfits; fails on unseen Hindi phonemes |
| ~3–5K | Reasonable Hindi phoneme coverage |
| ~10K+ | Solid generalisation |
Use scripts/download_indicvoices.py or scripts/download_common_voice.py to pull more Hindi data.
The official sft_12hz.py is full fine-tuning on CUDA with PyTorch + HuggingFace Accelerate. This repo is LoRA / QLoRA on Apple Silicon with MLX.
| | Official | This repo |
|---|---|---|
| Framework | PyTorch + Accelerate | MLX |
| Training mode | Full fine-tune (all weights) | LoRA (1–2% of weights) |
| Precision | bfloat16 | 8-bit quantised base + bf16 LoRA delta |
| Effective batch | 8 (bs=2 × accum=4) | 16–32 |
| Epochs | 3 | 3 (speaker) / 10–15 (language) |
| Speaker injection | Position 6 in dual-channel format | Positional token in codec prefix (matches official) |
| All 16 codec levels | Yes (additive input embeddings) | No (first codebook only as input; code_predictor is an auxiliary loss head) |
| Post-training step | Bake speaker to `codec_embedding[3000]` | `bake_speaker_embedding.py` (same) |
| Checkpoint size | Full model (~1.2 GB) | Adapters only (~46 MB) |
LFM 2.5 Audio is a hybrid Mamba+Transformer 1.5B speech-to-speech model from LiquidAI. This repo adds a LoRA fine-tuning pipeline that teaches it to output structured function calls from audio input — e.g. spoken "switch on the hall light" → HassLightTurnOn|$area=hall.
# 200-step smoke test (~5 min on M-series)
python scripts/train.py --config configs/lfm_audio_asr_test.yaml
# Full 10k-step run (~12 hours on M-series, keep Mac awake)
caffeinate -i python scripts/train.py --config configs/lfm_audio_asr_10k.yaml
# Evaluate against the OHF-Voice test split
python scripts/lfm_asr_eval.py \
--adapter checkpoints/lfm-audio-asr-10k/checkpoint-final \
--samples-per-function 10
# Interactive demo
python scripts/lfm_asr_demo.py --adapter checkpoints/lfm-audio-asr-10k/checkpoint-final
# open http://localhost:7860

Evaluated on `Paulescu/OHF-Voice-audio-20260504` test split, 205 samples across 41 functions.
| Metric | LoRA (this repo) | Full FT (LiquidAI cookbook, A100) |
|---|---|---|
| Format compliance | 99.0% | 99.7% |
| Function-name accuracy | 82.0% | 98.8% |
| Argument accuracy | 61.0% | ~97% |
| Trainable params | 884K / 169M (0.52%) | 100% |
| Hardware | Apple Silicon | A100 80GB |
| Training time | ~12 hours | ~2 hours |
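The three accuracy rows can be computed from predicted and reference strings along the following lines. The `Name|$key=value` format follows the `HassLightTurnOn|$area=hall` example in this section. The `|` separator between multiple arguments, the helper names and the exact metric definitions are assumptions, not taken from `scripts/lfm_asr_eval.py`.

```python
def parse_call(text):
    """Parse 'Name|$key=value|$key=value' into (name, args), or return None."""
    parts = text.strip().split("|")
    name = parts[0]
    if not name.isidentifier():
        return None
    args = {}
    for part in parts[1:]:
        if not part.startswith("$") or "=" not in part:
            return None
        key, value = part[1:].split("=", 1)
        if not key:
            return None
        args[key] = value
    return name, args


def score(predictions, references):
    """Return format compliance, function-name accuracy and argument accuracy."""
    if not references:
        raise ValueError("need at least one reference")
    total = len(references)
    well_formed = name_ok = args_ok = 0
    for pred, ref in zip(predictions, references):
        parsed, expected = parse_call(pred), parse_call(ref)
        if parsed is None:
            continue  # malformed output counts against every metric
        well_formed += 1
        if expected is not None and parsed[0] == expected[0]:
            name_ok += 1
            if parsed[1] == expected[1]:
                args_ok += 1  # arguments only count when the name is right
    return {
        "format": well_formed / total,
        "function_name": name_ok / total,
        "arguments": args_ok / total,
    }


preds = ["HassLightTurnOn|$area=hall", "HassLightTurnOn|$area=kitchen", "turn on the light"]
refs = ["HassLightTurnOn|$area=hall", "HassLightTurnOff|$area=kitchen", "HassLightTurnOn|$area=hall"]
print(score(preds, refs))
```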
Pre-trained adapter: akashicmarga/LFM2.5-Audio-1.5B-ASR-LoRA
Critical implementation detail: LFM 2.5 Audio's model.__call__ concatenates audio embeddings after the text sequence. For ASR fine-tuning this is wrong — the model can't attend to audio when predicting the assistant tokens. You must call model._prefill with explicit modalities so audio is placed in the user turn (before the assistant), matching the inference-time ChatState layout. Without this fix, val_loss looks reasonable but the model never produces function calls.
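Why the ordering matters can be shown with a toy model of causal attention, in which a segment only sees what comes before it. The segment names and the `sees_audio` helper are hypothetical; the sketch does not call `model._prefill` or any LFM API.

```python
def sees_audio(layout, segment):
    """Under causal attention, a segment can only attend to earlier segments."""
    position = layout.index(segment)
    return "audio" in layout[:position]


# Layout described as wrong for ASR: audio appended after the text sequence.
audio_appended = ["system", "user_text", "assistant", "audio"]

# Layout described as correct: audio inside the user turn, before the assistant.
audio_in_user_turn = ["system", "user_text", "audio", "assistant"]

for name, layout in [("appended", audio_appended), ("user turn", audio_in_user_turn)]:
    print(f"{name}: assistant sees audio = {sees_audio(layout, 'assistant')}")
```

In the first layout the assistant tokens are predicted without any audio in their context, which matches the symptom described: a plausible validation loss but no function calls at inference.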
Experiments worth running:
- Higher LoRA rank (32, 64): more capacity for exact argument patterns; expect better arg accuracy
- Full fine-tuning: remove LoRA and unfreeze all params — needs ~16 GB RAM minimum
- More data: OHF-Voice has 55K samples; we used 950. Scale up for large accuracy gains
- Longer training: the model was still improving at 10k steps; 20–30k may close the gap further
- LoRA on audio encoder: currently frozen; unfreezing may help with accented speech
- Hybrid Mamba + Transformer: attention at layers 2,5,8,10,12,14; Mamba elsewhere
- LoRA applies only to attention layers (18 patched layers)
- Audio path: 16kHz mel → ConformerEncoder (3× stride-2) → AUDIO_IN tokens in sequence
- Text path: standard tokenizer, chat template with `<|im_start|>` / `<|im_end|>`
- LoRA `_strip_empty` bug: Mamba layers produce `{}` gradients; the list length must be preserved or `optimizer.update` misaligns indices
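The index misalignment can be reproduced with a toy per-layer gradient list. This is a simplified illustration with hypothetical function names; it does not use the MLX optimizer or the repo's `_strip_empty`.

```python
def strip_empty_dropping(grads):
    """Buggy variant: removing empty entries shifts every later index."""
    return [g for g in grads if g]


def strip_empty_keep_length(grads):
    """Safe variant: empty entries become None so positions stay aligned."""
    return [g if g else None for g in grads]


def layers_receiving_grads(grads):
    """Indices of layers that would be updated when pairing by position."""
    return [i for i, g in enumerate(grads) if g]


# Layer 1 stands in for a Mamba layer with no LoRA parameters.
per_layer_grads = [{"lora_a": 0.5}, {}, {"lora_a": -0.5}]

print(layers_receiving_grads(strip_empty_dropping(per_layer_grads)))
print(layers_receiving_grads(strip_empty_keep_length(per_layer_grads)))
```

With the dropping variant, the gradient meant for layer 2 lands on layer 1. Keeping a placeholder preserves the one-to-one pairing between layers and gradients.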