ICASSP 2026
S-SONDO is the first framework for self-supervised knowledge distillation of general audio foundation models. It distills large teacher models into lightweight students that are up to 61x smaller while retaining up to 96% of teacher performance. Distillation uses only the teacher's output embeddings; no logits or layer-level alignment are required.
Fig. 1. Overview of the proposed S-SONDO framework. The student embeddings are mapped and aligned with the teacher embeddings in the teacher's latent space through self-supervised knowledge distillation.
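The alignment step described above can be illustrated with a minimal NumPy sketch. It assumes a single linear projection from the student space to the teacher space and a cosine-distance objective; both are illustrative choices, and the actual mapping and loss used by S-SONDO may differ. The function name, array shapes, and random data are all hypothetical.

```python
import numpy as np


def cosine_distillation_loss(student_emb, teacher_emb, projection):
    """Mean cosine distance between projected student and teacher embeddings.

    Illustrative only: the linear projection and the cosine objective are
    assumptions, not necessarily what S-SONDO uses.
    """
    # Map student embeddings (batch, d_student) into the teacher space
    # (batch, d_teacher).
    mapped = student_emb @ projection
    # L2-normalise both sides so the row-wise dot product is the cosine
    # similarity.
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    target = teacher_emb / np.linalg.norm(teacher_emb, axis=1, keepdims=True)
    cosine = np.sum(mapped * target, axis=1)
    return float(np.mean(1.0 - cosine))


rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 16))     # precomputed teacher embeddings (hypothetical sizes)
student = rng.normal(size=(8, 4))      # smaller student embeddings
projection = rng.normal(size=(4, 16))  # mapping that would be learned during training
loss = cosine_distillation_loss(student, teacher, projection)
print(f"distillation loss: {loss:.3f}")
```

In a real training loop the projection and the student would be updated by gradient descent on this loss, while the teacher embeddings stay fixed.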
Downstream evaluation across 7 audio tasks (4 music + 3 environmental sound). Students retain up to 96.4% of teacher performance while being up to 61x smaller.
Table 1. Downstream evaluation of S-SONDO with 95% confidence intervals (CI). We report the performance of our knowledge distillation method across teacher-student combinations. For each student model, supervised training results are reported as a reference (rows where MobileNetV3, DyMN, and ERes2Net have no teacher model). Bold values indicate the best result for each student between supervised and distillation training. Greyed values correspond to teacher performance, and green numbers denote the percentage of teacher performance achieved by the student.
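As a reading aid for the retention percentages and size ratios quoted above, the following sketch shows the arithmetic, assuming retention is the student score divided by the teacher score on the same task and the size ratio is a ratio of parameter counts. The helper names and all numbers are hypothetical and are not taken from Table 1.

```python
def retention_percent(student_score, teacher_score):
    """Share of the teacher's score that the student reaches, in percent."""
    return 100.0 * student_score / teacher_score


def compression_ratio(teacher_params, student_params):
    """How many times smaller the student is than the teacher."""
    return teacher_params / student_params


# Hypothetical values chosen for easy arithmetic, not results from the paper.
retention = retention_percent(student_score=72.0, teacher_score=80.0)
ratio = compression_ratio(teacher_params=61_000_000, student_params=1_000_000)
print(f"{retention:.1f}% of teacher performance at {ratio:.0f}x smaller")
```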
Fig. 2. Ablation on the number of clusters used for Balanced Data Sampling. The dashed line is the random sampling baseline.
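One way to picture cluster-balanced sampling is the sketch below. It assumes each training clip already has a cluster id (for example from clustering the teacher embeddings) and draws a cluster uniformly before drawing a clip inside it. This is an illustrative scheme only; the function name and the exact sampling rule are assumptions, not the repository's implementation.

```python
import random
from collections import defaultdict


def balanced_sample(cluster_ids, num_samples, seed=0):
    """Draw indices (with replacement) so every cluster is equally likely."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, cluster in enumerate(cluster_ids):
        by_cluster[cluster].append(index)
    clusters = sorted(by_cluster)
    picked = []
    for _ in range(num_samples):
        cluster = rng.choice(clusters)                  # uniform over clusters
        picked.append(rng.choice(by_cluster[cluster]))  # uniform within the cluster
    return picked


# Toy dataset: cluster 0 holds 90% of the clips, cluster 1 only 10%.
cluster_ids = [0] * 90 + [1] * 10
sample = balanced_sample(cluster_ids, num_samples=1000)
share_minority = sum(cluster_ids[i] == 1 for i in sample) / len(sample)
print(f"minority cluster share after balancing: {share_minority:.2f}")
```

With random sampling the minority cluster would make up about 10% of each batch; drawing the cluster first brings it close to an equal share, at the cost of repeating clips from small clusters more often.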
This repository is organized into three main folders:
| Folder | Description |
|---|---|
| `inference_ssondo/` | PyPI package (`pip install ssondo`): lightweight inference and finetuning with pretrained S-SONDO models. Auto-downloads checkpoints from Hugging Face Hub. |
| `training_ssondo/` | Training pipeline: full 4-step workflow to reproduce the paper (download AudioSet, extract teacher embeddings, cluster, and train student models via knowledge distillation). One-command setup with `./setup.sh`. |
| `notebooks/` | Evaluation notebooks: clustering analysis (t-SNE, UMAP, NMI) and linear probe / finetuning on ESC-50. Uses `ssondo` from PyPI, no local setup needed. |
If you use S-SONDO in your research, please cite:
@inproceedings{eladlouni2026ssondo,
title={S-SONDO: Self-Supervised Knowledge Distillation for General Audio Foundation Models},
author={El Adlouni, Mohammed Ali and Quelennec, Aurian and Chouteau, Pierre and Peeters, Geoffroy and Essid, Slim},
booktitle={IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
year={2026}
}

This project is licensed under the MIT License. See LICENSE for details.
- MATPAC — Teacher model
- M2D — Teacher model
- EfficientAT — Student architectures (MobileNetV3, DyMN)
- AudioSet — Training data

