| 2025-09 | RFM-Editing: Rectified Flow Matching for Text-guided Audio Editing | |
| 2025-09 | Continuous Audio Language Models | |
| 2025-09 | DreamAudio: Customized Text-to-Audio Generation with Diffusion Models | |
| 2025-09 | PicoAudio2: Temporal Controllable Text-to-Audio Generation with Natural Language Description | |
| 2025-08 | AudioStory: Generating Long-Form Narrative Audio with Large Language Models | |
| 2025-07 | Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models | |
| 2025-07 | DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment | |
| 2025-06 | Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model | |
| 2025-05 | AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion | |
| 2025-05 | From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data | |
| 2025-05 | T2A-Feedback: Improving Basic Capabilities of Text-to-Audio Generation via Fine-grained AI Feedback | |
| 2025-05 | Fast Text-to-Audio Generation with Adversarial Post-Training | |
| 2025-02 | AudioGenX: Explainability on Text-to-Audio Generative Models | |
| 2025-01 | Fugatto 1: Foundational Generative Audio Transformer Opus 1 | |
| 2024-12 | TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization | |
| 2024-12 | Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations | |
| 2024-11 | Audiobox TTA-RAG: Improving Zero-Shot and Few-Shot Text-To-Audio with Retrieval-Augmented Generation | |
| 2024-10 | FlashAudio: Rectified Flows for Fast and High-Fidelity Text-to-Audio Generation | |
| 2024-09 | Text2FX: Harnessing CLAP Embeddings for Text-Guided Audio Effects | |
| 2024-09 | PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models | |
| 2024-09 | AudioComposer: Towards Fine-grained Audio Generation with Natural Language Descriptions | |
| 2024-09 | EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer | |
| 2024-08 | MorphFader: Enabling Fine-grained Controllable Morphing with Text-to-Audio Models | |
| 2024-07 | Stable Audio Open | |
| 2024-07 | PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation | |
| 2024-06 | Taming Data and Transformers for Audio Generation | |
| 2024-06 | Improving Text-To-Audio Models with Synthetic Captions | |
| 2024-06 | UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner | |
| 2024-06 | LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation | |
| 2024-06 | AudioLCM: Text-to-Audio Generation with Latent Consistency Models | |
| 2024-05 | SoundCTM: Unifying Score-based and Consistency Models for Full-band Text-to-Sound Generation | |
| 2024-04 | Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization | |
| 2024-02 | Fast Timing-Conditioned Latent Audio Diffusion | |
| 2024-02 | Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities | |
| 2024-01 | Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation | |
| 2023-12 | Audiobox: Unified Audio Generation with Natural Language Prompts | |
| 2023-10 | UniAudio: An Audio Foundation Model Toward Universal Audio Generation | |
| 2023-09 | Retrieval-Augmented Text-to-Audio Generation | |
| 2023-09 | NExT-GPT: Any-to-Any Multimodal LLM | |
| 2023-08 | Audio Generation with Multiple Conditional Diffusion Model | |
| 2023-08 | AudioLDM 2: Learning Holistic Audio Generation With Self-Supervised Pretraining | |
| 2023-05 | Make-An-Audio 2: Temporal-Enhanced Text-to-Audio Generation | |
| 2023-05 | Any-to-Any Generation via Composable Diffusion | |
| 2023-04 | Text-to-Audio Generation using Instruction-Tuned LLM and Latent Diffusion Model | |
| 2023-04 | AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head | |
| 2023-01 | Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models | |
| 2023-01 | AudioLDM: Text-to-Audio Generation with Latent Diffusion Models | |
| 2022-10 | Full-band General Audio Synthesis with Score-based Diffusion | |
| 2022-09 | AudioGen: Textually Guided Audio Generation | |
| 2022-09 | AudioLM: a Language Modeling Approach to Audio Generation | |
| 2022-07 | Diffsound: Discrete Diffusion Model for Text-to-sound Generation | |
| 2022-02 | General-purpose, long-context autoregressive modeling with Perceiver AR | |
| 2021-07 | Conditional Sound Generation Using Neural Discrete Time-Frequency Representation Learning | |
| 2021-02 | On Generative Spoken Language Modeling from Raw Audio | |
| 2020-09 | DiffWave: A Versatile Diffusion Model for Audio Synthesis | |
| 2020-09 | WaveGrad: Estimating Gradients for Waveform Generation | |
| 2019-10 | MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis | |
| 2019-05 | Acoustic Scene Generation with Conditional SampleRNN | |
| 2018-02 | Efficient Neural Audio Synthesis | |
| 2016-09 | WaveNet: A Generative Model for Raw Audio | |