Semantically Complex Audio to Video Generation with Audio Source Separation

Paper (Engineering Applications of Artificial Intelligence 2025, Journal)

Abstract: Recent advancements in artificial intelligence for audio-to-video generation have shown the ability to generate high-quality videos from audio, particularly by focusing on temporal semantics and magnitude. However, existing works struggle to capture all semantics from audio, as real-world audio often consists of mixed sources, making it challenging to generate semantically aligned videos. To solve this problem, we present a novel multi-source audio-to-video generation framework that incorporates multiple decomposed audio sources into video generative models. Specifically, our proposed Attention Mosaic directly maps each decomposed audio feature to the corresponding spatial attention feature. In addition, our condition injection module helps produce more natural contexts with non-audible objects by leveraging the knowledge of existing generative models. Our experiments show that the proposed framework achieves state-of-the-art performance in both multi- and single-source audio-to-video generation.

Getting Started

Installation

Our code is tested on Ubuntu 20.04 with CUDA 11.8.

Follow the steps below:

$ conda create --name Maestro python==3.10.0
$ conda activate Maestro
$ pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
$ pip install -r requirements.txt
$ pip install pyyaml omegaconf pytorch_lightning discord opencv-python einops timm decord pytorchvideo librosa kornia transformers
$ pip install open-clip-torch==2.24.0
$ pip install av==11.0.0
$ git clone https://github.com/facebookresearch/ImageBind.git

Clone the ImageBind repository, then replace the original imagebind_model.py and data.py with ./change/imagebind_model.py and ./change/data.py, respectively.
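
The replacement step can also be scripted. The following is a minimal sketch, not part of the repository: patch_imagebind is a hypothetical helper, and the destination paths (imagebind/models/imagebind_model.py and imagebind/data.py inside the clone) are an assumption about the ImageBind repository layout that should be checked against your checkout.

```python
import shutil
from pathlib import Path

# Assumed destinations inside the cloned ImageBind repository; verify them
# against your checkout before running.
REPLACEMENTS = {
    "imagebind_model.py": Path("imagebind") / "models" / "imagebind_model.py",
    "data.py": Path("imagebind") / "data.py",
}


def patch_imagebind(change_dir, imagebind_root):
    """Copy the modified files from change_dir over the ImageBind originals."""
    change_dir, imagebind_root = Path(change_dir), Path(imagebind_root)
    copied = []
    for name, rel_target in REPLACEMENTS.items():
        source = change_dir / name
        target = imagebind_root / rel_target
        if not source.is_file():
            raise FileNotFoundError(f"missing replacement file: {source}")
        # Refuse to write if the original is absent, so a wrong path is noticed.
        if not target.is_file():
            raise FileNotFoundError(f"expected original file not found: {target}")
        shutil.copyfile(source, target)
        copied.append(target)
    return copied


# Example call from the project root:
# patch_imagebind("./change", "./ImageBind")
```

The helper raises an error instead of creating new files when a target is missing, which makes a changed upstream layout visible immediately.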

Download Pretrained Model

Download Link : Condition Injection Module weights

$ mkdir -p checkpoints/cim

Place the downloaded weights in the "./checkpoints/cim" folder. These weights were trained on the VGGSound and Landscape datasets.

Download Link (Video diffusion weights) : https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt

$ cd checkpoints
$ mkdir base_512_v2

Place the downloaded weights in the "./checkpoints/base_512_v2" folder.

Training Condition Injection Module

$ bash train.sh

Dataset Download

VGGSound : https://github.com/hche11/VGGSound
Landscape : https://kuai-lab.github.io/eccv2022sound/

Preprocess the downloaded dataset as follows:

PROJECT_ROOT/dataset/
├── video_001/
│   ├── 00001.jpg
│   ├── 00002.jpg
│   ├── ...
│   ├── 0000N.jpg
│   ├── video_001.wav
├── video_002/
│   ├── 00001.jpg
│   ├── 00002.jpg
│   ├── ...
│   ├── 0000N.jpg
│   ├── video_002.wav
└── ...

Pass the dataset folder path (PROJECT_ROOT/dataset) as --data_dir.

Custom datasets can also be used, but only videos shorter than 10 seconds are allowed, and each video must be prepared as separate frames and audio, following the layout above.
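
Before training, it can help to confirm that a prepared folder matches the layout above. The sketch below is not part of the repository: check_sample and check_dataset are hypothetical helpers, and the rule that frames are numbered consecutively from 00001.jpg with five-digit zero padding is an assumption read off the example tree.

```python
from pathlib import Path


def check_sample(sample_dir):
    """Return a list of problems found in one sample folder (empty if none)."""
    sample_dir = Path(sample_dir)
    problems = []
    frames = sorted(sample_dir.glob("*.jpg"))
    if not frames:
        problems.append("no .jpg frames")
    # Assumption: frames are named 00001.jpg, 00002.jpg, ... without gaps.
    for index, frame in enumerate(frames, start=1):
        if frame.stem != f"{index:05d}":
            problems.append(f"unexpected frame name {frame.name}")
            break
    # The audio file shares its name with the folder, e.g. video_001/video_001.wav.
    audio = sample_dir / f"{sample_dir.name}.wav"
    if not audio.is_file():
        problems.append(f"missing {audio.name}")
    return problems


def check_dataset(data_dir):
    """Map each failing sample folder name to its list of problems."""
    report = {}
    for sample_dir in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        problems = check_sample(sample_dir)
        if problems:
            report[sample_dir.name] = problems
    return report


# Example call: check_dataset("PROJECT_ROOT/dataset") returns {} when every
# sample folder follows the expected layout.
```

The check only looks at file names; it does not verify video length or audio content.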

Inference

$ bash scripts/run.sh

The --pos option sets the position of the bounding boxes. Choose either "LR" (left and right) or "TD" (top and down).
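
To illustrate what the two --pos values mean, the sketch below maps each one to a pair of bounding boxes in normalized (x0, y0, x1, y1) coordinates. It is illustrative only: split_boxes is a hypothetical function, and splitting the frame into two equal halves is an assumption, since the exact coordinates used by scripts/run.sh are not stated here.

```python
def split_boxes(pos):
    """Return two normalized boxes (x0, y0, x1, y1) for a two-source layout."""
    if pos == "LR":
        # Left half and right half of the frame.
        return [(0.0, 0.0, 0.5, 1.0), (0.5, 0.0, 1.0, 1.0)]
    if pos == "TD":
        # Top half and bottom half of the frame.
        return [(0.0, 0.0, 1.0, 0.5), (0.0, 0.5, 1.0, 1.0)]
    raise ValueError(f'pos must be "LR" or "TD", got {pos!r}')
```

Under this assumption, each separated audio source would be assigned to one of the two boxes, so "LR" places the sources side by side and "TD" stacks them vertically.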

Citation

@article{kim2025semantically,
  title={Semantically complex audio to video generation with audio source separation},
  author={Kim, Sieun and Jeong, Jaehwan and In, Sumin and Lee, Seung Hyun and Kim, Seungryong and Kim, Saerom and Baek, Wooyeol and Yoon, Sang Ho and Culurciello, Eugenio and Kim, Sangpil},
  journal={Engineering Applications of Artificial Intelligence},
  volume={149},
  pages={110457},
  year={2025},
  publisher={Elsevier}
}

Acknowledgement

Our code is based on several interesting and helpful projects:

VideoCrafter : https://github.com/AILab-CVC/VideoCrafter
ImageBind : https://github.com/facebookresearch/ImageBind
TrailBlazer : https://github.com/hohonu-vicml/Trailblazer
Perceiver : https://github.com/lucidrains/perceiver-pytorch
