kuai-lab/eaai25_complex_A2V

Semantically Complex Audio to Video Generation with Audio Source Separation

Paper (Engineering Applications of Artificial Intelligence 2025, Journal)

  • Abstract: Recent advancements in artificial intelligence for audio-to-video generation have shown the ability to generate high-quality videos from audio, particularly by focusing on temporal semantics and magnitude. However, existing works struggle to capture all semantics from audio, as real-world audio often consists of mixed sources, making it challenging to generate semantically aligned videos. To solve this problem, we present a novel multi-source audio-to-video generation framework that incorporates decomposed multiple audio sources into video generative models. Specifically, our proposed Attention Mosaic directly maps each decomposed audio feature to the corresponding spatial attention feature. In addition, our condition injection module helps produce more natural contexts with non-audible objects by leveraging the knowledge of existing generative models. Our experiments show that the proposed framework achieves state-of-the-art performance in both multi-source and single-source audio-to-video generation.

Getting Started

Installation

Our code is tested on Ubuntu 20.04 with CUDA 11.8.

  • Follow the steps below:
$ conda create --name Maestro python==3.10.0
$ conda activate Maestro
$ pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118
$ pip install -r requirements.txt
$ pip install pyyaml omegaconf pytorch_lightning discord opencv-python einops timm decord pytorchvideo librosa kornia transformer
$ pip install open-clip-torch==2.24.0
$ pip install av==11.0.0
$ git clone https://github.com/facebookresearch/ImageBind.git

After cloning the ImageBind repository, replace its original imagebind_model.py and data.py with ./change/imagebind_model.py and ./change/data.py, respectively.
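The file replacement can also be scripted. The following is a minimal sketch, not part of this repository: the function name patch_imagebind is hypothetical, and the target paths inside the clone (imagebind/models/imagebind_model.py and imagebind/data.py) are an assumption about the ImageBind layout that should be verified against your checkout.

```python
import shutil
from pathlib import Path

# ASSUMPTION: where the two original files live inside the ImageBind clone.
# This README does not state these paths; check them in your checkout.
ASSUMED_TARGETS = {
    "imagebind_model.py": Path("imagebind/models/imagebind_model.py"),
    "data.py": Path("imagebind/data.py"),
}


def patch_imagebind(change_dir, imagebind_root, targets=ASSUMED_TARGETS):
    """Copy the replacement files from change_dir over the originals.

    Raises FileNotFoundError if a replacement or an expected original is
    missing, so a wrong path assumption fails loudly instead of silently.
    """
    change_dir = Path(change_dir)
    imagebind_root = Path(imagebind_root)
    copied = []
    for name, rel_target in targets.items():
        src = change_dir / name
        dst = imagebind_root / rel_target
        if not src.is_file():
            raise FileNotFoundError(f"missing replacement file: {src}")
        if not dst.is_file():
            raise FileNotFoundError(f"expected original not found: {dst}")
        shutil.copyfile(src, dst)
        copied.append(dst)
    return copied
```

Calling patch_imagebind("./change", "./ImageBind") from the project root would overwrite the two originals and return the paths it replaced.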

Download Pretrained Model

  1. Download link: Condition Injection Module weights
$ mkdir checkpoints
$ cd checkpoints
$ mkdir cim

Place the downloaded weights in the "./checkpoints/cim" folder. (These weights were trained on the VGGSound and Landscape datasets.)

  2. Download link (video diffusion weights): https://huggingface.co/VideoCrafter/VideoCrafter2/blob/main/model.ckpt
$ cd checkpoints
$ mkdir base_512_v2

Place the downloaded weights in the "./checkpoints/base_512_v2" folder.
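Before training or inference, it can help to confirm that both checkpoint folders are populated. The sketch below is a hypothetical helper (missing_checkpoints is not part of this repository). It only checks that each folder exists and contains at least one file, because the file name of the Condition Injection Module weights is not given here.

```python
from pathlib import Path

# Folder names taken from the download steps above.
EXPECTED_DIRS = ("cim", "base_512_v2")


def missing_checkpoints(checkpoint_root="./checkpoints", expected=EXPECTED_DIRS):
    """Return the expected sub-folders that are absent or contain no files."""
    root = Path(checkpoint_root)
    missing = []
    for name in expected:
        folder = root / name
        # A folder counts as populated if it holds at least one regular file.
        has_file = folder.is_dir() and any(p.is_file() for p in folder.iterdir())
        if not has_file:
            missing.append(name)
    return missing
```

An empty return value means both folders hold at least one file; it does not verify that the weights themselves are valid.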

Training Condition Injection Module

$ bash train.sh

Dataset Download

Preprocess the downloaded dataset as follows:

PROJECT_ROOT/dataset/
├── video_001/
│   ├── 00001.jpg
│   ├── 00002.jpg
│   ├── ...
│   ├── 0000N.jpg
│   ├── video_001.wav
├── video_002/
│   ├── 00001.jpg
│   ├── 00002.jpg
│   ├── ...
│   ├── 0000N.jpg
│   ├── video_002.wav
└── ...
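One way to produce this layout from a raw video file is with ffmpeg. The sketch below only builds the command lines. The helper name build_ffmpeg_commands is hypothetical, ffmpeg is assumed to be installed, and the frame rate and audio sample rate are left at ffmpeg's defaults because this README does not specify them.

```python
from pathlib import Path


def build_ffmpeg_commands(video_path, dataset_root):
    """Build ffmpeg commands that split one video into frames and a WAV file.

    Returns (out_dir, frames_cmd, audio_cmd). The output folder is named
    after the video file, matching the video_001/ layout shown above.
    """
    video_path = Path(video_path)
    out_dir = Path(dataset_root) / video_path.stem
    # %05d.jpg yields 00001.jpg, 00002.jpg, ... (ffmpeg numbering starts at 1).
    frames_cmd = ["ffmpeg", "-i", str(video_path), str(out_dir / "%05d.jpg")]
    # -vn drops the video stream so that only audio is written.
    audio_cmd = [
        "ffmpeg", "-i", str(video_path), "-vn",
        str(out_dir / f"{video_path.stem}.wav"),
    ]
    return out_dir, frames_cmd, audio_cmd
```

To run the commands, create out_dir first (for example with out_dir.mkdir(parents=True, exist_ok=True)) and pass each list to subprocess.run with check=True.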

Pass the dataset folder path (PROJECT_ROOT/dataset) as the --data_dir argument.

To use a custom dataset, include only videos shorter than 10 seconds, and prepare each one separately as image frames plus an audio file, following the layout above.
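The layout and length requirements can be checked before training. The following is a minimal sketch of such a check, under stated assumptions: check_video_folder is a hypothetical helper, frames are assumed to use five-digit names as in the tree above, and the standard-library wave module is used, which reads only uncompressed PCM WAV files.

```python
import re
import wave
from pathlib import Path

# Frame names such as 00001.jpg, as in the directory tree above.
FRAME_PATTERN = re.compile(r"^\d{5}\.jpg$")
MAX_SECONDS = 10.0  # videos must be shorter than this


def check_video_folder(folder, max_seconds=MAX_SECONDS):
    """Return a list of problems found in one video folder (empty if none)."""
    folder = Path(folder)
    problems = []
    frames = [p for p in folder.iterdir() if FRAME_PATTERN.match(p.name)]
    if not frames:
        problems.append("no frames named like 00001.jpg")
    # The audio file is expected to share the folder's name.
    wav_path = folder / f"{folder.name}.wav"
    if not wav_path.is_file():
        problems.append(f"missing audio file {wav_path.name}")
    else:
        with wave.open(str(wav_path), "rb") as wav:
            seconds = wav.getnframes() / wav.getframerate()
        if seconds >= max_seconds:
            problems.append(
                f"audio is {seconds:.2f}s, expected under {max_seconds}s"
            )
    return problems
```

Running this over every sub-folder of PROJECT_ROOT/dataset and printing the non-empty results gives a quick report of folders that need attention.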

Inference

$ bash scripts/run.sh

The --pos option sets the layout of the bounding boxes: choose either "LR" (left and right) or "TD" (top and down).
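How the script turns --pos into boxes is not described here. As an illustration only, one plausible reading is that the frame is split into two equal halves, one per separated audio source. The sketch below shows that interpretation; split_boxes and the (x0, y0, x1, y1) box format are assumptions, not the repository's implementation.

```python
def split_boxes(pos, width, height):
    """Split a frame into two half-frame boxes as (x0, y0, x1, y1) tuples.

    ASSUMPTION: "LR" means left/right halves and "TD" means top/bottom
    halves. The actual box construction in the inference script may differ.
    """
    if pos == "LR":
        half = width // 2
        return [(0, 0, half, height), (half, 0, width, height)]
    if pos == "TD":
        half = height // 2
        return [(0, 0, width, half), (0, half, width, height)]
    raise ValueError(f'pos must be "LR" or "TD", got {pos!r}')
```

Under this reading, split_boxes("LR", 512, 320) gives one box covering the left half of a 512x320 frame and one covering the right half.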

Citation

@article{kim2025semantically,
  title={Semantically complex audio to video generation with audio source separation},
  author={Kim, Sieun and Jeong, Jaehwan and In, Sumin and Lee, Seung Hyun and Kim, Seungryong and Kim, Saerom and Baek, Wooyeol and Yoon, Sang Ho and Culurciello, Eugenio and Kim, Sangpil},
  journal={Engineering Applications of Artificial Intelligence},
  volume={149},
  pages={110457},
  year={2025},
  publisher={Elsevier}
}

Acknowledgement

Our code is based on several interesting and helpful projects:
