MultiModal Emotion Recognition using Cross modal Attention module and Contrastive loss
- Data: KEMDy19
- Modality: Audio, Text
- Linux
- Python 3.8.16
- PyTorch 1.13.1 and CUDA 11.7
a. Create a conda virtual environment and activate it.
conda create -n MER python=3.8
conda activate MER
b. Install PyTorch and torchvision following the official instructions.
c. Clone this repository.
d. Install requirements.
pip install -r requirements.txt
e. Install DeepSpeed.
First, install libaio-dev:
sudo apt-get install libaio-dev
Then install DeepSpeed:
DS_BUILD_CPU_ADAM=1 DS_BUILD_FUSED_ADAM=1 DS_BUILD_UTILS=1 DS_BUILD_AIO=1 pip install deepspeed==0.9.0 --global-option="build_ext" --global-option="-j11" --no-cache-dir
For detailed installation instructions, see the official DeepSpeed GitHub repository.
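After the installation steps, a quick sanity check can confirm that the packages are importable from the active environment. The following is a minimal sketch, not part of this repository: the function name `check_environment` is hypothetical, and it only checks that the packages can be found, not that their versions match the ones listed above.

```python
import importlib.util
import sys


def check_environment(packages=("torch", "deepspeed")):
    """Return the running Python version and whether each package is importable."""
    report = {"python": ".".join(str(part) for part in sys.version_info[:3])}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

Run it inside the MER conda environment; a `False` entry means the corresponding install step did not take effect in that environment.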
a. Prepare data
- root_path: path to the original KEMDy19 dataset, e.g. /home/ubuntu/data/KEMD_19/
- save_path: output folder (default: ./data/)
python preprocess.py --root_path your_KEMD_19_path --save_path ./data/
Here is the preprocessing flow chart.
Note that wav_length clipping is performed in train_hf.sh or inference.py.
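To illustrate what clipping to wav_length means, here is a minimal sketch. It assumes wav_length is a maximum number of audio samples and that a waveform is a flat sequence of floats; the function name `clip_waveform` and the optional padding are hypothetical and may not match how this repository handles it.

```python
def clip_waveform(samples, wav_length, pad=False, pad_value=0.0):
    """Keep at most wav_length samples; optionally pad shorter clips.

    Padding is an assumption added for illustration, not taken from the repository.
    """
    if wav_length <= 0:
        raise ValueError("wav_length must be a positive number of samples")
    clipped = list(samples[:wav_length])
    if pad and len(clipped) < wav_length:
        # Extend short clips so every example has the same length.
        clipped.extend([pad_value] * (wav_length - len(clipped)))
    return clipped
```

For example, `clip_waveform([0.1, 0.2, 0.3, 0.4], 2)` keeps only the first two samples.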
Run Training code
bash train_hf.sh
Check your GPU setup and adjust train_hf.sh and the configs accordingly.
You can monitor training with TensorBoard:
tensorboard --logdir ./output/log/tensorboard_what_you_want/version_0/
Because this repository uses DeepSpeed stage 2, checkpoints are saved as shards across GPUs, so the shards must be merged into a single checkpoint. Collate the model weights with:
python make_model_weights.py
After this, run inference:
CUDA_VISIBLE_DEVICES=0 python inference.py
In the table, CE denotes cross-entropy loss and CA denotes contrastive loss.
Multimodal(CAT) denotes fusing the modalities by concatenation, and Multimodal(CMA) denotes fusing them with cross-modal attention.
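The difference between the two fusion strategies can be sketched in plain Python. This is an illustration only, not the model code in this repository: the function names are hypothetical, learned projection layers are omitted (so keys and values are the same vectors), and the assignment of text as queries and audio as keys/values is an assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def concat_fusion(text_vec, audio_vec):
    """CAT-style fusion: join two utterance-level vectors end to end."""
    return list(text_vec) + list(audio_vec)


def cross_modal_attention(queries, keys_values):
    """CMA-style fusion: each query vector attends over the other modality.

    queries:     list of d-dimensional vectors (assumed here: text tokens)
    keys_values: list of d-dimensional vectors (assumed here: audio frames)
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output per query.
        attended.append([
            sum(w * v[i] for w, v in zip(weights, keys_values))
            for i in range(dim)
        ])
    return attended
```

Concatenation keeps the modalities side by side and leaves their interaction to later layers, whereas cross-modal attention lets every element of one modality weight the elements of the other before classification.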



