Emotion Recognition in Conversation (ERC) enables machines to respond emotionally to users' requests, which is crucial in human-computer interaction. Existing methods either concentrate on capturing context and speaker-dependent relationships within the text modality or emphasize multi-modal interaction and feature fusion. However, these methods fail to effectively align shared emotional representations across modalities while preserving the unique emotional features of each modality. To address this challenge, we propose an end-to-end framework called Cross-Modal Shared Attention (ShareERC). Specifically, we first extract context-aware features from text and audio, aggregating emotional information within each modality. Leveraging this aggregated emotional information, we design a Cross-Modal Shared Attention module that aligns shared features across modalities in the feature space while preserving the distinct features of each modality. Extensive experiments on the MELD and IEMOCAP datasets demonstrate the effectiveness and superiority of the proposed ShareERC.
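To make the idea of aligning shared features while keeping modality-specific ones more concrete, here is a minimal, illustrative sketch in plain Python. It is not the ShareERC implementation. It assumes one possible reading of the description above: both modalities are projected with the same (shared) matrix before attention, and a separate private projection keeps text-only features. All names (`shared_cross_attention`, `w_shared`, `w_text_private`), the dimensions, and the random inputs are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def shared_cross_attention(text, audio, w_shared, w_text_private):
    """Hypothetical sketch: text attends over audio in a shared space.

    text:  n utterances x d features
    audio: m utterances x d features
    w_shared:       d x k, applied to BOTH modalities (assumption)
    w_text_private: d x p, applied to text only (assumption)
    """
    # Project both modalities with the same matrix so they share one space.
    queries = matmul(text, w_shared)
    keys = matmul(audio, w_shared)

    # Scaled dot-product attention from text utterances to audio utterances.
    scale = math.sqrt(len(w_shared[0]))
    scores = [[s / scale for s in row] for row in matmul(queries, transpose(keys))]
    weights = [softmax(row) for row in scores]

    # Audio information aligned to each text utterance, in the shared space.
    aligned = matmul(weights, keys)

    # Modality-specific text features, kept separate from the shared ones.
    private = matmul(text, w_text_private)

    # Concatenate shared-aligned and private features per utterance.
    fused = [a + p for a, p in zip(aligned, private)]
    return fused, weights


rng = random.Random(0)
text_feats = rand_matrix(3, 4, rng)    # 3 utterances, 4-dim text features
audio_feats = rand_matrix(5, 4, rng)   # 5 utterances, 4-dim audio features
w_shared = rand_matrix(4, 2, rng)      # shared projection to 2 dims
w_text_private = rand_matrix(4, 2, rng)  # private text projection to 2 dims

fused, weights = shared_cross_attention(
    text_feats, audio_feats, w_shared, w_text_private
)
```

In this sketch, `fused` holds one vector per text utterance that concatenates the audio-aligned shared features with the private text features, and each row of `weights` is an attention distribution over the audio utterances. A real model would learn the projections and would typically apply the same step in the audio-to-text direction as well.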
The environment will be updated once this paper is accepted.