MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion

Khan, Mustaqeem; Tran, Phuong-Nam; Pham, Nhat Truong; Saddik, Abdulmotaleb El; Othmani, Alice

doi:10.1038/s41598-025-89202-x

Article · Scientific Reports · Feb 14, 2025 · Gold OA

Al Ain University · United Arab Emirates University · +5 more institutions

Indexed in: Crossref, DOAJ, PubMed

Abstract

Speech emotion recognition has seen a surge in transformer models, which excel at understanding the overall message by analyzing long-term patterns in speech. However, these models come at a computational cost. In contrast, convolutional neural networks are faster but struggle with capturing these long-range relationships. Our proposed system, MemoCMT, tackles this challenge using a novel "cross-modal transformer" (CMT). This CMT can effectively analyze local and global speech features and their corresponding text. To boost efficiency, MemoCMT leverages recent advancements in pre-trained models: HuBERT extracts meaningful features from the audio, while BERT analyzes the text. The core innovation lies in how…
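The cross-modal idea in the abstract can be illustrated with a minimal sketch of bidirectional cross-attention between audio and text features. This is not the authors' implementation: the abstract is truncated before it describes the fusion step, so the random projection weights, the feature sizes, the mean pooling, and the final concatenation are all assumptions chosen for illustration. The random arrays stand in for HuBERT frame features and BERT token features.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head scaled dot-product attention where queries come from one
    modality and keys/values come from the other."""
    q = query_feats @ w_q
    k = context_feats @ w_k
    v = context_feats @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)

# Hypothetical sizes: 50 audio frames, 12 text tokens, 768-dim features.
d_audio, d_text, d_model = 768, 768, 64
audio = rng.standard_normal((50, d_audio))  # stand-in for HuBERT features
text = rng.standard_normal((12, d_text))    # stand-in for BERT features


def proj(d_in):
    """Random projection matrix (a trained model would learn these)."""
    return rng.standard_normal((d_in, d_model)) * 0.02


# Text tokens attend over audio frames, and audio frames attend over text tokens.
text_ctx, w_text_to_audio = cross_attention(
    text, audio, proj(d_text), proj(d_audio), proj(d_audio))
audio_ctx, w_audio_to_text = cross_attention(
    audio, text, proj(d_audio), proj(d_text), proj(d_text))

# Assumed fusion: mean-pool each stream over time, then concatenate.
fused = np.concatenate([text_ctx.mean(axis=0), audio_ctx.mean(axis=0)])
```

In a trained system the fused vector would feed an emotion classifier. The pooling and concatenation here are placeholders for whatever fusion the full paper specifies.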

Citation impact

Total citations: 62
FWCI: 99.35
Percentile: 100%
References: 47

Authors: 5

Topics & keywords

Keywords

Modal
Computer science
Emotion recognition
Transformer
Fusion
Pattern recognition (psychology)
Artificial intelligence
Speech recognition