Article · Scientific Reports · Feb 14, 2025 · Gold OA

MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion

Al Ain University · United Arab Emirates University · +5 more institutions

Indexed in: Crossref · DOAJ · PubMed

Abstract

Speech emotion recognition has seen a surge in transformer models, which excel at understanding the overall message by analyzing long-term patterns in speech. However, these models come at a computational cost. In contrast, convolutional neural networks are faster but struggle with capturing these long-range relationships. Our proposed system, MemoCMT, tackles this challenge using a novel "cross-modal transformer" (CMT). This CMT can effectively analyze local and global speech features and their corresponding text. To boost efficiency, MemoCMT leverages recent advancements in pre-trained models: HuBERT extracts meaningful features from the audio, while BERT analyzes the text. The core innovation lies in how…
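The abstract is cut off before it describes how the fusion works, so the following is only a generic sketch of cross-modal attention, not MemoCMT's actual design. It assumes that each modality has already been encoded into a sequence of feature vectors (small hand-written lists stand in for HuBERT and BERT outputs), that each modality attends to the other with plain scaled dot-product attention, and that the two results are mean-pooled and concatenated. The function names, the toy values, and the pooling and concatenation steps are all illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention with queries from one modality
    and keys/values from the other (no learned projections)."""
    dim = len(keys[0])
    attended = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


def mean_pool(sequence):
    """Average a sequence of vectors into a single vector."""
    n = len(sequence)
    return [sum(vec[j] for vec in sequence) / n for j in range(len(sequence[0]))]


# Hypothetical stand-ins for encoder outputs (3 audio frames, 2 text tokens).
audio_feats = [[0.1, 0.3, -0.2, 0.5], [0.4, -0.1, 0.2, 0.0], [0.0, 0.2, 0.1, -0.3]]
text_feats = [[0.2, 0.1, 0.0, -0.1], [-0.3, 0.4, 0.1, 0.2]]

# Each modality queries the other.
text_attends_audio = cross_attention(text_feats, audio_feats, audio_feats)
audio_attends_text = cross_attention(audio_feats, text_feats, text_feats)

# Assumed fusion: mean-pool each attended sequence, then concatenate.
fused = mean_pool(text_attends_audio) + mean_pool(audio_attends_text)
```

In a trained model the queries, keys, and values would pass through learned projections and the fused vector would feed a classifier over emotion labels; both are omitted here to keep the sketch dependency-free.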
