MemoCMT: multimodal emotion recognition using cross-modal transformer-based feature fusion
Al Ain University · United Arab Emirates University · +5 more institutions
Abstract
Speech emotion recognition has seen a surge in transformer models, which excel at understanding the overall message by analyzing long-term patterns in speech. However, these models come at a computational cost. In contrast, convolutional neural networks are faster but struggle with capturing these long-range relationships. Our proposed system, MemoCMT, tackles this challenge using a novel "cross-modal transformer" (CMT). This CMT can effectively analyze local and global speech features and their corresponding text. To boost efficiency, MemoCMT leverages recent advancements in pre-trained models: HuBERT extracts meaningful features from the audio, while BERT analyzes the text. The core innovation lies in how…
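The cross-modal idea in the abstract can be illustrated with a minimal sketch of bidirectional cross-attention between two feature sequences. This is not the authors' implementation. The random arrays stand in for HuBERT frame embeddings and BERT token embeddings, and the dimensions, the single attention head, the random projection weights, and the mean-pool-and-concatenate fusion step are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product attention where one modality
    (queries) attends over the other modality (context)."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights

def random_projections(rng, d_model):
    # Hypothetical untrained projection matrices (illustration only).
    return [rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3)]

rng = np.random.default_rng(0)
d_model = 16  # assumed feature size; real HuBERT/BERT embeddings are larger

# Stand-ins for pre-extracted features (assumption: random placeholders).
text_feats = rng.normal(size=(5, d_model))    # 5 text tokens
audio_feats = rng.normal(size=(12, d_model))  # 12 audio frames

# Text attends to audio, and audio attends to text.
text_to_audio, attn = cross_attention(
    text_feats, audio_feats, *random_projections(rng, d_model))
audio_to_text, _ = cross_attention(
    audio_feats, text_feats, *random_projections(rng, d_model))

# Assumed fusion: mean-pool each stream over time, then concatenate.
fused = np.concatenate([text_to_audio.mean(axis=0),
                        audio_to_text.mean(axis=0)])
```

In a trained system the projection matrices would be learned and the fused vector would feed an emotion classifier; here the sketch only shows how each modality can query the other before fusion.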
Citation impact
- FWCI: 99.35
- Percentile: 100%
- References: 47
Authors
- Mustaqeem Khan
Al Ain University, United Arab Emirates University
- Phuong-Nam Tran
Kyung Hee University
- Nhat Truong Pham
Sungkyunkwan University
- Abdulmotaleb El Saddik
University of Ottawa, Al Ain University, United Arab Emirates University
- Alice Othmani (corresponding author)
Université Paris-Est Créteil, Paris-Est Sup
Topics & keywords
- Modal
- Computer science
- Emotion recognition
- Transformer
- Fusion
- Pattern recognition (psychology)
- Artificial intelligence
- Speech recognition