Integrating Multimodal Information in Large Pretrained Transformers
University of Rochester · Age Institute
Abstract
Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior downstream performance. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). Pre-trained models do not have the components needed to accept the two extra modalities, vision and acoustic. In this paper, we propose an attachment to BERT and XLNet called…
Citation impact
- FWCI: 32.10
- Percentile: 100%
- References: 34
Authors: 7
Topics & keywords
- Computer science
- Transformer
- Artificial intelligence
- Human–computer interaction
- Natural language processing
- Engineering
- Electrical engineering
- Quality Education