Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering
Hangzhou Dianzi University · University of North Carolina at Charlotte · +1 more institution
Abstract
Visual question answering (VQA) is challenging because it requires a simultaneous understanding of both the visual content of images and the textual content of questions. The approaches used to represent the images and questions in a fine-grained manner and to fuse these multimodal features play key roles in performance. Bilinear pooling based models have been shown to outperform traditional linear models for VQA, but their high-dimensional representations and high computational complexity may seriously limit their applicability in practice. For multimodal feature fusion, here we develop a Multi-modal Factorized Bilinear (MFB) pooling approach to efficiently and effectively combine multi-modal…
Citation impact
- FWCI: 21.35
- Percentile: 100%
- References: 54
Authors: 4
Topics & keywords
- Pooling
- Bilinear interpolation
- Computer science
- Artificial intelligence
- Feature (linguistics)
- Question answering
- Modal
- Representation (politics)