Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
University of California, Berkeley · Berkeley College · Max Planck Institute for Informatics
Abstract
Modeling textual or visual information with vector representations trained from large language or visual datasets has been successfully explored in recent years. However, tasks such as visual question answering require combining these vector representations with each other. Approaches to multimodal pooling include element-wise product or sum, as well as concatenation of the visual and textual representations. We hypothesize that these methods are not as expressive as an outer product of the visual and textual vectors. As the outer product is typically infeasible due to its high dimensionality, we instead propose utilizing Multimodal Compact Bilinear pooling (MCB) to efficiently and expressively combine…
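The idea behind compact bilinear pooling can be illustrated with a small sketch. It assumes the usual Count Sketch construction: each vector is projected to d dimensions with random buckets and signs, and the two sketches are then combined by circular convolution, which equals the Count Sketch of the full outer product. The function names, the toy dimensions, and the direct O(d²) convolution are illustrative assumptions and not the authors' implementation; a practical version would compute the convolution with an FFT.

```python
import random

def make_sketch_params(n, d, rng):
    """Draw one random bucket in [0, d) and one random sign per input index."""
    h = [rng.randrange(d) for _ in range(n)]
    s = [rng.choice((-1, 1)) for _ in range(n)]
    return h, s

def count_sketch(v, h, s, d):
    """Project v to d dimensions: signed entries are summed into their buckets."""
    y = [0.0] * d
    for i, value in enumerate(v):
        y[h[i]] += s[i] * value
    return y

def circular_convolve(a, b):
    """Direct circular convolution (illustrative; an FFT is faster for large d)."""
    d = len(a)
    out = [0.0] * d
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[(i + j) % d] += ai * bj
    return out

def mcb_pool(x, q, params_x, params_q, d):
    """Combine two vectors without ever forming their outer product."""
    sketch_x = count_sketch(x, *params_x, d)
    sketch_q = count_sketch(q, *params_q, d)
    return circular_convolve(sketch_x, sketch_q)

def sketch_of_outer_product(x, q, params_x, params_q, d):
    """Reference: Count Sketch applied to the explicit outer product of x and q."""
    (hx, sx), (hq, sq) = params_x, params_q
    y = [0.0] * d
    for i, xi in enumerate(x):
        for j, qj in enumerate(q):
            y[(hx[i] + hq[j]) % d] += sx[i] * sq[j] * xi * qj
    return y

# Toy sizes (hypothetical): an 8-dim "visual" and a 6-dim "textual" vector.
rng = random.Random(0)
d = 16
x = [rng.uniform(-1, 1) for _ in range(8)]
q = [rng.uniform(-1, 1) for _ in range(6)]
params_x = make_sketch_params(len(x), d, rng)
params_q = make_sketch_params(len(q), d, rng)

pooled = mcb_pool(x, q, params_x, params_q, d)
reference = sketch_of_outer_product(x, q, params_x, params_q, d)
```

Here `pooled` and `reference` agree up to floating-point rounding, which is the property that lets the outer product be approximated in d dimensions instead of being materialized in full.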
Citation impact
- FWCI: 72.14
- Percentile: 100%
- References: 69
Authors (6)
- Akira Fukui (corresponding author)
University of California, Berkeley, Berkeley College
- Dong Huk Park
University of California, Berkeley, Berkeley College
- Daylen Yang
University of California, Berkeley, Berkeley College
- Anna Rohrbach
University of California, Berkeley, Berkeley College, Max Planck Institute for Informatics
- Trevor Darrell
University of California, Berkeley, Berkeley College
Topics & keywords
- Pooling
- Question answering
- Computer science
- Bilinear interpolation
- Artificial intelligence
- Ground
- Computer vision
- Engineering
- Quality Education