Deep Modular Co-Attention Networks for Visual Question Answering
Hangzhou Dianzi University · Hồng Đức University · +4 more institutions
Abstract
Visual Question Answering (VQA) requires a fine-grained and simultaneous understanding of both the visual content of images and the textual content of questions. Therefore, designing an effective 'co-attention' model to associate key words in questions with key objects in images is central to VQA performance. So far, most successful attempts at co-attention learning have been achieved by using shallow models, and deep co-attention models show little improvement over their shallow counterparts. In this paper, we propose a deep Modular Co-Attention Network (MCAN) that consists of Modular Co-Attention (MCA) layers cascaded in depth. Each MCA layer models the self-attention of questions and images, as well as the…
Citation impact
- FWCI: 44.50
- Percentile: 100%
- References: 54
Authors: 5
Topics & keywords
- Modular design
- Computer science
- Question answering
- Benchmark (surveying)
- Attention network
- Key (lock)
- Artificial intelligence
- Deep learning
- Quality Education