Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

Mao, Junhua; Xu, Wei; Yang, Yi; Wang, Jiang; Huang, Zhiheng; Yuille, Alan

doi:10.48550/arxiv.1412.6632

Preprint · arXiv (Cornell University) · Dec 20, 2014 · Green OA

University of California, Los Angeles · Baidu (China)

Indexed in: arXiv, DataCite

Abstract

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN…
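The generation procedure described in the abstract (model the distribution of the next word given the previous words and an image, then sample from it) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' implementation: the class name `ToyMRNN`, the tiny vocabulary, the layer sizes, the random untrained weights, and the specific way the word, recurrent, and image features are summed in the multimodal layer. A fixed feature vector stands in for the output of the convolutional network.

```python
import math
import random

# Hypothetical toy vocabulary; the real model uses the dataset vocabulary.
VOCAB = ["<start>", "a", "dog", "runs", "<end>"]


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_matrix(rows, cols, rng):
    """Random (untrained) weights, for illustration only."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(mat, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in mat]


class ToyMRNN:
    """Assumed minimal structure: a recurrent sentence branch and an image
    feature that meet in a multimodal layer feeding a softmax over words."""

    def __init__(self, hidden=4, image_dim=3, seed=0):
        rng = random.Random(seed)
        v = len(VOCAB)
        self.hidden = hidden
        self.embed = make_matrix(v, hidden, rng)          # word embeddings
        self.w_rec = make_matrix(hidden, hidden, rng)     # recurrent weights
        self.w_img = make_matrix(hidden, image_dim, rng)  # image projection
        self.w_out = make_matrix(v, hidden, rng)          # multimodal -> scores

    def step(self, word_id, state, image_feat):
        """Return P(next word | previous words, image) and the new state."""
        word = self.embed[word_id]
        rec = matvec(self.w_rec, state)
        new_state = [max(0.0, a + b) for a, b in zip(word, rec)]
        img = matvec(self.w_img, image_feat)
        # Multimodal layer: combine word, recurrent, and image features.
        multimodal = [math.tanh(a + b + c)
                      for a, b, c in zip(word, new_state, img)]
        return softmax(matvec(self.w_out, multimodal)), new_state


def sample_caption(model, image_feat, max_len=10, seed=0):
    """Sample words one at a time until <end> or max_len is reached."""
    rng = random.Random(seed)
    state = [0.0] * model.hidden
    word_id = VOCAB.index("<start>")
    caption = []
    for _ in range(max_len):
        probs, state = model.step(word_id, state, image_feat)
        word_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        if VOCAB[word_id] == "<end>":
            break
        caption.append(VOCAB[word_id])
    return caption
```

Calling `sample_caption(ToyMRNN(), [1.0, 0.5, -0.2])` returns a list of vocabulary words. Because the weights are random, the output is not a meaningful caption; the sketch only shows the sampling loop and where the two sub-networks interact.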

Citation impact

Total citations: 651

FWCI: —
Percentile: —
References: 39

Authors: 6

Topics & keywords

Keywords

Recurrent neural network
Computer science
Closed captioning
Benchmark (surveying)
Artificial intelligence
Convolutional neural network
Image (mathematics)
Ranking (information retrieval)

UN Sustainable Development Goals

Quality Education
