Preprint · arXiv (Cornell University) · Dec 20, 2014 · Green OA

Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN)

University of California, Los Angeles · Baidu (China)

Indexed in: arXiv · DataCite

Abstract

In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions. It directly models the probability distribution of generating a word given previous words and an image. Image captions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on four benchmark datasets (IAPR TC-12, Flickr 8K, Flickr 30K and MS COCO). Our model outperforms the state-of-the-art methods. In addition, we apply the m-RNN…
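The generation procedure described in the abstract can be illustrated with a minimal sketch: at each step a recurrent state summarizes the previous words, a multimodal layer combines it with the image representation, and the next word is sampled from the resulting distribution. Everything concrete here is an assumption made for illustration and is not taken from the abstract: the toy vocabulary, the layer sizes, the choice of ReLU and tanh activations, the parameter names, and the random vector standing in for the output of the convolutional network. The weights are random, so the sampled captions are meaningless.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary and layer sizes (hypothetical, for illustration only).
VOCAB = ["<start>", "<end>", "a", "dog", "runs"]
START_ID, END_ID = 0, 1
V, E, R, M, F = len(VOCAB), 8, 8, 16, 12  # vocab, embedding, recurrent, multimodal, image feature

# Random parameters stand in for trained weights.
W_embed = rng.normal(0, 0.1, (V, E))  # word id -> embedding
W_in = rng.normal(0, 0.1, (E, R))     # embedding -> recurrent layer
W_rec = rng.normal(0, 0.1, (R, R))    # previous recurrent state -> recurrent layer
V_w = rng.normal(0, 0.1, (E, M))      # embedding -> multimodal layer
V_r = rng.normal(0, 0.1, (R, M))      # recurrent state -> multimodal layer
V_i = rng.normal(0, 0.1, (F, M))      # image feature -> multimodal layer
W_out = rng.normal(0, 0.1, (M, V))    # multimodal layer -> vocabulary logits


def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()


def next_word_distribution(word_id, r_prev, image_feat):
    """Return P(next word | previous words, image) and the new recurrent state."""
    w = W_embed[word_id]
    r = np.maximum(0.0, w @ W_in + r_prev @ W_rec)      # recurrent layer (ReLU assumed)
    m = np.tanh(w @ V_w + r @ V_r + image_feat @ V_i)   # multimodal layer (tanh assumed)
    logits = m @ W_out
    logits[START_ID] = -np.inf                          # never emit the start token
    return softmax(logits), r


def sample_caption(image_feat, max_len=10):
    """Sample words one at a time until <end> or max_len is reached."""
    word_id, r, caption = START_ID, np.zeros(R), []
    for _ in range(max_len):
        probs, r = next_word_distribution(word_id, r, image_feat)
        word_id = int(rng.choice(V, p=probs))
        if word_id == END_ID:
            break
        caption.append(VOCAB[word_id])
    return caption


image_feat = rng.normal(0, 1, F)  # stand-in for a convolutional network's feature vector
caption = sample_caption(image_feat)
print(caption)
```

In this sketch the image feature enters at every step through the multimodal layer, which is one plausible reading of how the two sub-networks "interact"; the abstract does not spell out the exact connections.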

Citation impact

Total citations: 651
References: 39

Authors

6

Topics & keywords

Keywords
  • Recurrent neural network
  • Computer science
  • Closed captioning
  • Benchmark (surveying)
  • Artificial intelligence
  • Convolutional neural network
  • Image (mathematics)
  • Ranking (information retrieval)
UN Sustainable Development Goals
  • Quality Education