Preprint | arXiv (Cornell University) | Feb 10, 2015 | Green OA

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Indexed in arXiv

Abstract

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.
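The deterministic training route mentioned in the abstract relies on the attention step being differentiable. The following is a minimal sketch of one such "soft" attention step, not the authors' implementation: the function name, the parameter names (`W_f`, `W_h`, `v`), the tanh scoring form, and all array sizes are assumptions chosen for illustration.

```python
import numpy as np

def soft_attention(features, hidden, W_f, W_h, v):
    """Return (context, weights) for one decoding step.

    features: (L, D) array, one D-dim feature vector per image location
    hidden:   (H,) array, the decoder's previous hidden state
    W_f (D, K), W_h (H, K), v (K,): hypothetical scoring parameters
    """
    # Score every image location against the current decoder state.
    scores = np.tanh(features @ W_f + hidden @ W_h) @ v  # shape (L,)
    # Softmax over locations, shifted by the max for numerical stability.
    exp = np.exp(scores - scores.max())
    weights = exp / exp.sum()
    # The context vector is the weighted average of the location features.
    context = weights @ features  # shape (D,)
    return context, weights

# Illustrative sizes only (assumed, not taken from this record).
rng = np.random.default_rng(0)
L, D, H, K = 196, 512, 256, 128
features = rng.standard_normal((L, D))
hidden = rng.standard_normal(H)
W_f = rng.standard_normal((D, K)) * 0.01
W_h = rng.standard_normal((H, K)) * 0.01
v = rng.standard_normal(K)

context, weights = soft_attention(features, hidden, W_f, W_h, v)
```

Because every operation here is smooth, gradients can flow through `weights` with ordinary backpropagation. Reshaping `weights` to the spatial grid of the feature map is one way to produce the kind of gaze visualization the abstract refers to.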

Citation impact

1,764 total citations
38 references

Authors

8

Topics & keywords

Keywords
  • Computer science
  • Benchmark (surveying)
  • Artificial intelligence
  • Gaze
  • Visualization
  • Object (grammar)
  • Salient
  • Sequence (biology)