Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Xu, Kelvin; Ba, Jimmy; Kiros, Ryan; Cho, Kyunghyun; Courville, Aaron; Salakhutdinov, Ruslan; Zemel, Richard S.; Bengio, Yoshua

doi:10.48550/arxiv.1502.03044

Preprint, arXiv (Cornell University), Feb 10, 2015, Green OA

Indexed in arXiv

Abstract

Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train this model in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound. We also show through visualization how the model is able to automatically learn to fix its gaze on salient objects while generating the corresponding words in the output sequence. We validate the use of attention with state-of-the-art performance on three benchmark datasets: Flickr8k, Flickr30k and MS COCO.

Citation impact

Total citations: 1,764

FWCI: —
Percentile: —
References: 38

Authors: 8

Topics & keywords

Topics

Keywords

Computer science
Benchmark (surveying)
Artificial intelligence
Gaze
Visualization
Object (grammar)
Salient
Sequence (biology)