Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models

Kiros, Ryan; Salakhutdinov, Ruslan; Zemel, Richard S.

doi:10.48550/arxiv.1411.2539

Preprint | arXiv (Cornell University) | Nov 10, 2014 | Green OA

University of Toronto

Indexed in: arXiv, DataCite

Abstract

Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a) a multimodal joint embedding space with images and text and (b) a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using LSTM to encode sentences, we match…
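
The ranking role of the encoder described in the abstract can be illustrated with a minimal sketch. It assumes that images and sentences have already been mapped to vectors in one shared space and that cosine similarity is the scoring function; the abstract shown here does not specify the scoring function, and the function names and toy vectors below are hypothetical, not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_sentences(image_vec, sentence_vecs):
    """Return sentence indices ordered from most to least similar to the image."""
    scores = [cosine(image_vec, s) for s in sentence_vecs]
    return sorted(range(len(sentence_vecs)), key=lambda i: scores[i], reverse=True)


# Hypothetical 3-dimensional embeddings standing in for encoder outputs.
image_vec = [1.0, 0.0, 0.0]
sentence_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the image vector
    [0.9, 0.1, 0.0],  # nearly aligned with the image vector
    [0.5, 0.5, 0.0],  # partially aligned
]

ranking = rank_sentences(image_vec, sentence_vecs)
print(ranking)
```

Swapping the roles of the arguments (one sentence vector against a list of image vectors) would give the reverse direction, ranking images for a sentence, under the same assumptions.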

Citation impact

Total citations: 1,323

References: 47

Authors: 3

Topics & keywords

Keywords

Computer science
Embedding
Artificial intelligence
Autoencoder
Sentence
Pipeline (software)
Encoder
Convolutional neural network

UN Sustainable Development Goals

Quality Education
