ImageBind: One Embedding Space To Bind Them All

Girdhar, Rohit; El-Nouby, Alaaeldin; Liu, Zhuang; Singh, Mannat; Alwala, Kalyan Vasudev; Joulin, Armand; Misra, Ishan

doi:10.1109/cvpr52729.2023.01457

Article · Jun 1, 2023 · Closed access

Indexed in Crossref

Abstract

We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications ‘out-of-the-box’ including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image…
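To make the abstract's "cross-modal retrieval" and "composing modalities with arithmetic" concrete, here is a minimal sketch of how those operations might look once every modality lives in one shared embedding space. It does not use the ImageBind model or its API: the vectors are hand-made toy values, and the names `retrieve`, `compose`, `image_gallery`, `audio_barking`, and `image_beach` are hypothetical, introduced only for illustration.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, gallery):
    """Return the gallery key whose embedding is most similar to the query."""
    return max(gallery, key=lambda name: cosine(query, gallery[name]))

def compose(a, b):
    """Combine two embeddings by adding their unit-normalized vectors."""
    return normalize([x + y for x, y in zip(normalize(a), normalize(b))])

# Toy 3-d embeddings (assumed values, NOT outputs of any real model).
image_gallery = {
    "dog_on_beach": [0.7, 0.7, 0.0],
    "dog_indoors": [1.0, 0.0, 0.1],
    "empty_beach": [0.0, 1.0, 0.1],
}
audio_barking = [0.9, 0.1, 0.0]  # stands in for an audio embedding
image_beach = [0.1, 0.9, 0.0]    # stands in for an image embedding

# Cross-modal retrieval: an audio query ranked against image embeddings.
audio_only_match = retrieve(audio_barking, image_gallery)

# Arithmetic composition: audio + image query, then retrieval.
composed_match = retrieve(compose(audio_barking, image_beach), image_gallery)

print(audio_only_match, composed_match)
```

With these toy values the audio-only query lands on `dog_indoors`, while the composed audio-plus-image query lands on `dog_on_beach`. Because the vectors share a single space, one similarity function serves every modality pair.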

Citation impact

Total citations: 691

FWCI: 78.50
Percentile: 100%
References: 122

Authors: 7

Topics & keywords

Keywords

Embedding
Space (punctuation)
Computer science
Artificial intelligence

UN Sustainable Development Goals

Quality Education
