Article · Jun 1, 2023 · Closed access
ImageBind: One Embedding Space to Bind Them All
Indexed in Crossref
Abstract
We present ImageBind, an approach to learn a joint embedding across six different modalities: images, text, audio, depth, thermal, and IMU data. We show that not all combinations of paired data are necessary to train such a joint embedding, and that image-paired data alone is sufficient to bind the modalities together. ImageBind can leverage recent large-scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box', including cross-modal retrieval, composing modalities with arithmetic, and cross-modal detection and generation. The emergent capabilities improve with the strength of the image…
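To make "cross-modal retrieval" and "composing modalities with arithmetic" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that embeddings from every modality already live in one shared vector space and are compared by cosine similarity. The function names (`l2_normalize`, `retrieve`, `compose`) and the three-dimensional toy vectors are hypothetical, chosen only for illustration; real embeddings would come from trained encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def retrieve(query, gallery, k=1):
    """Return indices of the k gallery rows most similar to the query.

    The query and the gallery may come from different modalities; the only
    assumption is that both were embedded into the same joint space.
    """
    q = l2_normalize(query)
    g = l2_normalize(gallery)
    scores = g @ q  # cosine similarity of each gallery item to the query
    return np.argsort(-scores)[:k]


def compose(a, b):
    """Combine two embeddings by adding their unit vectors and renormalizing."""
    return l2_normalize(l2_normalize(a) + l2_normalize(b))


# Toy data (hypothetical): three "image" embeddings in a 3-D joint space.
gallery = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
])
audio_query = np.array([0.9, 0.1, 0.0])  # stands in for an audio embedding
image_query = np.array([0.1, 0.9, 0.0])  # stands in for an image embedding

audio_only = retrieve(audio_query, gallery)
combined = retrieve(compose(audio_query, image_query), gallery)
```

In this toy setup the audio query alone lands nearest the first gallery item, while the composed audio-plus-image query lands nearest the third, which lies between the two concepts. That is the intuition behind embedding arithmetic: a sum of embeddings is expected to sit close to items that share content with both inputs.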
Citation impact
- Total citations: 691
- FWCI: 78.50
- Percentile: 100%
- References: 122
Authors: 7
Topics & keywords
Topics
Keywords
- Embedding
- Space (punctuation)
- Computer science
- Artificial intelligence
UN Sustainable Development Goals
- Quality Education