ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding
Salesforce (United States) · The University of Texas at Austin · +1 more institution
Abstract
The recognition capabilities of current state-of-the-art 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories. In its 2D counterpart, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for 3D modality could be promising to improve 3D understanding under the restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP to learn a unified representation of image, text, and 3D point cloud by pre-training with object triplets from the three modalities. To overcome the shortage of…
Citation impact
- FWCI: 23.11
- Percentile: 100%
- References: 84
Authors: 9
Topics & keywords
- Computer science
- Point cloud
- Representation (politics)
- Artificial intelligence
- Contextual image classification
- Point (geometry)
- Modality (human–computer interaction)
- Modalities
- Quality Education