X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
Xiamen University · Alibaba Group (China)
Abstract
Video-text retrieval has been a crucial and fundamental task in multi-modal research. The development of video-text retrieval has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, which is the contrast between coarse-grained representations and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between the coarse-grained feature and each fine-grained feature, and can filter out unnecessary fine-grained features under the guidance of the coarse-grained feature…
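The idea of cross-grained contrast described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract is truncated, so the softmax weighting, the `temperature` value, and the function name `cross_grained_score` are assumptions chosen for illustration. The sketch scores one coarse-grained vector (for example a sentence embedding) against several fine-grained vectors (for example frame embeddings) and down-weights the fine-grained features that correlate poorly with the coarse one.

```python
import numpy as np


def cross_grained_score(coarse, fine, temperature=0.01):
    """Score one coarse-grained vector against a set of fine-grained vectors.

    coarse: array of shape (d,), e.g. a sentence-level embedding (hypothetical input).
    fine:   array of shape (n, d), e.g. frame-level embeddings (hypothetical input).
    temperature: softmax sharpness; 0.01 is an assumed value, not taken from the source.

    Returns (score, weights), where weights has shape (n,) and sums to 1.
    """
    coarse = np.asarray(coarse, dtype=float)
    fine = np.asarray(fine, dtype=float)

    # L2-normalise so that dot products are cosine similarities.
    coarse = coarse / np.linalg.norm(coarse)
    fine = fine / np.linalg.norm(fine, axis=1, keepdims=True)

    # Correlation between the coarse feature and each fine feature: shape (n,).
    sims = fine @ coarse

    # Numerically stable softmax over the similarities. Fine-grained features
    # with low similarity receive weights close to zero, which acts as a soft
    # filter guided by the coarse-grained feature.
    logits = sims / temperature
    logits = logits - logits.max()
    weights = np.exp(logits)
    weights = weights / weights.sum()

    # Weighted sum of similarities gives a single cross-grained score.
    score = float(weights @ sims)
    return score, weights


if __name__ == "__main__":
    sentence = np.array([1.0, 0.0])                  # toy coarse-grained feature
    frames = np.array([[1.0, 0.0], [0.0, 1.0]])      # toy fine-grained features
    print(cross_grained_score(sentence, frames))
```

A soft weighting is used here rather than a hard threshold so that the score stays differentiable, which is the usual requirement when such a score feeds a contrastive loss.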
Citation impact
- FWCI: 13.75
- Percentile: 100%
- References: 35
Authors: 6
Topics & keywords
- Computer science
- Similarity (geometry)
- Contrast (vision)
- Focus (optics)
- Feature (linguistics)
- Artificial intelligence
- Filter (signal processing)
- Information retrieval