X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Xiamen University · Alibaba Group (China)

Indexed in Crossref

Abstract

Video-text retrieval is a crucial and fundamental task in multi-modal research. Its development has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, that is, the contrast between coarse-grained and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between a coarse-grained feature and each fine-grained feature, and can filter out unnecessary fine-grained features under the guidance of the coarse-grained feature…
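
The cross-grained idea described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract is truncated here, so the function name `cross_grained_score`, the softmax weighting, and the `temperature` parameter are all assumptions chosen for illustration. The sketch only shows the general pattern of scoring one coarse-grained feature (for example, a video-level embedding) against each fine-grained feature (for example, per-word embeddings) and down-weighting the fine-grained features that correlate poorly with it.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def cross_grained_score(coarse, fine, temperature=0.1):
    """Hypothetical cross-grained similarity (illustrative, not the paper's method).

    coarse: shape (D,)   one coarse-grained feature
    fine:   shape (N, D) N fine-grained features
    Returns (score, weights, sims).
    """
    coarse = l2_normalize(np.asarray(coarse, dtype=float))
    fine = l2_normalize(np.asarray(fine, dtype=float))

    # Correlation between the coarse feature and each fine-grained feature.
    sims = fine @ coarse  # shape (N,)

    # Assumed aggregation: a softmax over the similarities, so weakly
    # correlated fine-grained features receive weights close to zero.
    logits = sims / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    weights /= weights.sum()

    # Weighted sum of similarities gives a single instance-level score.
    score = float(weights @ sims)
    return score, weights, sims


if __name__ == "__main__":
    video = np.array([1.0, 0.0])                 # coarse-grained feature
    words = np.array([[1.0, 0.0],                # aligned with the video
                      [0.0, 1.0],                # unrelated
                      [-1.0, 0.0]])              # opposed
    score, weights, sims = cross_grained_score(video, words)
    print("similarities:", sims)
    print("weights:", weights)
    print("score:", score)
```

With a small `temperature`, the weighting approaches picking the single best-matching fine-grained feature; with a large one, it approaches a plain average. Either extreme is a design choice of this sketch rather than something stated in the abstract.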

Citation impact

Total citations: 259
FWCI: 13.75
Percentile: 100%
References: 35

Authors

6 authors

Topics & keywords

Keywords
  • Computer science
  • Similarity (geometry)
  • Contrast (vision)
  • Focus (optics)
  • Feature (linguistics)
  • Artificial intelligence
  • Filter (signal processing)
  • Information retrieval