X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval

Ma, Yiwei; Xu, Guohai; Sun, Xiaoshuai; Yan, Ming; Zhang, Ji; Ji, Rongrong

doi:10.1145/3503161.3547910

Article · Proceedings of the 30th ACM International Conference on Multimedia · Oct 10, 2022 · Closed access

Xiamen University · Alibaba Group (China)

Indexed in Crossref

Abstract

Video-text retrieval is a crucial and fundamental task in multi-modal research. Its development has been considerably promoted by large-scale multi-modal contrastive pre-training, which primarily focuses on coarse-grained or fine-grained contrast. However, cross-grained contrast, that is, the contrast between coarse-grained and fine-grained representations, has rarely been explored in prior research. Compared with fine-grained or coarse-grained contrast, cross-grained contrast calculates the correlation between a coarse-grained feature and each fine-grained feature, and can filter out unnecessary fine-grained features under the guidance of the coarse-grained feature…
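To make the idea of cross-grained contrast concrete, the following is a minimal sketch, not the authors' implementation. It assumes one coarse-grained feature (for example, a video-level embedding) and several fine-grained features (for example, word-level embeddings), scores each pair with cosine similarity, and then aggregates the scores with softmax weights so that weakly correlated fine-grained features contribute little. The function names, the toy vectors, the `temperature` value, and the softmax aggregation itself are illustrative assumptions; the abstract above does not specify how the filtering is performed.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_grained_scores(coarse, fine_list):
    """Cosine similarity between one coarse feature and each fine feature."""
    c = l2_normalize(coarse)
    return [
        sum(a * b for a, b in zip(c, l2_normalize(f)))
        for f in fine_list
    ]


def softmax_weighted_score(scores, temperature=0.1):
    """Aggregate per-feature scores with softmax weights.

    Assumed aggregation: higher-scoring fine-grained features receive
    larger weights, so poorly matching ones are softly filtered out.
    """
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    aggregated = sum(w * s for w, s in zip(weights, scores))
    return aggregated, weights


# Hypothetical toy data: one video-level feature, three word-level features.
video_feature = [1.0, 0.0, 0.0]
word_features = [
    [0.9, 0.1, 0.0],  # closely aligned with the video feature
    [0.0, 1.0, 0.0],  # orthogonal, i.e. uninformative for this video
    [0.7, 0.0, 0.7],  # partially aligned
]

scores = cross_grained_scores(video_feature, word_features)
aggregated, weights = softmax_weighted_score(scores)
```

In this toy setup the orthogonal word feature receives a near-zero weight, which is one simple way to picture "filtering guided by the coarse-grained feature". A lower `temperature` sharpens the weighting toward the best-matching feature, while a higher one approaches a plain average.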

Citation impact

Total citations: 259
FWCI: 13.75
Percentile: 100%
References: 35
Authors: 6

Keywords

Computer science
Similarity (geometry)
Contrast (vision)
Focus (optics)
Feature (linguistics)
Artificial intelligence
Filter (signal processing)
Information retrieval