CRIS: CLIP-Driven Referring Image Segmentation

Wang, Zhaoqing; Lu, Yu; Li, Qiang; Tao, Xunqiang; Guo, Yandong; Gong, Mingming; Liu, Tongliang

doi:10.1109/cvpr52688.2022.01139

Article · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Jun 1, 2022 · Closed access

University of Sydney · Beijing University of Posts and Telecommunications · +2 more institutions

Indexed in Crossref

Abstract

Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties between text and image, it is challenging for a network to well align text and pixel-level features. Existing approaches use pretrained models to facilitate learning, yet separately transfer the language/vision knowledge from pretrained models, ignoring the multi-modal corresponding information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive…

Citation impact

Total citations: 342

FWCI: 18.71
Percentile: 100%
References: 73

Authors: 7

Topics & keywords

Keywords

Computer science
Artificial intelligence
Referent
Feature (linguistics)
Natural language processing
Benchmark (surveying)
Natural language
Pixel

UN Sustainable Development Goals

Quality Education
