CRIS: CLIP-Driven Referring Image Segmentation

University of Sydney · Beijing University of Posts and Telecommunications · +2 more institutions

Indexed in Crossref

Abstract

Referring image segmentation aims to segment a referent via a natural linguistic expression. Due to the distinct data properties between text and image, it is challenging for a network to well align text and pixel-level features. Existing approaches use pretrained models to facilitate learning, yet separately transfer the language/vision knowledge from pretrained models, ignoring the multi-modal corresponding information. Inspired by the recent advance in Contrastive Language-Image Pretraining (CLIP), in this paper, we propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS). To transfer the multi-modal knowledge effectively, CRIS resorts to vision-language decoding and contrastive…

Citation impact

  • Total citations: 342
  • FWCI: 18.71
  • Percentile: 100%
  • References: 73

Authors

7

Topics & keywords

Keywords
  • Computer science
  • Artificial intelligence
  • Referent
  • Feature (linguistics)
  • Natural language processing
  • Benchmark (surveying)
  • Natural language
  • Pixel
UN Sustainable Development Goals
  • Quality Education