Article · Jun 1, 2023 · Closed access

Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

The University of Texas at Austin · META Health

Indexed in Crossref

Abstract

Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image…
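
The second stage described in the abstract, classifying each masked region against free-form text labels, can be illustrated with a minimal sketch. It assumes the region and text embeddings have already been produced by a CLIP-like encoder; the toy vectors, the function names `cosine` and `classify_regions`, and the temperature value are hypothetical and not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def classify_regions(region_embs, text_embs, labels, temperature=0.01):
    """Assign each masked-region embedding the best-matching text label.

    Assumption: embeddings come from a CLIP-like image/text encoder pair.
    Returns a list of (label, probability) tuples, one per region.
    """
    results = []
    for emb in region_embs:
        # Similarity to every candidate label, scaled by the temperature.
        logits = [cosine(emb, t) / temperature for t in text_embs]
        # Numerically stable softmax over the candidate labels.
        top = max(logits)
        exps = [math.exp(x - top) for x in logits]
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(labels)), key=probs.__getitem__)
        results.append((labels[best], probs[best]))
    return results


# Toy stand-ins for encoder outputs (hypothetical values).
labels = ["cat", "grass"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

predictions = classify_regions(region_embs, text_embs, labels)
```

Because the label set is supplied as text at inference time, new categories can be added by embedding new descriptions, without retraining the classifier in this sketch.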

Citation impact

  • Total citations: 372
  • FWCI: 61.69
  • Percentile: 100%
  • References: 48

Authors

9 authors

Topics & keywords

Keywords
  • Computer science
  • Vocabulary
  • Segmentation
  • Natural language processing
  • Artificial intelligence
  • Image segmentation
  • Information retrieval
  • Computer vision
UN Sustainable Development Goals
  • Quality Education