Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP

Liang, Feng; Wu, Bichen; Dai, Xiaoliang; Li, Kunpeng; Zhao, Yinan; Zhang, Hang; Zhang, Peizhao; Vajda, Peter; Marculescu, Diana

doi:10.1109/cvpr52729.2023.00682

Article · Jun 1, 2023 · Closed access

The University of Texas at Austin · META Health

Indexed in Crossref

Abstract

Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image…
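The two-stage pipeline described in the abstract (class-agnostic mask proposals, then classification of each masked region against text descriptions) can be illustrated with a minimal sketch. This is not the authors' implementation: the helper names `apply_mask` and `classify_regions`, the temperature value, and the toy embeddings are all assumptions made for illustration. In practice the region and text embeddings would come from a vision-language model such as CLIP; here they are hand-written vectors so the example runs with NumPy alone.

```python
import numpy as np

def apply_mask(image, mask, fill=0.0):
    """Blank out every pixel outside a binary mask.

    image: (H, W, C) array; mask: (H, W) array of 0/1.
    Hypothetical helper: stands in for producing a "masked image region".
    """
    return np.where(mask[..., None].astype(bool), image, fill)

def classify_regions(region_embs, text_embs, temperature=0.01):
    """Assign each region embedding to its most similar text embedding.

    Uses cosine similarity followed by a softmax over classes.
    The temperature is an assumed value, not taken from the paper.
    """
    r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs

# Toy data: a 2x2 RGB image and one mask proposal covering the diagonal.
image = np.ones((2, 2, 3))
mask = np.array([[1, 0], [0, 1]])
masked = apply_mask(image, mask)

# Stand-ins for encoder outputs: 2 text classes, 2 masked regions.
text_embs = np.array([[1.0, 0.0], [0.0, 1.0]])
region_embs = np.array([[0.9, 0.1], [0.2, 0.8]])
labels, probs = classify_regions(region_embs, text_embs)
```

Because the class set enters only through `text_embs`, new categories can be added at inference time by embedding new text descriptions, which is what makes the setup open-vocabulary. The abstract's point is that the embedding of a masked image from an off-the-shelf CLIP model tends to be unreliable, which motivates finetuning on masked regions.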

Citation impact

Total citations: 372

FWCI: 61.69
Percentile: 100%
References: 48

Authors: 9

Topics & keywords

Keywords

Computer science
Vocabulary
Segmentation
Natural language processing
Artificial intelligence
Image segmentation
Information retrieval
Computer vision

UN Sustainable Development Goals

Quality Education
