Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP
The University of Texas at Austin · META Health
Abstract
Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image…
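The two-stage paradigm described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the embeddings are hand-written placeholder vectors standing in for what a CLIP image encoder (applied to each masked region) and text encoder (applied to each class description) would produce, and the names `classify_regions`, `paint_segmentation`, and the temperature value are assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=0.01):
    """Numerically stable softmax; the temperature is an assumed value."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def classify_regions(region_embeddings, text_embeddings, class_names):
    """Stage 2: assign each mask proposal the best-matching text description."""
    labels = []
    for emb in region_embeddings:
        sims = [cosine(emb, t) for t in text_embeddings]
        probs = softmax(sims)
        best = max(range(len(probs)), key=probs.__getitem__)
        labels.append((class_names[best], probs[best]))
    return labels


def paint_segmentation(masks, labels, height, width, background="unlabeled"):
    """Combine class-agnostic binary masks and their labels into a label map."""
    seg = [[background] * width for _ in range(height)]
    for mask, (name, _prob) in zip(masks, labels):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    seg[y][x] = name
    return seg


# Stage 1 output (placeholder): two class-agnostic masks on a 2x2 image.
masks = [
    [[1, 0], [1, 0]],
    [[0, 1], [0, 1]],
]
# Placeholder embeddings; a real pipeline would obtain these from CLIP.
class_names = ["cat", "grass"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0]]
region_embeddings = [[0.9, 0.1], [0.2, 0.8]]

labels = classify_regions(region_embeddings, text_embeddings, class_names)
seg = paint_segmentation(masks, labels, height=2, width=2)
```

Because the class names are only used to look up text embeddings, the label set can be changed at inference time without retraining, which is what makes the setup open-vocabulary. The bottleneck the abstract identifies sits in the step that produces `region_embeddings`: a CLIP model pre-trained on natural images has to embed masked regions it was not trained on.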
Citation impact
- FWCI: 61.69
- Percentile: 100%
- References: 48
Authors: 9
Topics & keywords
- Computer science
- Vocabulary
- Segmentation
- Natural language processing
- Artificial intelligence
- Image segmentation
- Information retrieval
- Computer vision
- Quality Education