RegionCLIP: Region-based Language-Image Pretraining

Zhong, Yiwu; Yang, Jianwei; Zhang, Pengchuan; Li, Chunyuan; Codella, Noel; Li, Liunian Harold; Zhou, Luowei; Dai, Xiyang; Yuan, Lu; Li, Yin; Gao, Jianfeng

doi:10.1109/cvpr52688.2022.01629

Article · 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) · Jun 1, 2022 · Closed access

University of Wisconsin–Madison · Microsoft Research (United Kingdom) · +2 more institutions

Indexed in Crossref

Abstract

Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to unsatisfactory performance due to a major domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual…

Citation impact

Total citations: 491

FWCI: 26.49
Percentile: 100%
References: 92

Authors: 11

Topics & keywords

Keywords

Computer science
Artificial intelligence
Image (mathematics)
Vocabulary
Task (project management)
Set (abstract data type)
Object (grammar)
Feature (linguistics)

UN Sustainable Development Goals

Quality Education
