RegionCLIP: Region-based Language-Image Pretraining

University of Wisconsin–Madison · Microsoft Research (United Kingdom) · +2 more institutions

Indexed in Crossref

Abstract

Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to unsatisfactory performance due to a major domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual…
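As a rough illustration of the region-to-text matching the abstract describes, the sketch below assigns each region embedding to the text description with the highest cosine similarity. It is not the authors' implementation: the function names `cosine` and `match_regions_to_text` are hypothetical, and the 3-dimensional embeddings are hand-made toy values standing in for the outputs of a visual encoder and a text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match_regions_to_text(region_embs, text_embs, labels):
    """For each region embedding, return the label whose text embedding is most similar."""
    matches = []
    for region in region_embs:
        scores = [cosine(region, text) for text in text_embs]
        best = max(range(len(scores)), key=scores.__getitem__)
        matches.append(labels[best])
    return matches


# Toy inputs (assumed values, not produced by any real model).
labels = ["a photo of a dog", "a photo of a cat"]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
region_embs = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1]]

result = match_regions_to_text(region_embs, text_embs, labels)
print(result)
```

In an image-level setting the same scoring would be applied to a single embedding for the whole image; the region-level variant simply repeats it once per candidate region.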

Citation impact

Total citations: 491
FWCI: 26.49
Percentile: 100%
References: 92

Authors

11

Topics & keywords

Keywords
  • Computer science
  • Artificial intelligence
  • Image (mathematics)
  • Vocabulary
  • Task (project management)
  • Set (abstract data type)
  • Object (grammar)
  • Feature (linguistics)
UN Sustainable Development Goals
  • Quality Education