RegionCLIP: Region-based Language-Image Pretraining
University of Wisconsin–Madison · Microsoft Research (United Kingdom) · +2 more institutions
Abstract
Contrastive language-image pretraining (CLIP) using image-text pairs has achieved impressive results on image classification in both zero-shot and transfer learning settings. However, we show that directly applying such models to recognize image regions for object detection leads to unsatisfactory performance due to a major domain shift: CLIP was trained to match an image as a whole to a text description, without capturing the fine-grained alignment between image regions and text spans. To mitigate this issue, we propose a new method called RegionCLIP that significantly extends CLIP to learn region-level visual representations, thus enabling fine-grained alignment between image regions and textual…
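To make the idea of region-text alignment concrete, the following is a minimal sketch of how region embeddings might be matched against text embeddings by cosine similarity, in the style of CLIP zero-shot classification. It is an illustration only, not the RegionCLIP training procedure: the function names (`l2_normalize`, `softmax`, `match_regions_to_text`), the `temperature` value, and the toy three-dimensional vectors are all assumptions made for this example and do not come from any real encoder.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def match_regions_to_text(region_embs, text_embs, temperature=0.01):
    """For each region embedding, return (best text index, probabilities over texts).

    The temperature is a hypothetical value chosen for this sketch.
    """
    regions = [l2_normalize(r) for r in region_embs]
    texts = [l2_normalize(t) for t in text_embs]
    results = []
    for r in regions:
        # Cosine similarity with every text embedding, sharpened by the temperature.
        sims = [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        probs = softmax(sims)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((best, probs))
    return results


# Toy embeddings (assumed values, standing in for encoder outputs).
region_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]]
text_embs = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
labels = ["a photo of a dog", "a photo of a kite"]

for i, (best, _) in enumerate(match_regions_to_text(region_embs, text_embs)):
    print(f"region {i} -> {labels[best]}")
```

In this sketch each region is scored independently against the text prompts, in contrast to whole-image matching, where a single embedding represents the entire image.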
Citation impact
- FWCI: 26.49
- Percentile: 100%
- References: 92
Authors
11
Topics & keywords
- Computer science
- Artificial intelligence
- Image (mathematics)
- Vocabulary
- Task (project management)
- Set (abstract data type)
- Object (grammar)
- Feature (linguistics)
- Quality Education