CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels
Indexed in Crossref
Abstract
Pre-trained vision-language models like CLIP have recently shown superior performance on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID), the labels are indexes that lack concrete text descriptions, so it remains unclear how such models can be applied to these tasks. This paper first finds that simply fine-tuning the visual model initialized by the image encoder in CLIP already achieves competitive performance in various ReID tasks. We then propose a two-stage strategy to facilitate a better visual representation. The key idea is to fully exploit the cross-modal description ability in CLIP through a…
Citation impact
- Total citations: 232
- FWCI: 13.67
- Percentile: 100%
- References: 92
Authors: 3
Topics & keywords
Keywords
- Computer science
- Encoder
- Embedding
- Identification (biology)
- Feature (linguistics)
- Artificial intelligence
- Image (mathematics)
- Code (set theory)