CLIP-ReID: Exploiting Vision-Language Model for Image Re-identification without Concrete Text Labels

East China Normal University

Indexed in Crossref

Abstract

Pre-trained vision-language models such as CLIP have recently shown superior performance on various downstream tasks, including image classification and segmentation. However, in fine-grained image re-identification (ReID), the labels are indexes that lack concrete text descriptions, so it remains unclear how such models can be applied to these tasks. This paper first finds that simply fine-tuning the visual model initialized from the image encoder in CLIP already achieves competitive performance on various ReID tasks. We then propose a two-stage strategy to facilitate a better visual representation. The key idea is to fully exploit the cross-modal description ability in CLIP through a…
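
To make the task setting concrete, here is a minimal sketch of ReID as retrieval, where each gallery image carries only an integer identity index rather than a text label. It is not the paper's method: the function names (`cosine`, `rank_gallery`) are hypothetical, and the toy vectors are made-up stand-ins for whatever embeddings an image encoder would produce.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def rank_gallery(query, gallery):
    """Return gallery identity indexes sorted by descending similarity to the query."""
    scored = sorted(gallery, key=lambda item: cosine(query, item[1]), reverse=True)
    return [identity for identity, _ in scored]

# Assumption: toy embeddings standing in for image-encoder outputs.
# Each entry is (identity index, embedding); the index is the only label.
gallery = [
    (0, [1.0, 0.0, 0.0]),
    (1, [0.0, 1.0, 0.0]),
    (2, [0.0, 0.0, 1.0]),
]
query = [0.1, 0.9, 0.2]

ranking = rank_gallery(query, gallery)
print(ranking)
```

The identities are plain integers throughout, which is why there is no natural text prompt to pair with each image.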

Citation impact

Total citations: 232
FWCI: 13.67
Percentile: 100%
References: 92

Authors

3

Topics & keywords

Keywords
  • Computer science
  • Encoder
  • Embedding
  • Identification (biology)
  • Feature (linguistics)
  • Artificial intelligence
  • Image (mathematics)
  • Code (set theory)

Funding