CLIP-Driven Fine-Grained Text-Image Person Re-Identification
Nanjing University of Science and Technology · Nanjing University of Aeronautics and Astronautics
Abstract
Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to a given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Vision-language pre-training, such as CLIP (Contrastive Language-Image Pre-training), can address this limitation. However, CLIP falls short in capturing fine-grained information, so its capacity is not fully exploited in TIReID. Besides, the popular explicit local matching paradigm for mining fine-grained information heavily relies on the quality of local parts and cross-modal inter-part interaction/guidance,…
Citation impact
- FWCI: 31.30
- Percentile: 100%
- References: 70
Authors
- 4
Topics & keywords
- Computer science
- Discriminative model
- Artificial intelligence
- Feature (linguistics)
- Modality (human–computer interaction)
- Inference
- Sentence
- Feature learning
- Reduced inequalities