CLIP-Driven Fine-Grained Text-Image Person Re-Identification

Yan, Shuanglin; Dong, Neng; Zhang, Liyan; Tang, Jinhui

doi:10.1109/tip.2023.3327924

Article · IEEE Transactions on Image Processing · Jan 1, 2023 · Closed access

Nanjing University of Science and Technology · Nanjing University of Aeronautics and Astronautics

Indexed in: Crossref, PubMed

Abstract

Text-Image Person Re-identification (TIReID) aims to retrieve the image corresponding to the given text query from a pool of candidate images. Existing methods employ prior knowledge from single-modality pre-training to facilitate learning, but lack multi-modal correspondence information. Vision-Language Pre-training, such as CLIP (Contrastive Language-Image Pretraining), can address the limitation. However, CLIP falls short in capturing fine-grained information, thereby not fully leveraging its powerful capacity in TIReID. Besides, the popular explicit local matching paradigm for mining fine-grained information heavily relies on the quality of local parts and cross-modal inter-part interaction/guidance,…

Citation impact

Total citations: 275
FWCI: 31.30
Percentile: 100%
References: 70

Authors: 4

Topics & keywords

Keywords

Computer science
Discriminative model
Artificial intelligence
Feature (linguistics)
Modality (human–computer interaction)
Inference
Sentence
Feature learning

UN Sustainable Development Goals

Reduced inequalities
