Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model

Tsinghua University · Microsoft Research Asia (China) · +1 more institution

Indexed in Crossref

Abstract

Vision-language pre-training has recently shown great potential in open-vocabulary object detection, where detectors trained on base classes are expected to detect new classes. The class text embedding is first generated by feeding prompts to the text encoder of a pre-trained vision-language model. It is then used as the region classifier to supervise the training of a detector. The key to the success of this model is a proper prompt, which requires careful word tuning and ingenious design. To avoid laborious prompt engineering, several prompt representation learning methods have been proposed for the image classification task; however, these can only be sub-optimal solutions when…
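
The idea of using class text embeddings as a region classifier can be illustrated with a minimal sketch. This is not the paper's implementation: the embeddings below are made-up toy vectors standing in for text-encoder outputs, and the function names (`classify_region`, `l2_normalize`, `softmax`) and the `temperature` value are assumptions chosen for illustration. Each region feature is scored against every class embedding by cosine similarity, and the scores are turned into class probabilities.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_region(region_feature, class_embeddings, temperature=0.01):
    """Score one region feature against every class text embedding.

    class_embeddings maps a class name to its (hypothetical) text embedding.
    temperature is an assumed scaling factor, not a value from the paper.
    """
    region = l2_normalize(region_feature)
    names = list(class_embeddings)
    logits = []
    for name in names:
        text = l2_normalize(class_embeddings[name])
        cosine = sum(r * t for r, t in zip(region, text))
        logits.append(cosine / temperature)
    return dict(zip(names, softmax(logits)))


# Toy vectors standing in for text-encoder outputs; a new class can be
# added at test time simply by adding its embedding to this dictionary.
class_embeddings = {
    "cat": [0.9, 0.1, 0.0],
    "dog": [0.1, 0.9, 0.0],
    "zebra": [0.0, 0.1, 0.9],
}
region_feature = [0.8, 0.2, 0.1]

probs = classify_region(region_feature, class_embeddings)
best = max(probs, key=probs.get)
print(best)
```

Because the classifier weights are just text embeddings, the vocabulary is open: detecting a class unseen during training only requires embedding its prompt, which is why the choice of prompt matters so much.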

Citation impact

  • Total citations: 312
  • FWCI: 17.47
  • Percentile: 100%
  • References: 59

Authors

6

Topics & keywords

Keywords
  • Computer science
  • Pascal (unit)
  • Artificial intelligence
  • Object detection
  • Vocabulary
  • Classifier (UML)
  • Natural language processing
  • Detector
UN Sustainable Development Goals
  • Quality Education