Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model
Tsinghua University · Microsoft Research Asia (China) · +1 more institution
Abstract
Recently, vision-language pre-training has shown great potential in open-vocabulary object detection, where detectors trained on base classes are designed to detect new classes. The class text embedding is first generated by feeding prompts to the text encoder of a pre-trained vision-language model. It is then used as the region classifier to supervise the training of a detector. The key to the success of this approach is a proper prompt, which requires careful word tuning and ingenious design. To avoid laborious prompt engineering, several prompt representation learning methods have been proposed for the image classification task; these, however, can only be sub-optimal solutions when…
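To make the "text embedding as region classifier" step concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the toy three-dimensional embeddings, and the temperature value are all assumptions chosen for illustration. In a real system the class embeddings would come from a pre-trained text encoder fed one prompt per class, and the region embedding from the detector's region head.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def classify_region(region_embedding, class_embeddings, temperature=0.01):
    """Score one region feature against per-class text embeddings.

    class_embeddings maps a class name to its text embedding. The temperature
    is a hypothetical value, not one taken from the paper.
    """
    region = l2_normalize(region_embedding)
    logits = {}
    for name, embedding in class_embeddings.items():
        text = l2_normalize(embedding)
        cosine = sum(r * t for r, t in zip(region, text))
        logits[name] = cosine / temperature

    # Numerically stable softmax over the class logits.
    peak = max(logits.values())
    exps = {name: math.exp(value - peak) for name, value in logits.items()}
    total = sum(exps.values())
    return {name: value / total for name, value in exps.items()}


# Toy stand-ins for text-encoder outputs (assumed values, for illustration only).
class_embeddings = {
    "cat": [1.0, 0.1, 0.0],
    "dog": [0.0, 1.0, 0.1],
}
probs = classify_region([0.9, 0.2, 0.0], class_embeddings)
best = max(probs, key=probs.get)
```

Because the classifier is just a set of text embeddings, adding a new class at inference time only requires adding another entry to `class_embeddings`; no detector weights change in this sketch.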
Citation impact
- FWCI: 17.47
- Percentile: 100%
- References: 59
Authors
- 6
Topics & keywords
- Computer science
- Pascal (unit)
- Artificial intelligence
- Object detection
- Vocabulary
- Classifier (UML)
- Natural language processing
- Detector
- Quality Education