VSE++: Improving Visual-Semantic Embeddings with Hard Negatives
Indexed in: arXiv, DataCite
Abstract
We present a new technique for learning visual-semantic embeddings for cross-modal retrieval. Inspired by hard negative mining, the use of hard negatives in structured prediction, and ranking loss functions, we introduce a simple change to common loss functions used for multi-modal embeddings. That, combined with fine-tuning and use of augmented data, yields significant gains in retrieval performance. We showcase our approach, VSE++, on MS-COCO and Flickr30K datasets, using ablation studies and comparisons with existing methods. On MS-COCO our approach outperforms state-of-the-art methods by 8.8% in caption retrieval and 11.3% in image retrieval (at R@1).
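The abstract does not spell out the "simple change" to the loss. The sketch below assumes it means replacing the sum over all negatives in a triplet ranking loss with the single hardest negative in the mini-batch, which is how the method is usually described. The function names, the toy similarity matrix, and the margin of 0.2 are illustrative assumptions, not taken from this page.

```python
# Illustrative sketch only: contrasts a sum-of-hinges ranking loss with a
# max-of-hinges (hardest negative) variant. Names and values are hypothetical.

def hinge(x):
    """Standard hinge: max(0, x)."""
    return max(0.0, x)


def sum_hinge_loss(scores, margin=0.2):
    """scores[i][j] is the similarity of image i and caption j.
    Matching pairs sit on the diagonal; every other entry is a negative."""
    n = len(scores)
    total = 0.0
    for k in range(n):
        pos = scores[k][k]
        for j in range(n):
            if j == k:
                continue
            total += hinge(margin + scores[k][j] - pos)  # negative caption j for image k
            total += hinge(margin + scores[j][k] - pos)  # negative image j for caption k
    return total


def max_hinge_loss(scores, margin=0.2):
    """Same setup, but only the hardest negative in each direction contributes."""
    n = len(scores)
    total = 0.0
    for k in range(n):
        pos = scores[k][k]
        neg_caps = [scores[k][j] for j in range(n) if j != k]
        neg_imgs = [scores[j][k] for j in range(n) if j != k]
        if neg_caps:  # a batch of size 1 has no negatives
            total += hinge(margin + max(neg_caps) - pos)
            total += hinge(margin + max(neg_imgs) - pos)
    return total


# Toy 3x3 similarity matrix (assumed values, for illustration).
scores = [
    [0.9, 0.8, 0.1],
    [0.2, 0.7, 0.3],
    [0.4, 0.6, 0.5],
]
print(sum_hinge_loss(scores), max_hinge_loss(scores))
```

On this toy matrix the sum variant comes to roughly 0.9 and the max variant to roughly 0.7. The max variant can never exceed the sum variant, because the largest hinge term in each direction is one of the terms being summed.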
Citation impact
- Total citations: 580
- FWCI: n/a
- Percentile: n/a
- References: 29
Authors: 4
Topics & keywords
Topics
Keywords
- Computer science
- Negative
- Artificial intelligence
- Information retrieval
- Art
- Visual arts
UN Sustainable Development Goals
- Quality Education