RemoteCLIP: A Vision Language Foundation Model for Remote Sensing
Hohai University · Hong Kong University of Science and Technology · +4 more institutions
Abstract
General-purpose foundation models have led to recent breakthroughs in artificial intelligence. In remote sensing, self-supervised learning (SSL) and Masked Image Modeling (MIM) have been adopted to build foundation models. However, these models primarily learn low-level features and require annotated data for fine-tuning. Moreover, they are inapplicable for retrieval and zero-shot applications due to the lack of language understanding. To address these limitations, we propose RemoteCLIP, the first vision-language foundation model for remote sensing that aims to learn robust visual features with rich semantics and aligned text embeddings for seamless downstream application. To address the scarcity of…
Citation impact
- FWCI: 70.03
- Percentile: 100%
- References: 124
Authors
8
Topics & keywords
- Computer science
- Artificial intelligence
- Leverage (statistics)
- Machine learning
- Language model
- Benchmark (surveying)
- Information retrieval