Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework

Yao, Jing; Zhang, Bing; Li, Chenyu; Hong, Danfeng; Chanussot, Jocelyn

doi:10.1109/tgrs.2023.3284671

Article · IEEE Transactions on Geoscience and Remote Sensing · Jan 1, 2023 · Closed access

Chinese Academy of Sciences · Aerospace Information Research Institute · +5 more institutions

Indexed in Crossref

Abstract

The recent success of attention-driven deep models, with the Vision Transformer (ViT) among the most representative, has inspired a wave of research exploring their adaptation to broader domains. However, current Transformer-based approaches in the remote sensing (RS) community focus mainly on single-modality data, which limits their ability to make full use of the ever-growing multimodal Earth observation data. To this end, we propose a novel multimodal deep learning framework that extends the conventional ViT with minimal modifications, abbreviated as ExViT, for the task of land use and land cover classification. Unlike common stems that adopt either linear patch projection…

Citation impact

Total citations: 343

FWCI: 52.23
Percentile: 100%
References: 63

Authors: 5

Topics & keywords

Keywords

Computer science
Artificial intelligence
Deep learning
Hyperspectral imaging
Convolutional neural network
Synthetic aperture radar
Modality (human–computer interaction)
Earth observation

UN Sustainable Development Goals

Reduced inequalities

Funding

National Key Research and Development Program of China
Award: 2021YFB3900502