Extended Vision Transformer (ExViT) for Land Use and Land Cover Classification: A Multimodal Deep Learning Framework
Chinese Academy of Sciences · Aerospace Information Research Institute · +5 more institutions
Abstract
The recent success of attention-driven deep models, of which the Vision Transformer (ViT) is among the most representative, has sparked a wave of research into adapting them to broader domains. However, current Transformer-based approaches in the remote sensing (RS) community focus mainly on single-modality data, which limits their ability to make full use of the ever-growing volume of multimodal Earth observation data. To this end, we propose a novel multimodal deep learning framework, abbreviated as ExViT, that extends the conventional ViT with minimal modifications and targets the task of land use and land cover classification. Unlike common stems that adopt either linear patch projection…
Citation impact
- FWCI: 52.23
- Percentile: 100%
- References: 63
Authors
- Jing Yao (corresponding author)
Chinese Academy of Sciences, Aerospace Information Research Institute
- Bing Zhang
Chinese Academy of Sciences, Aerospace Information Research Institute, University of Chinese Academy of Sciences
- Chenyu Li
Southeast University
- Danfeng Hong
Chinese Academy of Sciences, Aerospace Information Research Institute
- Jocelyn Chanussot
Institut polytechnique de Grenoble, Centre National de la Recherche Scientifique, Chinese Academy of Sciences, GIPSA-Lab, Aerospace Information Research Institute
Topics & keywords
- Computer science
- Artificial intelligence
- Deep learning
- Hyperspectral imaging
- Convolutional neural network
- Synthetic aperture radar
- Modality (human–computer interaction)
- Earth observation
- Reduced inequalities