LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Sun Yat-sen University · Microsoft Research Asia (China)
Abstract
Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and…
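The word-patch alignment objective described in the abstract can be illustrated with a minimal sketch of how its binary targets might be built. This is not the authors' implementation. The 224-pixel image, the 16-pixel patch grid, the rule of assigning a word to the patch containing its box center, and the names `patch_index` and `word_patch_labels` are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# (x0, y0, x1, y1) in pixels; the coordinate convention is an assumption.
Box = Tuple[float, float, float, float]


def patch_index(box: Box, image_size: int, patch_size: int) -> int:
    """Index of the patch that contains the center of a word box.

    Assigning a word to a single patch by its box center is an
    illustrative choice, not something stated in the abstract.
    """
    per_row = image_size // patch_size
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    # Clamp so boxes touching the right or bottom edge stay inside the grid.
    col = min(int(cx // patch_size), per_row - 1)
    row = min(int(cy // patch_size), per_row - 1)
    return row * per_row + col


def word_patch_labels(
    boxes: List[Box],
    masked_patches: Set[int],
    image_size: int = 224,  # assumed input resolution
    patch_size: int = 16,   # assumed patch size
) -> List[int]:
    """Return 1 for each word whose image patch is masked, else 0."""
    return [
        int(patch_index(box, image_size, patch_size) in masked_patches)
        for box in boxes
    ]


# Three word boxes on a 14 x 14 patch grid; patches 1 and 14 are masked.
example_boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (0, 20, 10, 30)]
example_labels = word_patch_labels(example_boxes, masked_patches={1, 14})
print(example_labels)  # [0, 1, 1]
```

A classifier head over the text-token representations would then be trained against these labels, which is how the objective ties each word to the visibility of its image region.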
Citation impact
- FWCI: 25.37
- Percentile: 100%
- References: 17
Authors
- 5
Topics & keywords
- Computer science
- Artificial intelligence
- Natural language processing
- AKA
- Modality (human–computer interaction)
- Information retrieval
- Pattern recognition (psychology)
- Quality Education