LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Sun Yat-sen University · Microsoft Research Asia (China)

Indexed in Crossref

Abstract

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and…
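
The word-patch alignment objective described in the abstract can be illustrated with a small labeling sketch. This is not the authors' implementation. The function name `wpa_labels`, the flat word-to-patch index mapping, and the 1/0 label encoding are hypothetical choices made for illustration. The sketch also assumes that text tokens which are themselves masked are skipped (marked `None`) rather than labeled, which the abstract excerpt does not state.

```python
from typing import List, Optional


def wpa_labels(
    word_to_patch: List[int],
    patch_masked: List[bool],
    word_masked: List[bool],
) -> List[Optional[int]]:
    """Build word-patch alignment targets (illustrative sketch only).

    word_to_patch[i] is the index of the image patch covering word i.
    patch_masked[j] is True if image patch j was masked.
    word_masked[i] is True if word i was masked on the text side.

    Returns, per word: 1 if its patch is unmasked ("aligned"),
    0 if its patch is masked ("unaligned"), or None if the word itself
    is masked and is skipped (an assumption of this sketch).
    """
    labels: List[Optional[int]] = []
    for patch_idx, is_word_masked in zip(word_to_patch, word_masked):
        if is_word_masked:
            # Assumption: masked text tokens get no alignment target.
            labels.append(None)
        else:
            labels.append(0 if patch_masked[patch_idx] else 1)
    return labels


# Four words laid out over three patches; patch 1 and word 1 are masked.
word_to_patch = [0, 0, 1, 2]
patch_masked = [False, True, False]
word_masked = [False, True, False, False]
labels = wpa_labels(word_to_patch, patch_masked, word_masked)
print(labels)
```

In a training setup, targets like these would typically feed a binary classifier on top of each text token's contextual representation, with the `None` positions excluded from the loss.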

Citation impact

Total citations: 478
FWCI: 25.37
Percentile: 100%
References: 17

Authors

5

Topics & keywords

Keywords
  • Computer science
  • Artificial intelligence
  • Natural language processing
  • AKA
  • Modality (human–computer interaction)
  • Information retrieval
  • Pattern recognition (psychology)
UN Sustainable Development Goals
  • Quality Education