LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking

Huang, Yupan; Lv, Tengchao; Cui, Lei; Lu, Yutong; Wei, Furu

doi:10.1145/3503161.3548112

Article · Proceedings of the 30th ACM International Conference on Multimedia · Oct 10, 2022 · Closed access

Yupan Huang, Tengchao Lv, Lei Cui, Yutong Lu, Furu Wei

Sun Yat-sen University · Microsoft Research Asia (China)

Indexed in Crossref

Abstract

Self-supervised pre-training techniques have achieved remarkable progress in Document AI. Most multimodal pre-trained models use a masked language modeling objective to learn bidirectional representations on the text modality, but they differ in pre-training objectives for the image modality. This discrepancy adds difficulty to multimodal representation learning. In this paper, we propose LayoutLMv3 to pre-train multimodal Transformers for Document AI with unified text and image masking. Additionally, LayoutLMv3 is pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and…
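The word-patch alignment objective mentioned in the abstract can be illustrated with a minimal sketch. It only builds the binary targets ("is this word's image patch masked?") and is not the authors' implementation. The function name, the one-patch-per-word mapping, and the toy inputs are all assumptions made for illustration.

```python
from typing import List, Set


def word_patch_alignment_labels(word_to_patch: List[int],
                                masked_patches: Set[int]) -> List[int]:
    """Return 1 for each word whose image patch is masked, else 0.

    Assumption: every word maps to exactly one patch index on a
    flattened patch grid (a simplification made for this sketch).
    """
    return [1 if patch in masked_patches else 0 for patch in word_to_patch]


# Hypothetical toy input: four words and the patch index each one falls in.
word_to_patch = [0, 1, 1, 5]
# Hypothetical set of patch indices hidden by image masking.
masked_patches = {1, 4}

labels = word_patch_alignment_labels(word_to_patch, masked_patches)
print(labels)  # [0, 1, 1, 0]
```

In a training setup, a classifier over the text token representations would presumably be trained against targets of this kind; that part is left out of the sketch.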

Citation impact

Total citations: 478

FWCI: 25.37
Percentile: 100%
References: 17

Authors: 5

Topics & keywords

Keywords

Computer science
Artificial intelligence
Natural language processing
AKA
Modality (human–computer interaction)
Information retrieval
Pattern recognition (psychology)

UN Sustainable Development Goals

Quality Education
