TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document

Liu, Yuliang; Yang, Biao; Liu, Qiang; Li, Zhang; Ma, Zhiyin; Zhang, Shuo; Bai, Xiang

doi:10.1109/tpami.2026.3653415

Article · IEEE Transactions on Pattern Analysis and Machine Intelligence · Jan 27, 2026 · Closed access

Huazhong University of Science and Technology · Kingsoft (China)

Indexed in: Crossref, PubMed

Abstract

We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancements across several dimensions. By adopting a Shifted Window Attention layer, we achieve cross-window connectivity at higher input resolutions and stabilize early training. We hypothesize that images may contain redundant tokens, and that by using similarity to retain only the significant tokens, we can not only shorten the token sequence but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability. Evaluation on 12 benchmarks shows notable…
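
The similarity-based token filtering mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the scoring rule, so the code below assumes that a token's redundancy is its highest cosine similarity to any other token, and that the tokens with the lowest redundancy are the "significant" ones to keep. The function names `cosine` and `select_significant_tokens` and the parameter `keep` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_significant_tokens(tokens, keep):
    """Return the sorted indices of the `keep` least redundant tokens.

    Assumption (not taken from the paper): a token's redundancy score is its
    highest cosine similarity to any other token, so a low score means the
    token is distinctive and worth keeping.
    """
    n = len(tokens)
    if keep <= 0:
        return []
    if keep >= n:
        return list(range(n))
    redundancy = []
    for i in range(n):
        # Highest similarity between token i and every other token.
        best = max(cosine(tokens[i], tokens[j]) for j in range(n) if j != i)
        redundancy.append(best)
    # Lowest redundancy first; ties are broken by original position.
    order = sorted(range(n), key=lambda i: (redundancy[i], i))
    # Restore the original token order among the kept indices.
    return sorted(order[:keep])


if __name__ == "__main__":
    # Tokens 0 and 1 are near-duplicates; tokens 2 and 3 are distinctive.
    tokens = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    print(select_significant_tokens(tokens, 3))
```

With this scoring rule, both members of a near-duplicate pair receive the same high redundancy score, so a small `keep` budget can drop both of them. A practical resampler would need some way to preserve the information in discarded tokens, which this sketch does not attempt.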

Citation impact

Total citations: 11

FWCI: 95.02
Percentile: 99%
References: 0

Authors: 7

Topics & keywords

Keywords

Spotting
Security token
Benchmark (surveying)
Key (lock)
Filter (signal processing)
Code (set theory)
Similarity (geometry)
Window (computing)

Funding

National Natural Science Foundation of China
Awards: 62225603, 62576147