CogAgent: A Visual Language Model for GUI Agents

Hong, Wenyi; Wang, Weihan; Lv, Qingsong; Xu, Jiazheng; Yu, Wenmeng; Ji, Junhui; Wang, Yan; Wang, Zihan; Dong, Yuxiao; Ding, Ming; Tang, Jie

doi:10.1109/cvpr52733.2024.01354

Article · Jun 16, 2024 · Closed access

Tsinghua University · Zhipu AI (China)

Indexed in Crossref

Abstract

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens. Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails, but struggle to understand and interact with GUIs, thus limiting their potential to increase automation levels. In this paper, we introduce CogAgent, an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders, CogAgent supports input at a resolution of 1120 × 1120, enabling it to recognize tiny page elements and text. As a generalist visual language model, CogAgent…

Citation impact

Total citations: 134
FWCI: 30.02
Percentile: 100%
References: 60

Authors: 11

Topics & keywords

Topics

Keywords

Computer science
Programming language
Human–computer interaction
Natural language processing
Artificial intelligence
