ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Xu, Yufei; Zhang, Jing; Zhang, Qiming; Tao, Dacheng

doi:10.48550/arxiv.2204.12484

Preprint, arXiv (Cornell University), Apr 26, 2022, Green OA

Indexed in: arXiv, DataCite

Abstract

Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given…

Citation impact

Total citations: 351
FWCI: —
Percentile: —
References: 0
Authors: 4

Topics & keywords

Keywords

Computer science
Transformer
Scalability
Pose
Artificial intelligence
Machine learning
Benchmarking
Benchmark (surveying)
