Preprint | arXiv (Cornell University) | Apr 26, 2022 | Green OA

ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation

Indexed in: arXiv, DataCite

Abstract

Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given…
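As a small illustration of what "plain and non-hierarchical" means for a backbone, the sketch below computes the token grid seen by each transformer block. It assumes the standard plain ViT layout, in which the image is split into fixed-size patches once and the grid is never downsampled afterwards. The function name, the 256x192 input size, the 16-pixel patch size, and the 12 blocks are hypothetical values chosen for illustration, not settings taken from this record.

```python
def plain_vit_token_grid(height, width, patch_size, num_blocks):
    """Return the (rows, cols) token grid seen by each block of a plain ViT.

    Assumption: a plain (non-hierarchical) ViT patchifies the image once and
    keeps the same token grid through every transformer block, unlike a
    hierarchical backbone that reduces resolution between stages.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    grid = (height // patch_size, width // patch_size)
    # Same grid repeated for every block: no downsampling between blocks.
    return [grid] * num_blocks


# Hypothetical settings, for illustration only.
grids = plain_vit_token_grid(height=256, width=192, patch_size=16, num_blocks=12)
print(grids[0], grids[-1], len(grids))
```

Because the grid is identical at the first and last block, the features leaving the backbone have a single, fixed spatial resolution, which is the structural simplicity the abstract refers to.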

Citation impact

351 total citations
References: 0

Authors

4

Topics & keywords

Keywords
  • Computer science
  • Transformer
  • Scalability
  • Pose
  • Artificial intelligence
  • Machine learning
  • Benchmarking
  • Benchmark (surveying)