ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation
Indexed in: arXiv, DataCite
Abstract
Although no specific domain knowledge is considered in the design, plain vision transformers have shown excellent performance in visual recognition tasks. However, little effort has been made to reveal the potential of such simple structures for pose estimation tasks. In this paper, we show the surprisingly good capabilities of plain vision transformers for pose estimation from various aspects, namely simplicity in model structure, scalability in model size, flexibility in training paradigm, and transferability of knowledge between models, through a simple baseline model called ViTPose. Specifically, ViTPose employs plain and non-hierarchical vision transformers as backbones to extract features for a given…
Citation impact
351 total citations
- FWCI: —
- Percentile: —
- References: 0
Authors
4
Topics & keywords
Topics
Keywords
- Computer science
- Transformer
- Scalability
- Pose
- Artificial intelligence
- Machine learning
- Benchmarking
- Benchmark (surveying)