Exploiting Temporal Contexts With Strided Transformer for 3D Human Pose Estimation
Peking University · Sun Yat-sen University · +4 more institutions
Abstract
Despite the great progress in 3D human pose estimation from videos, it is still an open problem to take full advantage of a redundant 2D pose sequence to learn representative representations for generating one 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, which simply and effectively lifts a long sequence of 2D joint locations to a single 3D pose. Specifically, a Vanilla Transformer Encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence, fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions to progressively shrink the sequence length and…
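The progressive shrinking described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, and the 27-frame input, kernel size 3, and strides (3, 3, 3) are assumptions chosen for illustration. The sketch uses the standard 1D convolution output-length formula and a toy single-channel strided convolution to show how stacked strided layers can reduce a sequence to a single element.

```python
def conv1d_out_len(length, kernel_size, stride, padding=0):
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * padding - kernel_size) // stride + 1


def strided_lengths(seq_len, kernel_size, strides):
    """Sequence length after each strided layer, starting with the input length."""
    lengths = [seq_len]
    for s in strides:
        lengths.append(conv1d_out_len(lengths[-1], kernel_size, s))
    return lengths


def strided_conv1d(seq, kernel, stride):
    """Toy single-channel strided 1D convolution (cross-correlation, no padding)."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, seq[i:i + k]))
        for i in range(0, len(seq) - k + 1, stride)
    ]


# Assumed configuration: 27 input frames, kernel size 3, three layers of stride 3.
print(strided_lengths(27, 3, [3, 3, 3]))            # [27, 9, 3, 1]

# Each output element aggregates one non-overlapping window of 3 inputs.
print(strided_conv1d(list(range(9)), [1, 1, 1], 3))  # [3, 12, 21]
```

With these assumed settings, each layer cuts the sequence length by a factor of three, so three layers map 27 frames to one aggregated element, which matches the idea of lifting a long 2D pose sequence to a single 3D pose.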
Citation impact
- FWCI: 27.10
- Percentile: 100%
- References: 70
Authors
- 6
Topics & keywords
- Encoder
- Computer science
- Transformer
- Computation
- Artificial intelligence
- Pose
- Redundancy (engineering)
- Pattern recognition (psychology)