MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video
Wuhan University · Technical University of Munich · +1 more institution
Abstract
Recent transformer-based solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints across all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are used alternately to obtain better…
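The alternation described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the attention below has no learned projections, heads, normalization, or MLP, and the names `spatial_block`, `temporal_block`, `mixed_encoder`, the block order, the depth, and the tensor sizes are all assumptions chosen for illustration. It only shows how the same (frames, joints, channels) tensor is regrouped so that one block attends across joints within a frame and the other attends across frames for each joint separately.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens):
    # Parameter-free scaled dot-product self-attention (illustrative only).
    # tokens: (batch, N, C); attention runs over the N tokens of each batch item.
    c = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(c)
    return softmax(scores, axis=-1) @ tokens


def spatial_block(x):
    # x: (T, J, C). Each frame is one batch item; its J joints are the tokens,
    # so this block mixes information between joints of the same frame.
    return x + self_attention(x)


def temporal_block(x):
    # Regroup to (J, T, C): each joint is one batch item; its T frames are the
    # tokens, so each joint's trajectory is modelled separately from the others.
    xt = np.swapaxes(x, 0, 1)
    out = xt + self_attention(xt)
    return np.swapaxes(out, 0, 1)


def mixed_encoder(x, depth=2):
    # Alternate the two blocks `depth` times (order and depth are assumptions).
    for _ in range(depth):
        x = spatial_block(x)
        x = temporal_block(x)
    return x


rng = np.random.default_rng(0)
x = rng.standard_normal((9, 17, 8))  # hypothetical sizes: T=9, J=17, C=8
y = mixed_encoder(x)
print(y.shape)
```

Because the temporal block treats each joint as its own sequence, perturbing one joint's input changes only that joint's output from `temporal_block`; information crosses between joints only in `spatial_block`. Stacking the two is what lets both kinds of correlation accumulate.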
Citation impact
- FWCI: 19.35
- Percentile: 100%
- References: 70
Authors: 5
Topics & keywords
- Encoder
- Computer science
- Artificial intelligence
- Transformer
- Coherence (philosophical gambling strategy)
- Joint (building)
- Pattern recognition (psychology)
- Motion estimation