MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose Estimation in Video

Wuhan University · Technical University of Munich · +1 more institution

Indexed in Crossref

Abstract

Recent transformer-based solutions estimate 3D human pose from a 2D keypoint sequence by considering body joints across all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the solid inter-frame correspondence of each joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are utilized alternately to obtain better…
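The alternating design described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the single-head attention, the residual-only blocks (no layer normalization, MLP, or positional embeddings), the spatial-then-temporal ordering, and all names and sizes (`spatial_block`, `temporal_block`, `mixed_encoder`, 9 frames, 17 joints, 16 channels) are assumptions chosen for illustration. The point it shows is only which axis the attention tokens come from: joints within one frame for the spatial block, frames of one joint for the temporal block.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, w_q, w_k, w_v):
    # Single-head attention over the second-to-last axis of `tokens`,
    # shaped (..., N, C); leading axes are treated as independent batches.
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def spatial_block(x, params):
    # x: (T, J, C). Tokens are the J joints of each frame,
    # so frames never exchange information here.
    return x + self_attention(x, *params)

def temporal_block(x, params):
    # Tokens are the T frames of each joint, so every joint's
    # trajectory is modeled separately from the other joints.
    xt = np.swapaxes(x, 0, 1)  # (J, T, C)
    out = xt + self_attention(xt, *params)
    return np.swapaxes(out, 0, 1)  # back to (T, J, C)

def mixed_encoder(x, layers):
    # Alternate the two blocks; the ordering is an assumption.
    for spatial_params, temporal_params in layers:
        x = spatial_block(x, spatial_params)
        x = temporal_block(x, temporal_params)
    return x

rng = np.random.default_rng(0)
T, J, C, depth = 9, 17, 16, 2  # illustrative sizes only

def init_params():
    # Hypothetical random projection weights for Q, K, V.
    return tuple(rng.normal(scale=C ** -0.5, size=(C, C)) for _ in range(3))

layers = [(init_params(), init_params()) for _ in range(depth)]
features = rng.normal(size=(T, J, C))  # stand-in for embedded 2D keypoints
encoded = mixed_encoder(features, layers)
print(encoded.shape)  # (9, 17, 16)
```

Because the temporal block moves the joint axis into the batch position, changing one joint's input leaves every other joint's temporal output untouched; information crosses joints only through the spatial block.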

Citation impact

Total citations: 365
FWCI: 19.35
Percentile: 100%
References: 70

Authors

5

Topics & keywords

Keywords
  • Encoder
  • Computer science
  • Artificial intelligence
  • Transformer
  • Coherence (philosophical gambling strategy)
  • Joint (building)
  • Pattern recognition (psychology)
  • Motion estimation