Article · IEEE Transactions on Multimedia · Jan 7, 2022 · Closed access

Exploiting Temporal Contexts With Strided Transformer for 3D Human Pose Estimation

Peking University · Sun Yat-sen University · +4 more institutions

Indexed in Crossref

Abstract

Despite the great progress in 3D human pose estimation from videos, it is still an open problem to take full advantage of a redundant 2D pose sequence to learn representative representations for generating one 3D pose. To this end, we propose an improved Transformer-based architecture, called Strided Transformer, which simply and effectively lifts a long sequence of 2D joint locations to a single 3D pose. Specifically, a Vanilla Transformer Encoder (VTE) is adopted to model long-range dependencies of 2D pose sequences. To reduce the redundancy of the sequence, fully-connected layers in the feed-forward network of VTE are replaced with strided convolutions to progressively shrink the sequence length and…
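
The length reduction described in the abstract can be illustrated with a small pure-Python sketch. It is not the authors' implementation: the 27-frame input, the kernel size of 3, the stride of 3, the averaging weights, and the helper names `conv_output_length`, `strided_conv1d`, and `shrink_schedule` are all assumptions chosen for illustration. It only shows how an unpadded strided 1D convolution shortens a sequence layer by layer.

```python
# Illustrative sketch only; not the authors' implementation.
# Assumed values (not from the abstract): 27 frames, kernel size 3, stride 3.


def conv_output_length(length, kernel_size, stride):
    """Output length of an unpadded 1D convolution."""
    return (length - kernel_size) // stride + 1


def strided_conv1d(sequence, weights, stride):
    """Single-channel strided 1D convolution over a list of floats.

    `weights` holds one scalar per kernel position. A real layer would
    also mix feature channels; that part is omitted here.
    """
    k = len(weights)
    out_len = conv_output_length(len(sequence), k, stride)
    return [
        sum(w * sequence[i * stride + j] for j, w in enumerate(weights))
        for i in range(out_len)
    ]


def shrink_schedule(length, kernel_size, strides):
    """Sequence length before the first layer and after each strided layer."""
    lengths = [length]
    for s in strides:
        length = conv_output_length(length, kernel_size, s)
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    # Stand-in for one feature channel of a 27-frame 2D pose sequence.
    x = [float(t) for t in range(27)]
    averaging_kernel = [1 / 3, 1 / 3, 1 / 3]  # hypothetical weights
    for _ in range(3):
        x = strided_conv1d(x, averaging_kernel, stride=3)
        print(len(x))  # prints 9, then 3, then 1
```

Under these assumed settings, three stacked layers take the sequence from 27 frames down to one element, which matches the abstract's idea of progressively shrinking a long input toward a single output pose.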

Citation impact

  • Total citations: 293
  • FWCI: 27.10
  • Percentile: 100%
  • References: 70

Authors

6 authors

Topics & keywords

Keywords
  • Encoder
  • Computer science
  • Transformer
  • Computation
  • Artificial intelligence
  • Pose
  • Redundancy (engineering)
  • Pattern recognition (psychology)