Generating Human Motion from Textual Descriptions with Discrete Representations
Jilin University · Tencent (China) · +2 more institutions
Abstract
In this work, we investigate a simple and well-known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset,…
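The abstract does not spell out the corruption strategy, so the following is only an illustrative sketch of one common way to corrupt discrete code sequences during training: each ground-truth code index is replaced by a random index with some probability. The function name `corrupt_tokens`, the parameter `corruption_prob`, and the uniform-replacement rule are all assumptions made for illustration, not details taken from the paper.

```python
import random


def corrupt_tokens(tokens, codebook_size, corruption_prob, rng=None):
    """Return a copy of `tokens` with some indices swapped for random codes.

    Hypothetical helper: each index is independently replaced, with
    probability `corruption_prob`, by a code drawn uniformly from
    range(codebook_size). This replacement rule is an assumption.
    """
    rng = rng or random.Random()
    corrupted = []
    for tok in tokens:
        if rng.random() < corruption_prob:
            # Simulate an imperfect prefix, as the model would see at test time.
            corrupted.append(rng.randrange(codebook_size))
        else:
            corrupted.append(tok)
    return corrupted


# Example: corrupt a short sequence of VQ-VAE code indices before
# feeding it to the autoregressive model as conditioning input.
motion_codes = [17, 256, 3, 490, 88]
noisy_codes = corrupt_tokens(motion_codes, codebook_size=512,
                             corruption_prob=0.5, rng=random.Random(0))
```

The intuition is that at test time the model conditions on its own, possibly wrong, predictions, so exposing it to perturbed prefixes during training may narrow the gap between the two settings.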
Citation impact
- FWCI
- 29.25
- Percentile
- 100%
- References
- 93
Authors
- 8
Topics & keywords
- Computer science
- Autoencoder
- Consistency (knowledge bases)
- Artificial intelligence
- Generative grammar
- Simple (philosophy)
- Generative model
- Motion (physics)
- Peace, Justice and strong institutions