Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
National University of Singapore · Tencent (China)
Abstract
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting, One-Shot Video Tuning, in which only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video,…
Citation impact
- FWCI: 53.72
- Percentile: 100%
- References: 83
Authors: 10
Topics & keywords
- Computer science
- Artificial intelligence
- Inference
- Computer vision
- Video tracking
- Key (lock)
- Shot (pellet)
- Generator (circuit theory)