Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Indexed in: arXiv, DataCite
Abstract
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO…
Citation impact
2,105 total citations
- FWCI: —
- Percentile: —
- References: 0
Authors: 14
Topics & keywords
Topics
Keywords
- Computer science
- Image (mathematics)
- Fidelity
- Language model
- Artificial intelligence
- Benchmark (surveying)
- Natural language processing
- Cartography
UN Sustainable Development Goals
- Quality Education