Preprint · arXiv (Cornell University) · May 23, 2022 · Green OA

Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding

Indexed in: arXiv, DataCite

Abstract

We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. Imagen achieves a new state-of-the-art FID score of 7.27 on the COCO…

Citation impact

2,105 total citations
References: 0

Authors

14

Topics & keywords

Keywords
  • Computer science
  • Image (mathematics)
  • Fidelity
  • Language model
  • Artificial intelligence
  • Benchmark (surveying)
  • Natural language processing
  • Cartography
UN Sustainable Development Goals
  • Quality Education