CogView: Mastering Text-to-Image Generation via Transformers

Ming Ding; Zhuoyi Yang; Wenyi Hong; Wendi Zheng; Chang Zhou; Da Yin; Junyang Lin; Xu Zou; Zhou Shao; Hongxia Yang; Jie Tang

CogView: Mastering Text-to-Image Generation via Transformers

Ming Ding, Zhuoyi Yang, Wenyi Hong, Wendi Zheng, Chang Zhou, Da Yin, Junyang Lin, Xu Zou, Zhou Shao, Hongxia Yang, Jie Tang

Published: 09 Nov 2021, Last Modified: 04 May 2025NeurIPS 2021 PosterReaders: Everyone

Keywords: Transformer, pretraining, generative model, cross-modality

Abstract: Text-to-Image generation in the general domain has long been an open problem, which requires both a powerful generative model and cross-modal understanding. We propose CogView, a 4-billion-parameter Transformer with VQ-VAE tokenizer to advance this problem. We also demonstrate the finetuning strategies for various downstream tasks, e.g. style learning, super-resolution, text-image ranking and fashion design, and methods to stabilize pretraining, e.g. eliminating NaN losses. CogView achieves the state-of-the-art FID on the blurred MS COCO dataset, outperforming previous GAN-based models and a recent similar work DALL-E.

Code Of Conduct: I certify that all co-authors of this work have read and commit to adhering to the NeurIPS Statement on Ethics, Fairness, Inclusivity, and Code of Conduct.

Supplementary Material: pdf

TL;DR: CogView: Mastering Text-to-Image Generation via Transformers

Code: https://github.com/THUDM/CogView

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 3 code implementations](https://www.catalyzex.com/paper/cogview-mastering-text-to-image-generation/code)

13 Replies

Loading