Large-scale Multi-Modality Pretrained Models: Applications and Experiences

Jingren Zhou

2021 (modified: 08 Nov 2022), ACM Multimedia 2021, Readers: Everyone

Abstract: In this talk, we present our experiences with large-scale multi-modality pretrained models developed at Alibaba and Ant Group, along with their applications. We first present M6 (Multi-Modality to Multi-Modality Multitask Mega-transformer) [1], a cross-modal pretraining method for unified pretraining on data from multiple modalities. We scale the model up to 1 trillion parameters [2] and build the largest Chinese pretrained model. We apply the model to a series of downstream applications and demonstrate its outstanding performance against strong baselines. Furthermore, we design a downstream task of text-guided image generation [3] and show that the finetuned M6 can create high-resolution, high-fidelity images. We also present research on, and applications of, image editing with pretrained Generative Adversarial Networks (GANs). We discover a general principle relating the underlying manifold to the generator. Based on this discovery, we propose a low-rank factorization algorithm for GANs [4] that can be used for image editing with pretrained GAN models.
