Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks

Anonymous

17 Jun 2023, ACL ARR 2023 June Blind Submission
Abstract: Foundation models, or pre-trained models, have substantially improved the performance of various language, vision, and vision-language understanding tasks. However, existing foundation models typically perform best on only one type of task, namely language, vision, or vision-language. Whether a single general foundation model can achieve the best performance across all of these understanding tasks remains an open question. In this paper, we propose a new method for training such a general foundation model, XFM (the X-Foundation Model). XFM consists of a language encoder, a vision encoder, and a fusion encoder, and is trained with a new method that includes two techniques for learning from text, image, and image-text pair data. The first stops gradients from the vision-language training when learning the language encoder; the second leverages the vision-language training to guide the learning of the vision encoder. Extensive experiments on benchmark datasets show that XFM significantly outperforms existing general foundation models and performs better than or comparably to foundation models designed specifically for language, vision, or vision-language understanding.
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
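The abstract describes two training techniques: a stop-gradient that prevents the vision-language objective from updating the language encoder, and a guidance signal from the vision-language training for the vision encoder. Below is a minimal, hypothetical PyTorch sketch of how one training step could be wired up; the toy encoders, losses, and the distillation-style guidance target are placeholder assumptions for illustration, not the paper's actual XFM implementation.

```python
# Minimal, runnable sketch of the two training techniques described in the
# abstract. All module names, losses, and shapes are illustrative placeholders,
# not the paper's actual architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 32  # toy feature dimension

class ToyEncoder(nn.Module):
    """Stand-in for the language, vision, or fusion encoder."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D, D)

    def forward(self, x):
        return self.proj(x)

lang_enc, vis_enc, fusion_enc = ToyEncoder(), ToyEncoder(), ToyEncoder()
itm_head = nn.Linear(D, 2)  # toy image-text matching head
optimizer = torch.optim.AdamW(
    list(lang_enc.parameters()) + list(vis_enc.parameters())
    + list(fusion_enc.parameters()) + list(itm_head.parameters()),
    lr=1e-4,
)

# Toy batches standing in for text data and image-text pair data.
text_batch = torch.randn(8, D)
pair_text, pair_image = torch.randn(8, D), torch.randn(8, D)
match_labels = torch.randint(0, 2, (8,))

# 1) Unimodal language objective (e.g., masked language modeling in practice),
#    sketched here as a simple reconstruction loss that updates lang_enc.
lm_loss = F.mse_loss(lang_enc(text_batch), text_batch)

# 2) Vision-language objective: detach the text features so gradients from the
#    fusion loss do NOT flow back into the language encoder (technique 1).
text_feats = lang_enc(pair_text).detach()   # stop-gradient
image_feats = vis_enc(pair_image)
vl_loss = F.cross_entropy(itm_head(fusion_enc(text_feats + image_feats)), match_labels)

# 3) Vision objective guided by the vision-language training (technique 2),
#    sketched as pushing vision features toward targets from the fusion encoder.
with torch.no_grad():
    guide_target = fusion_enc(text_feats + image_feats)
guide_loss = F.mse_loss(image_feats, guide_target)

(lm_loss + vl_loss + guide_loss).backward()
optimizer.step()
```

The key point of the sketch is the gradient routing: `.detach()` blocks gradients from the fusion loss into the language encoder, while `torch.no_grad()` turns the fusion output into a fixed target so the guidance loss trains only the vision encoder.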