ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Representation Learning; Computer Vision; Multimodal Pretraining
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a scalable way to build a general representation model toward unlimited modalities.
Abstract: In this work, we propose ONE-PEACE, a highly extensible model with 4B parameters that seamlessly aligns and integrates representations across vision, audio, and language modalities. The ONE-PEACE architecture consists of shared self-attention layers, modality adapters and FFNs. This design allows for multi-modal fusion through self-attention layers, while also providing the flexibility to easily incorporate new modalities. Two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, are developed to align the semantic space of different modalities and capture fine-grained details within each modality simultaneously. With the scaling-friendly architecture and tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without utilizing any vision or language pretrained model for initialization, ONE-PEACE achieves new SOTAs across a wide range of uni-modal and cross-modal tasks. Furthermore, we show that ONE-PEACE possesses a strong emergent retrieval capability, enabling it to align modalities that are not paired in the training data.
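The abstract describes an architecture in which modality-specific adapters and FFNs surround a self-attention core that is shared across modalities, so that fusion happens in the shared layers while new modalities can be bolted on via new adapter/FFN branches. The following toy sketch is not the authors' code; the class names (`Adapter`, `SharedAttention`, `ModalityFFN`, `OnePeaceBlock`) and the trivial numeric operations are illustrative assumptions chosen only to make the wiring of shared vs. per-modality components concrete.

```python
class Adapter:
    """Modality-specific entry point: maps raw input into the shared
    embedding space (toy stand-in: scales token values)."""
    def __init__(self, scale):
        self.scale = scale
    def __call__(self, tokens):
        return [self.scale * t for t in tokens]

class SharedAttention:
    """Stand-in for the shared self-attention layers where cross-modal
    fusion would occur (toy stand-in: mixes each token with the mean)."""
    def __call__(self, tokens):
        mean = sum(tokens) / len(tokens)
        return [0.5 * t + 0.5 * mean for t in tokens]

class ModalityFFN:
    """Modality-specific FFN branch (toy stand-in: adds a bias)."""
    def __init__(self, bias):
        self.bias = bias
    def __call__(self, tokens):
        return [t + self.bias for t in tokens]

class OnePeaceBlock:
    """One block: adapters and FFNs are per-modality; attention is shared.
    Adding a modality only requires registering a new adapter/FFN pair."""
    def __init__(self, modalities):
        self.shared_attn = SharedAttention()  # single instance, shared by all
        self.adapters = {m: Adapter(s) for m, (s, b) in modalities.items()}
        self.ffns = {m: ModalityFFN(b) for m, (s, b) in modalities.items()}
    def forward(self, modality, raw_tokens):
        x = self.adapters[modality](raw_tokens)  # modality-specific entry
        x = self.shared_attn(x)                  # shared fusion layers
        return self.ffns[modality](x)            # modality-specific exit

block = OnePeaceBlock({"vision": (2.0, 0.1), "audio": (1.0, 0.2)})
out = block.forward("vision", [1.0, 3.0])
```

The point of the structure is that `shared_attn` is one object reused by every modality, while each modality gets its own dictionary entry; extending to a new modality means adding one key, not retraining the shared core's interface.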
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4516