ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities

22 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Representation Learning; Computer Vision; Multimodal Pretraining
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We propose a scalable way to build a general representation model toward unlimited modalities.
Abstract: In this work, we propose ONE-PEACE, a highly extensible model with 4B parameters that seamlessly aligns and integrates representations across vision, audio, and language modalities. The ONE-PEACE architecture consists of shared self-attention layers, modality adapters and FFNs. This design allows for multi-modal fusion through self-attention layers, while also providing the flexibility to easily incorporate new modalities. Two modality-agnostic pretraining tasks, cross-modal aligning contrast and intra-modal denoising contrast, are developed to align the semantic space of different modalities and capture fine-grained details within each modality simultaneously. With the scaling-friendly architecture and tasks, ONE-PEACE has the potential to expand to unlimited modalities. Without utilizing any vision or language pretrained model for initialization, ONE-PEACE achieves new SOTAs across a wide range of uni-modal and cross-modal tasks. Furthermore, we show that ONE-PEACE possesses a strong emergent retrieval capability, enabling it to align modalities that are not paired in the training data.
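The abstract describes an architecture in which modality-specific adapters and FFNs surround a self-attention core that is shared across modalities, so that fusion happens in the shared layers while new modalities can be bolted on via new adapter/FFN branches. The following toy sketch is not the authors' code; the class names (`Adapter`, `SharedAttention`, `ModalityFFN`, `OnePeaceBlock`) and the trivial numeric operations are illustrative assumptions chosen only to make the wiring of shared vs. per-modality components concrete.

```python
class Adapter:
    """Modality-specific entry point: maps raw input into the shared
    embedding space (toy stand-in: scales token values)."""
    def __init__(self, scale):
        self.scale = scale
    def __call__(self, tokens):
        return [self.scale * t for t in tokens]

class SharedAttention:
    """Stand-in for the shared self-attention layers where cross-modal
    fusion would occur (toy stand-in: mixes each token with the mean)."""
    def __call__(self, tokens):
        mean = sum(tokens) / len(tokens)
        return [0.5 * t + 0.5 * mean for t in tokens]

class ModalityFFN:
    """Modality-specific FFN branch (toy stand-in: adds a bias)."""
    def __init__(self, bias):
        self.bias = bias
    def __call__(self, tokens):
        return [t + self.bias for t in tokens]

class OnePeaceBlock:
    """One block: adapters and FFNs are per-modality; attention is shared.
    Adding a modality only requires registering a new adapter/FFN pair."""
    def __init__(self, modalities):
        self.shared_attn = SharedAttention()  # single instance, shared by all
        self.adapters = {m: Adapter(s) for m, (s, b) in modalities.items()}
        self.ffns = {m: ModalityFFN(b) for m, (s, b) in modalities.items()}
    def forward(self, modality, raw_tokens):
        x = self.adapters[modality](raw_tokens)  # modality-specific entry
        x = self.shared_attn(x)                  # shared fusion layers
        return self.ffns[modality](x)            # modality-specific exit

block = OnePeaceBlock({"vision": (2.0, 0.1), "audio": (1.0, 0.2)})
out = block.forward("vision", [1.0, 3.0])
```

The point of the structure is that `shared_attn` is one object reused by every modality, while each modality gets its own dictionary entry; extending to a new modality means adding one key, not retraining the shared core's interface.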
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4516