UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · CVPR 2024 · CC BY-SA 4.0
Abstract: We present UniBind, a flexible and efficient approach that learns a unified representation space for seven diverse modalities: image, text, audio, point cloud, thermal, video, and event data. Existing works, e.g., ImageBind [13], treat the image as the central modality and build an image-centered representation space; however, this space may be sub-optimal, as it leads to an unbalanced representation space among all modalities. Moreover, category names are directly used to extract text embeddings for the downstream tasks, making it difficult to represent the semantics of multi-modal data. The ‘out-of-the-box’ insight of our UniBind is to make the alignment centers modality-agnostic and further learn a unified and balanced representation space, empowered by large language models (LLMs). UniBind is superior in that it can be flexibly applied to all CLIP-style models and delivers remarkable performance boosts. To make this possible, we 1) construct a knowledge base of text with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding centers on top of the knowledge base and the encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding centers via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior arts by an average of 6.36%. Finally, we achieve new state-of-the-art performance, e.g., a 6.75% gain on ImageNet, in the multi-modal fine-tuning setting while reducing the learnable parameters by 90%.
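To make steps 2) and 3) concrete, below is a minimal sketch (not the authors' released code) of the core idea: class-wise centers are formed from LLM-generated text embeddings together with encoded visual embeddings, and embeddings from any modality are pulled toward the center of their class with an InfoNCE-style contrastive loss. The tensor shapes, the simple averaging used in place of the paper's adaptive center construction, and the temperature value are illustrative assumptions.

```python
# Sketch of LLM-augmented class centers + center-based contrastive alignment.
# Shapes and the plain averaging scheme are assumptions for illustration only.
import torch
import torch.nn.functional as F


def build_class_centers(text_emb, visual_emb):
    """Build per-class centers from LLM-augmented text and visual embeddings.

    text_emb:   (num_classes, num_descriptions, dim) embeddings of LLM-generated
                class descriptions (assumed precomputed by a frozen text encoder).
    visual_emb: (num_classes, num_samples, dim) encoded visual embeddings.
    Returns L2-normalized centers of shape (num_classes, dim).
    """
    centers = torch.cat([text_emb, visual_emb], dim=1).mean(dim=1)
    return F.normalize(centers, dim=-1)


def center_contrastive_loss(modality_emb, labels, centers, temperature=0.07):
    """InfoNCE-style loss aligning modality embeddings to their class centers.

    modality_emb: (batch, dim) embeddings from any modality encoder.
    labels:       (batch,) class indices.
    centers:      (num_classes, dim) LLM-augmented class centers.
    """
    modality_emb = F.normalize(modality_emb, dim=-1)
    logits = modality_emb @ centers.t() / temperature  # (batch, num_classes)
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    num_classes, dim = 10, 512
    text_emb = torch.randn(num_classes, 5, dim)    # e.g., 5 LLM descriptions per class
    visual_emb = torch.randn(num_classes, 8, dim)  # e.g., 8 encoded samples per class
    centers = build_class_centers(text_emb, visual_emb)

    batch = torch.randn(32, dim)                   # embeddings from, e.g., an audio encoder
    labels = torch.randint(0, num_classes, (32,))
    print(center_contrastive_loss(batch, labels, centers).item())
```

Because the centers are built from the text knowledge base rather than from a single pivot modality, every modality is aligned to the same modality-agnostic targets, which is what makes the resulting space balanced in the sense described above.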