EBind: a Practical Approach to Space Binding

18 Sept 2025 (modified: 18 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: representation learning, contrastive learning, space binding, zero-shot learning, image retrieval, video retrieval, 3D point cloud retrieval, joint embedding
TL;DR: EBind efficiently binds image-text-video-audio-3D embedding spaces using 4-17x fewer parameters while remaining on par with SOTA on many benchmarks — all trainable on a single GPU within hours.
Abstract: We simplify space binding by focusing on two core components: a single encoder per modality and high-quality data. This enables training state-of-the-art models on a single GPU in a few hours rather than multiple days. We present EBind, an Easy, data-centric, and parameter-efficient method to Bind the embedding spaces of multiple contrastive models. We demonstrate that a simple 1.8B-parameter image-text-video-audio-3D model can outperform models 4 to 17x its size. The key to achieving this is a carefully curated dataset drawn from three complementary data sources: i) 6.7M fully-automated multimodal quintuples sourced via SOTA retrieval models, ii) 1M diverse, semi-automated triples annotated by humans as negative, partial, or positive matches, and iii) 3.4M pre-existing captioned data items. We use 13 different evaluations to demonstrate the value of each data source. Due to limitations of existing benchmarks, we further introduce the first high-quality, consensus-annotated zero-shot classification benchmark between audio and 3D point clouds. In contrast to related work, we will open-source our code, model weights, _and_ the datasets.
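To make the abstract's core idea concrete, below is a minimal, hypothetical sketch of space binding in the general sense described: outputs of a frozen per-modality encoder are mapped by a lightweight projection head into a shared embedding space and aligned to another modality with a symmetric InfoNCE loss. The class names, dimensions, and loss choice are illustrative assumptions, not EBind's actual architecture or training recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProjectionHead(nn.Module):
    """Maps a frozen encoder's output into the shared (bound) embedding space.
    Hypothetical component; EBind's real heads may differ."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so dot products are cosine similarities.
        return F.normalize(self.proj(x), dim=-1)

def info_nce(z_a: torch.Tensor, z_b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of aligned, normalized embeddings."""
    logits = z_a @ z_b.t() / temperature                      # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Illustrative usage: bind audio features to a fixed image-text space.
# The random tensors stand in for frozen per-modality encoder outputs.
audio_head = ProjectionHead(in_dim=1024, out_dim=512)
audio_feats = torch.randn(32, 1024)                           # placeholder audio encoder outputs
image_feats = F.normalize(torch.randn(32, 512), dim=-1)       # placeholder anchor-space embeddings

loss = info_nce(audio_head(audio_feats), image_feats)
loss.backward()
```

Because only the small projection heads are trained while the per-modality encoders stay frozen, this style of binding keeps the trainable parameter count low, which is consistent with the single-GPU, few-hour training budget the abstract claims.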
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 10668