Keywords: Dataset Distillation, Multi-modal
Abstract: Multi-modal dataset distillation (MDD) seeks to compress large-scale multi-modal data, e.g., images and text, into a compact set of synthetic pairs. Existing methods typically employ a bi-trajectory distillation framework to align the trajectories of expert and student models within each modality. Although effective, this paradigm incurs significant storage and computational overhead due to the large number of checkpoints and the need for double backpropagation, limiting its efficiency and scalability. To overcome these limitations, we propose analytic parameter matching (APM), which directly matches the analytic parameters of the modal projectors rather than the entire trajectory, offering two key advantages. First, instead of storing multiple checkpoints, APM caches only two matrices, which significantly reduces the storage budget. Second, APM avoids bi-level optimization, as the analytic parameters can be computed in a single forward pass. Theoretically, we establish the connection between these analytic parameters and matrix whitening, clarifying their benefits for MDD.
Empirically, APM achieves up to 65$\times$ storage reduction and a 9.6$\times$ distillation speedup, and scales to 1000 synthetic pairs. Extensive experiments on Flickr30k and MS-COCO demonstrate the effectiveness of APM on cross-modal retrieval tasks, e.g., 12.8 IR@1 and 17.8 TR@1 in the 100-pair setting, outperforming existing MDD methods in most scenarios.
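To make the storage and compute claims concrete, here is a minimal sketch of how analytic parameters for a linear modal projector can be obtained in closed form. The function name, shapes, and the ridge-regularized formulation are illustrative assumptions, not the paper's actual implementation; the point is that only two cached matrices (a Gram matrix and a cross-covariance) suffice, and no trajectory of checkpoints or double backpropagation is needed.

```python
import numpy as np

def analytic_projector(X, Y, lam=1e-3):
    """Closed-form (ridge-regularized) parameters W of a linear projector
    mapping features X (n, d_in) onto targets Y (n, d_out).

    Only two matrices need to be cached from the data:
    G = X.T @ X (d_in, d_in) and C = X.T @ Y (d_in, d_out) --
    no training checkpoints are stored.
    """
    G = X.T @ X  # Gram matrix of input features
    C = X.T @ Y  # cross-covariance between inputs and targets
    d = G.shape[0]
    # Regularized normal equations: (G + lam * I) W = C
    W = np.linalg.solve(G + lam * np.eye(d), C)
    return W, (G, C)

# Toy usage (hypothetical shapes): compare the analytic parameters induced
# by "real" feature pairs with those induced by a small synthetic set.
rng = np.random.default_rng(0)
X_real, Y_real = rng.normal(size=(512, 32)), rng.normal(size=(512, 16))
W_expert, cached = analytic_projector(X_real, Y_real)

X_syn, Y_syn = rng.normal(size=(100, 32)), rng.normal(size=(100, 16))
W_student, _ = analytic_projector(X_syn, Y_syn)

# A simple parameter-matching objective between the two analytic solutions;
# in distillation, X_syn / Y_syn would be optimized to minimize this.
loss = float(np.sum((W_expert - W_student) ** 2))
```

Because each solve is a single linear-system solution rather than an unrolled training run, matching such parameters avoids the bi-level optimization that trajectory matching requires.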
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 735