Abstract: In this article, we introduce \textit{audio-visual dataset distillation}, the task of constructing a smaller yet representative synthetic audio-visual dataset that maintains the cross-modal semantic association between the audio and visual modalities. Dataset distillation techniques have primarily focused on image classification. However, with the growing capabilities of audio-visual models and the vast datasets required for their training, it is necessary to explore distillation methods beyond the visual modality. Our approach builds upon the foundation of Distribution Matching (DM), extending it to handle the unique challenges of audio-visual data. A key challenge is to jointly learn synthetic data that distills both the modality-wise information and the natural alignment from real audio-visual data. We introduce a vanilla audio-visual distribution matching framework that separately trains visual-only and audio-only DM components, enabling us to investigate the effectiveness of audio-visual integration and various multimodal fusion methods. To address the limitations of unimodal distillation, we propose two novel matching losses: implicit cross-matching and cross-modal gap matching. These losses work in conjunction with the vanilla unimodal distribution matching loss to enforce cross-modal alignment and enhance the audio-visual dataset distillation process. Extensive audio-visual classification and retrieval experiments on four audio-visual datasets, AVE, MUSIC-21, VGGSound, and VGGSound-10K, demonstrate the effectiveness of our proposed matching approaches and validate the benefits of audio-visual integration with condensed data. This work establishes a new frontier in audio-visual dataset distillation, paving the way for further advancements in this exciting field. \textit{Our source code and pre-trained models will be released}.
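To make the structure of such an objective concrete, the sketch below shows a minimal PyTorch-style illustration of distribution matching extended to two modalities: a per-modality mean-embedding matching term plus two cross-modal terms. The encoder names (`f_a`, `f_v`), loss weights (`w_icm`, `w_cgm`), and the specific forms given for the implicit cross-matching and cross-modal gap matching terms are illustrative assumptions, not the paper's definitions; the actual ICM and CGM losses are specified in the paper.

```python
import torch
import torch.nn.functional as F


def dm_loss(real_feat: torch.Tensor, syn_feat: torch.Tensor) -> torch.Tensor:
    """Vanilla distribution matching: squared distance between mean embeddings."""
    return ((real_feat.mean(dim=0) - syn_feat.mean(dim=0)) ** 2).sum()


def audio_visual_matching_loss(f_a, f_v, real_audio, real_video,
                               syn_audio, syn_video,
                               w_icm: float = 1.0, w_cgm: float = 1.0) -> torch.Tensor:
    """Illustrative combined objective; NOT the paper's exact formulation.

    f_a / f_v are audio / visual feature extractors (assumed to share an
    embedding dimension); w_icm / w_cgm are assumed loss weights.
    """
    ra, sa = f_a(real_audio), f_a(syn_audio)   # real / synthetic audio embeddings [N, D]
    rv, sv = f_v(real_video), f_v(syn_video)   # real / synthetic visual embeddings [N, D]

    # Per-modality (unimodal) distribution matching.
    loss_uni = dm_loss(ra, sa) + dm_loss(rv, sv)

    # One plausible "implicit cross-matching" term: match synthetic audio against
    # real visual statistics and vice versa, coupling the two modalities.
    loss_icm = dm_loss(rv, sa) + dm_loss(ra, sv)

    # One plausible "cross-modal gap matching" term: keep the synthetic
    # audio-visual gap close to the real audio-visual gap.
    gap_real = ra.mean(dim=0) - rv.mean(dim=0)
    gap_syn = sa.mean(dim=0) - sv.mean(dim=0)
    loss_cgm = F.mse_loss(gap_syn, gap_real)

    return loss_uni + w_icm * loss_icm + w_cgm * loss_cgm
```

In a standard DM-style distillation loop, `syn_audio` and `syn_video` would be learnable tensors (with `requires_grad=True`) updated by backpropagating this loss through randomly initialized encoders.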
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
- Updated the "While dataset distillation techniques..." sentence in the abstract.
- Updated the related work section to elaborate on the differences from a previous method.
- Added a line in Section 4.1 about using whole-data performance as a reference.
- Updated Section 5 to highlight the limitations of dataset distillation on large datasets.
- Removed "upper bound"-related mentions in Table 3, Table 6, Appendix C, and Appendix F.
- Added additional ablation study experiments in Table 4 (right).
- Added additional detail about the out-of-memory issue of MTT in the Section 4.2 paragraph "Comparison with Data Distillation Baselines".
- Updated Figure 3 to show the learnable components by adding backpropagation to the synthetic audio and visual data.
- Updated the loss names from "Joint matching (JM)" and "modality gap matching (MGM)" to "Implicit cross-matching (ICM)" and "cross-modal gap matching (CGM)", respectively, to make them more intuitive.
- Removed redundant usages of the phrases "alignment" and "cross-modal matching".
- Fixed a typo in Appendix Algorithm 1: changed L to $\mathcal{L}$.
- Fixed Appendix Section E: removed "Retrieval results on VGGSound and Music-21 datasets are shown in Tab. ??", as these results are already present in Table 6.
Assigned Action Editor: ~Charles_Xu1
Submission Number: 2968