Keywords: Audio-Visual Learning, Multimodal Learning, Efficient Machine Learning, Knowledge Distillation, Audio-Visual Classification, Audio-Visual Segmentation
Abstract: We propose a method for audio-visual knowledge distillation. Existing methods typically distill a student model from either the latent embeddings or the outputs of a teacher. The former requires matching feature dimensions, if not the same architecture, between teacher and student, while the latter supports any teacher-student pairing but tends to be less performant. In contrast, we do not explicitly distill from latent embeddings or outputs, but from the pairwise relationships between embeddings across samples for each modality; these are realized as a kernel, which is the crux of our method, "Kernelized Token Distillation (KTD)". Specifically, we tokenize and embed the input for a given modality and compute the Gram matrix across tokens, from which we distill. As the audio and visual modalities afford different information for a task, we adaptively modulate distillation by measuring the entropy of each modality, leading to an Entropy-Monitored Kernelized Token Distillation (EM-KTD) scheme. Our method allows flexibility in the complexity of the kernel function used to model relationships across tokens, which are selectively distilled to ensure high-fidelity supervision for the student. We evaluate EM-KTD on VGGSound and AVS-Bench: our student uses 94% fewer parameters than the teacher while preserving 96.9% of the teacher's performance on audio-visual event recognition and 96.5% on audio-visual segmentation.
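The core idea in the abstract can be illustrated with a minimal sketch: distill the token-wise Gram matrix (a linear kernel) rather than the embeddings themselves, with an entropy-derived weight per modality. This is not the authors' implementation; the function names, the Gram normalization, and the entropy-to-weight mapping are all illustrative assumptions, and it assumes teacher and student share the token count N while their embedding dimensions may differ (the point of distilling an N×N kernel).

```python
import numpy as np

def gram_matrix(tokens):
    """Token-wise Gram matrix (linear kernel): pairwise inner products
    across the N tokens of one modality, giving an (N, N) matrix that is
    independent of the embedding dimension."""
    g = tokens @ tokens.T
    # Normalize so teacher and student Grams are comparable even when
    # their embedding dimensions differ (an assumed choice, not the paper's).
    return g / np.linalg.norm(g)

def modality_entropy(probs, eps=1e-12):
    """Shannon entropy of one modality's class posterior; a higher-entropy
    (less informative) modality is down-weighted below."""
    return -np.sum(probs * np.log(probs + eps))

def ktd_loss(teacher_tokens, student_tokens, teacher_probs):
    """Hypothetical EM-KTD-style loss: squared Frobenius distance between
    token Gram matrices, scaled by an entropy-based confidence weight."""
    g_t = gram_matrix(teacher_tokens)  # (N, N), teacher dim D_t
    g_s = gram_matrix(student_tokens)  # (N, N), student dim D_s != D_t is fine
    # Placeholder modulation: map entropy to a weight in (0, 1].
    weight = 1.0 / (1.0 + modality_entropy(teacher_probs))
    return weight * np.sum((g_t - g_s) ** 2)
```

Because only the N×N token kernel is matched, the student's architecture and feature width are unconstrained, which is the flexibility the abstract contrasts with embedding-level distillation.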
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 4654