Learning Task-Agnostic Representations through Multi-Teacher Distillation

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: molecular representation, nlp, knowledge distillation, embedding models, representation learning
TL;DR: We show that mutual-information-estimation-based methods produce better distilled embedders in multi-teacher distillation settings than MSE- or cosine-based methods.
Abstract: Casting complex inputs into tractable representations is a critical step across various fields. Diverse embedding models emerge from differences in architectures, loss functions, input modalities, and datasets, each capturing unique aspects of the input. Multi-teacher distillation leverages this diversity to enrich representations but often remains tailored to specific tasks. We introduce a task-agnostic framework based on a "majority vote" objective function. We demonstrate that this objective is bounded by the mutual information between the student's and the teachers' embeddings, leading to a task-agnostic distillation loss that eliminates dependence on task-specific labels or prior knowledge. Comprehensive evaluations across text, vision, and molecular modeling show that our method effectively leverages teacher diversity, yielding representations that enable better performance on a wide range of downstream tasks such as classification, clustering, and regression. Additionally, we train and release state-of-the-art embedding models, enhancing downstream performance in various modalities.
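The abstract describes a distillation loss built from a mutual-information bound between student and teacher embeddings rather than from task labels. Below is a minimal sketch of that idea, assuming an InfoNCE-style lower bound on mutual information with in-batch negatives and per-teacher projection heads; the names (ProjectionHead, infonce_lower_bound, temperature) and the exact objective are illustrative assumptions, not the paper's implementation.

```python
# Sketch (not the authors' code): task-agnostic multi-teacher distillation
# that maximizes an InfoNCE-style lower bound on I(student; teacher_k)
# for each teacher k, without any task-specific labels.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProjectionHead(nn.Module):
    """Maps student embeddings into one teacher's embedding space (assumed design)."""
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(student_dim, teacher_dim),
            nn.ReLU(),
            nn.Linear(teacher_dim, teacher_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def infonce_lower_bound(student_proj: torch.Tensor,
                        teacher_emb: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE estimate: in-batch negatives give a lower bound on mutual information."""
    s = F.normalize(student_proj, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.T / temperature                      # (batch, batch) similarities
    labels = torch.arange(s.size(0), device=s.device)   # matching pairs on the diagonal
    # Negative cross-entropy against the diagonal is the (shifted) InfoNCE bound.
    return -F.cross_entropy(logits, labels)


def multi_teacher_distillation_loss(student_emb: torch.Tensor,
                                    teacher_embs: list[torch.Tensor],
                                    heads: nn.ModuleList) -> torch.Tensor:
    """Average the negated per-teacher MI bounds; minimizing this pulls the
    student toward agreement with every teacher, label-free."""
    loss = student_emb.new_zeros(())
    for head, t_emb in zip(heads, teacher_embs):
        loss = loss - infonce_lower_bound(head(student_emb), t_emb)
    return loss / len(teacher_embs)
```

One projection head per teacher (an assumption here) absorbs differences in teacher embedding dimensionality and geometry, so the student representation itself stays shared across all teachers and downstream tasks.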
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 23194