Keywords: Knowledge Distillation, Bias Mitigation, Spurious Correlation
Abstract: Knowledge distillation (KD) is an effective method for model compression and for transferring knowledge between models. However, its effect on a model's robustness against spurious correlations, shortcuts, and task-irrelevant features that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on natural language inference (NLI) and image classification tasks, with a focus on the transferability of ``debiasing'' capabilities from teacher models to student models. Through extensive experiments, we establish several key findings: (i) the effect of KD on debiasing performance depends on the underlying debiasing method, the relative scale of the models involved, and the size of the training set; (ii) KD effectively transfers debiasing capabilities when the teacher and student are similar in scale (number of parameters); (iii) KD may amplify the student model's reliance on spurious features, and this effect does not diminish as the teacher model scales up; and (iv) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases. Given these findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models.
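For reference, a minimal sketch of the standard knowledge distillation objective that such studies typically build on (the function name, temperature, and weighting coefficient below are illustrative assumptions, not details taken from this submission):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard KD objective: weighted sum of a soft-target KL term
    and the usual hard-label cross-entropy term."""
    # Soften both output distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL term is scaled by T^2 so its gradient magnitude stays
    # comparable to the cross-entropy term (Hinton et al., 2015).
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```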
Submission Number: 86