Keywords: Knowledge Distillation, Bias Mitigation, Spurious Correlation
Abstract: Knowledge distillation (KD) is an effective method for model compression and for transferring knowledge between models. However, its effect on a model's robustness against spurious correlations, shortcuts, and task-irrelevant features that degrade performance on out-of-distribution data remains underexplored. This study investigates the effect of knowledge distillation on natural language inference (NLI) and image classification tasks, with a focus on the transferability of ``debiasing'' capabilities from teacher models to student models. Through extensive experiments, we establish several key findings: (i) the effect of KD on debiasing performance depends on the underlying debiasing method, the relative scale of the models involved, and the size of the training set; (ii) KD effectively transfers debiasing capabilities when the teacher and student are similar in scale (number of parameters); (iii) KD may amplify the student model's reliance on spurious features, and this effect does not diminish as the teacher model scales up; and (iv) although the overall robustness of a model may remain stable post-distillation, significant variations can occur across different types of biases. Given these findings, we propose three effective solutions to improve the distillability of debiasing methods: developing high-quality data for augmentation, implementing iterative knowledge distillation, and initializing student models with weights obtained from teacher models.
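For reference, a minimal sketch of the standard knowledge distillation objective that such studies typically build on (the function name, temperature, and weighting coefficient below are illustrative assumptions, not details taken from this submission):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard KD objective: weighted sum of a soft-target KL term
    and the usual hard-label cross-entropy term."""
    # Soften both output distributions with the temperature.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    # KL term is scaled by T^2 so its gradient magnitude stays
    # comparable to the cross-entropy term (Hinton et al., 2015).
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1 - alpha) * ce_term
```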
Submission Number: 86