Keywords: Bias, Trustworthiness, Model Repair, LLM
TL;DR: Two-stage repair for fine-tuned LLMs: identify harmful samples, select a diverse subset with DPP, and apply a constrained PBRF update. Improves trustworthiness by up to 19\% with minimal ($\leq1\%$) perplexity cost, in seconds.
Abstract: Supervised fine-tuning (SFT) improves the perplexity of a large language model (LLM), but can also degrade its trustworthiness, leading to the generation of untruthful, biased, or unsafe content during user interactions. These failures are often traced to specific phrases or patterns in the training data, yet correcting them usually requires expensive retraining or new data collection. In this work, we propose a two-stage, compute-efficient repair of post-SFT models that enhances trustworthiness while preserving downstream performance. In the first stage, we identify the training samples responsible for failures on truthfulness, stereotypical bias, and machine ethics. To enable efficient repair, we then select a small and diverse subset of these samples using determinantal point process (DPP)-based regularization. In the second stage, we repair the model under the framework of the Proximal Bregman Response Function (PBRF) using a gradient ascent-based parameter update, which enhances trustworthiness while preserving perplexity on the downstream task. We evaluate our method on multiple LLMs of varying sizes and demonstrate up to 19\% improvement in trustworthiness metrics with minimal impact ($\leq1\%$) on perplexity. Our method repairs fine-tuned models within seconds, offering a practical alternative to the hours of retraining otherwise required for model repair.
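A minimal sketch of the two-stage pipeline described above, under stated assumptions: it assumes per-sample feature vectors (e.g., gradient or embedding features) for the flagged harmful examples, a HuggingFace-style causal LM whose forward pass returns a `.loss`, and hypothetical helper names (`dpp_greedy_select`, `repair_step`). The proximity constraint of the full PBRF update is not shown; the second function illustrates only a plain gradient-ascent step on the selected samples, not the authors' constrained update.

```python
import torch

def dpp_greedy_select(features, k):
    """Greedy MAP inference for a DPP with kernel L = F F^T.
    Returns indices of k samples that are mutually diverse."""
    # Normalize rows so kernel entries are cosine similarities.
    F = torch.nn.functional.normalize(features, dim=1)
    L = F @ F.T
    selected, remaining = [], list(range(L.shape[0]))
    for _ in range(k):
        best, best_gain = None, -float("inf")
        for i in remaining:
            idx = selected + [i]
            # Log-determinant of the selected submatrix rewards diversity.
            sub = L[idx][:, idx] + 1e-6 * torch.eye(len(idx), dtype=L.dtype)
            gain = torch.logdet(sub).item()
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
        remaining.remove(best)
    return selected

def repair_step(model, harmful_batch, lr=1e-5):
    """One gradient-ascent update on the selected harmful samples:
    increase their loss so the model moves away from them.
    (The PBRF proximity term that preserves perplexity is omitted.)"""
    model.train()
    loss = model(**harmful_batch).loss  # batch must include labels
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.add_(lr * p.grad)  # ascent: step *up* the loss surface
                p.grad.zero_()
```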
Submission Number: 48