Partition-Losses Fine-Tuning: Contamination-Robust Backdoor Unlearning

ICLR 2026 Conference Submission 18450 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Backdoor attacks, Data poisoning, Backdoor defense, Fine-tuning
TL;DR: We propose Partition-Losses Fine-Tuning (PL), a contamination-robust defense that unlearns backdoors using half as much clean tuning data as state-of-the-art methods.
Abstract: Large-scale training data and third-party checkpoints make training convenient but also leave room for poisoning-based backdoor attacks. These attacks embed a backdoor through data poisoning in the training set: the infected model behaves normally on clean inputs but predicts an attacker-chosen label whenever the trigger appears. This stealthiness poses risks for security-sensitive deployment and model reuse. Post-training fine-tuning has become a practical default defense because it is computationally efficient and does not require control over the original training pipeline. However, existing fine-tuning methods rely on a clean set to unlearn the backdoor indirectly. This assumption is fragile in practice: curation errors or undetected triggers can contaminate the "clean" set. As a result, state-of-the-art clean-only fine-tuning often fails to purify the backdoor behavior while maintaining the original functionality. We propose Partition-Losses Fine-Tuning (PL), a simple, architecture- and domain-agnostic loss modification that leverages both clean and flagged malicious samples. PL jointly minimizes the benign loss and maximizes the target-class loss, explicitly pushing the model away from the implanted trigger-to-target association. Comprehensive experiments show that PL matches or surpasses clean-only fine-tuning methods under the same computational budget while halving the number of required clean samples. Crucially, PL remains effective under realistic contamination of both the clean and flagged fine-tuning sets and is stable across hyperparameter choices and data availability.
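To illustrate the loss modification described in the abstract, here is a minimal sketch of a partition-losses training step, assuming a standard PyTorch setup. The function name, the weighting factor `lam`, and the batch layout are illustrative assumptions, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def partition_losses_step(model, optimizer, clean_batch, flagged_batch, lam=1.0):
    """One fine-tuning step: minimize loss on clean data while maximizing
    loss on flagged samples with respect to the attacker's target class."""
    x_clean, y_clean = clean_batch    # clean inputs and their true labels
    x_flag, y_target = flagged_batch  # flagged inputs and the backdoor target label

    optimizer.zero_grad()

    # Benign term: preserve the original functionality on clean data.
    benign_loss = F.cross_entropy(model(x_clean), y_clean)

    # Unlearning term: push the model away from predicting the target class
    # on flagged (suspected poisoned) inputs by ascending this loss.
    target_loss = F.cross_entropy(model(x_flag), y_target)

    # Jointly minimize the benign loss and maximize the target-class loss.
    total = benign_loss - lam * target_loss
    total.backward()
    optimizer.step()
    return benign_loss.item(), target_loss.item()
```

In this sketch, `lam` trades off clean-accuracy preservation against backdoor unlearning; the paper's experiments presumably tune this balance, so the value used here is only a placeholder.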
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18450