ReBiSA: Data Reweighting with Bilevel Optimization for Safety Alignment

ICLR 2026 Conference Submission 16540 Authors

19 Sept 2025 (modified: 08 Oct 2025), CC BY 4.0
Keywords: data reweighting, safety alignment, bilevel optimization, large language models
TL;DR: A bilevel data reweighting approach that assigns higher weights to safe samples for improving LLM safety alignment.
Abstract: Ensuring safety in large language models (LLMs) is a critical yet challenging task, since existing alignment approaches typically depend on costly human feedback, reinforcement learning, or large auxiliary models. We propose \textbf{ReBiSA}, a bilevel optimization-based data reweighting framework that provides a lightweight and transferable approach to safety alignment. ReBiSA employs a multi-layer perceptron (MLP) reweighting network that maps training losses to adaptive weights, which are updated using safety signals from a validation set. This enables the model to automatically emphasize safe data while down-weighting unsafe data during fine-tuning. Unlike prior methods that assign individual parameters to samples or rely heavily on auxiliary models, ReBiSA achieves both efficiency and transferability. Experiments on safety alignment benchmarks show that ReBiSA consistently improves safety performance over baselines, while being scalable to larger datasets and diverse model backbones.
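The bilevel reweighting idea in the abstract can be sketched in a toy setting. The following is a minimal illustration, not the paper's implementation: it uses a linear regression "model" on synthetic data, a one-hidden-layer MLP that maps per-sample training losses to weights in (0, 1), and a finite-difference hypergradient through one unrolled inner step as a cheap stand-in for whatever bilevel solver ReBiSA actually uses. The validation set here is just a clean regression target, standing in for the paper's safety signals; all names (`sample_weights`, `inner_step`, `HIDDEN`) are invented for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the fine-tuning data and the validation set
# (the paper derives the outer objective from safety signals; here we
# just use a clean held-out regression loss for illustration).
X_tr = rng.normal(size=(64, 5))
true_beta = np.array([1.0, -1.0, 0.5, 0.0, 2.0])
y_tr = X_tr @ true_beta + rng.normal(scale=0.1, size=64)
X_val = rng.normal(size=(32, 5))
y_val = X_val @ true_beta

# Reweighting network: one hidden layer, loss in -> weight out.
HIDDEN = 4
theta = [rng.normal(scale=0.1, size=HIDDEN),  # w1
         np.zeros(HIDDEN),                    # b1
         rng.normal(scale=0.1, size=HIDDEN),  # w2
         np.zeros(1)]                         # b2

def sample_weights(losses, theta):
    """MLP mapping per-sample losses to adaptive weights in (0, 1)."""
    w1, b1, w2, b2 = theta
    h = np.tanh(losses[:, None] * w1 + b1)        # (n, HIDDEN)
    return 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))   # sigmoid -> (n,)

def inner_step(beta, theta, lr=0.1):
    """One gradient step on the model under the reweighted training loss."""
    resid = X_tr @ beta - y_tr
    w = sample_weights(resid ** 2, theta)
    grad = X_tr.T @ (2.0 * w * resid) / len(y_tr)
    return beta - lr * grad

def val_loss(beta):
    """Outer objective: held-out loss (stand-in for a safety signal)."""
    return float(np.mean((X_val @ beta - y_val) ** 2))

beta = np.zeros(5)
eps, outer_lr = 1e-4, 0.5
for _ in range(50):
    # Outer update: finite-difference hypergradient of the validation
    # loss through one unrolled inner step on the model.
    base = val_loss(inner_step(beta, theta))
    grads = []
    for p in theta:
        g = np.zeros_like(p)
        for j in range(p.size):
            p.flat[j] += eps
            g.flat[j] = (val_loss(inner_step(beta, theta)) - base) / eps
            p.flat[j] -= eps
        grads.append(g)
    theta = [p - outer_lr * g for p, g in zip(theta, grads)]
    # Inner update: fine-tune the model with the current reweighter.
    beta = inner_step(beta, theta)

print(val_loss(beta))
```

In the real method the inner problem is LLM fine-tuning and the hypergradient is not computed by finite differences, but the loop structure (inner weighted update, outer reweighter update from validation signals) is the same; because the reweighter conditions only on the loss value rather than on per-sample parameters, it transfers to new datasets, as the abstract emphasizes.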
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 16540