Towards Tax-Free Safety Alignment for Large Reasoning Models

Published: 25 Feb 2026, Last Modified: 25 Feb 2026, OpenReview Archive Direct Upload, Everyone, CC BY 4.0
Abstract: Safety alignment of Large Reasoning Models (LRMs) often exhibits a safety tax: improving safety and refusal behavior can degrade general capabilities. We study a data-centric hypothesis: the tax is largely driven by distribution shift (proxied by perplexity under the base model) and safety constraints (proxied by toxicity of base-model generations). We propose LOMO-CV, a lightweight and gradient-free pipeline that learns an interpretable linear scorer over dataset descriptors and uses it to rank and filter safety data. LOMO-CV is trained from benchmark-induced pairwise dataset preferences pooled across multiple model families via leave-one-model-out cross-validation. On DeepSeek-R1-Distill-Qwen-7B, LOMO-CV improves Math Score by +6.66 points over the best single-dataset baseline while maintaining near-maximal Safety Score (99.0). Across five holdout models, LOMO-CV yields Safety Score within 1 point of the best baseline in 4/5 cases, while capability retention is model-dependent, highlighting both the promise and current failure modes of data-only safety tax reduction.
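
To make the leave-one-model-out idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each dataset is described by two standardized descriptors (perplexity and toxicity, the two proxies named in the abstract) and that each model contributes a list of (preferred, dispreferred) dataset index pairs. The descriptor values, the model names, the preference pairs, and the angle-sweep fitting routine are all hypothetical stand-ins chosen for illustration; the abstract does not specify how the linear scorer is actually fit.

```python
import numpy as np

def pairwise_accuracy(w, X, pairs):
    """Fraction of (preferred, dispreferred) pairs that the scorer orders correctly."""
    if not pairs:
        return 0.0
    s = X @ w
    return float(np.mean([s[a] > s[b] for a, b in pairs]))

def fit_scorer(X, pairs, n_angles=360):
    """Gradient-free fit (assumed stand-in): sweep unit vectors in 2-D, keep the best one."""
    best_w, best_acc = None, -1.0
    for theta in np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False):
        w = np.array([np.cos(theta), np.sin(theta)])
        acc = pairwise_accuracy(w, X, pairs)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w

def lomo_cv(X, prefs_by_model):
    """Fit on pairs pooled from all other models, then score the held-out model's pairs."""
    results = {}
    for held_out in prefs_by_model:
        train = [p for m, ps in prefs_by_model.items() if m != held_out for p in ps]
        w = fit_scorer(X, train)
        results[held_out] = pairwise_accuracy(w, X, prefs_by_model[held_out])
    return results

# Toy data (hypothetical): one row per dataset, columns = [perplexity z-score, toxicity z-score].
X = np.array([[-1.0, -0.5], [0.5, -1.0], [1.0, 1.0], [-0.5, 0.5]])

# Toy benchmark-induced preferences (hypothetical): (preferred, dispreferred) dataset indices.
prefs_by_model = {
    "model_a": [(0, 2), (1, 2), (0, 3)],
    "model_b": [(0, 2), (3, 2), (0, 1)],
    "model_c": [(0, 2), (1, 2), (3, 2)],
}

results = lomo_cv(X, prefs_by_model)

# Rank datasets with a scorer fit on all pooled pairs; the top entries would be kept.
all_pairs = [p for ps in prefs_by_model.values() for p in ps]
w_all = fit_scorer(X, all_pairs)
ranking = np.argsort(-(X @ w_all))
```

Here `results` maps each held-out model to the pairwise agreement of a scorer trained without it, and `ranking` orders the datasets from highest to lowest score. Because the scorer is linear, the sign and size of each weight in `w_all` can be read directly, which is what makes this kind of scorer interpretable.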