Abstract: Safety alignment of Large Reasoning Models (LRMs) often exhibits a safety tax: improving safety and refusal behavior can degrade general capabilities.
We study a data-centric hypothesis: the tax is largely driven by distribution shift (proxied by perplexity under the base model) and safety constraints (proxied by toxicity of base-model generations).
We propose LOMO-CV, a lightweight and gradient-free pipeline that learns an interpretable linear scorer over dataset descriptors and uses it to rank and filter safety data.
LOMO-CV is trained from benchmark-induced pairwise dataset preferences pooled across multiple model families via leave-one-model-out cross-validation.
On DeepSeek-R1-Distill-Qwen-7B, LOMO-CV improves the Math Score by 6.66 points over the best single-dataset baseline while maintaining a near-maximal Safety Score (99.0).
Across five held-out models, LOMO-CV yields a Safety Score within 1 point of the best baseline in four of the five cases, whereas capability retention is model-dependent. These results highlight both the promise and the current failure modes of data-only safety tax reduction.
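
As an illustration of the pipeline described above, the following is a minimal sketch of a linear scorer over two dataset descriptors, fitted from pairwise dataset preferences and evaluated with leave-one-model-out cross-validation. It is not the authors' implementation: the descriptor values, the preference pairs, the model names, and the grid search over weight directions (one possible gradient-free fit) are all assumptions made for illustration only.

```python
import math

# Hypothetical dataset descriptors (made-up values):
# (log-perplexity under the base model, toxicity of base-model generations)
DESCRIPTORS = {
    "A": (1.0, 0.10),
    "B": (2.0, 0.30),
    "C": (3.0, 0.20),
    "D": (1.5, 0.60),
}

# Hypothetical benchmark-induced preferences per model family (made up):
# each pair is (preferred_dataset, other_dataset).
PREFERENCES = {
    "model_1": [("A", "B"), ("A", "D"), ("B", "C")],
    "model_2": [("A", "C"), ("D", "B"), ("D", "C")],
    "model_3": [("A", "D"), ("D", "B"), ("B", "C")],
}


def score(w, name):
    """Linear score of a dataset: dot product of weights and descriptors."""
    return sum(wi * xi for wi, xi in zip(w, DESCRIPTORS[name]))


def pairwise_accuracy(w, pairs):
    """Fraction of pairs where the preferred dataset gets the higher score."""
    hits = sum(score(w, a) > score(w, b) for a, b in pairs)
    return hits / len(pairs)


def fit_direction(pairs, steps=72):
    """Gradient-free fit (an assumption of this sketch): try unit-norm
    weight directions on a grid and keep the first one with the best
    pairwise accuracy on the training pairs."""
    best_w, best_acc = (1.0, 0.0), -1.0
    for k in range(steps):
        theta = 2.0 * math.pi * k / steps
        w = (math.cos(theta), math.sin(theta))
        acc = pairwise_accuracy(w, pairs)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w


def leave_one_model_out(preferences):
    """Fit on pairs pooled from all models but one; test on the held-out one."""
    held_out_acc = {}
    for held_out in preferences:
        train = [p for m, ps in preferences.items() if m != held_out for p in ps]
        w = fit_direction(train)
        held_out_acc[held_out] = pairwise_accuracy(w, preferences[held_out])
    return held_out_acc


cv_accuracy = leave_one_model_out(PREFERENCES)

# Final scorer fitted on all pooled pairs, then used to rank and filter.
all_pairs = [p for ps in PREFERENCES.values() for p in ps]
w_final = fit_direction(all_pairs)
ranking = sorted(DESCRIPTORS, key=lambda name: score(w_final, name), reverse=True)
kept = ranking[:2]  # keep the two top-ranked datasets

print(cv_accuracy, ranking, kept)
```

On a toy set this small, many weight directions fit the pairs perfectly, so the fitted weights themselves should not be over-interpreted. The sketch is meant only to show the shape of the procedure: pool preferences across models, fit a two-weight scorer, check it on a held-out model, then rank and filter datasets by score.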