Keywords: benchmarks, large reasoning model, safety, risk awareness
Abstract: Existing safety evaluations primarily assess response-level safety, leaving
reasoning-level risks unmeasured. Despite the remarkable proficiency of Large
Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in
safety-critical scenarios remains uncertain. We identify Superficial Safety
Alignment (SSA): a phenomenon in which models produce superficially safe outputs
while their internal reasoning fails to genuinely detect and mitigate underlying
risks, creating a dangerous illusion of safety and rendering systems prone to
catastrophic failure under minor perturbations. To systematically investigate SSA,
we introduce Beyond Safe Answers (BSA), a novel benchmark comprising 2,000
challenging instances organized into three distinct SSA scenarios and spanning
nine risk categories, each annotated with a risk rationale. We evaluate 23
state-of-the-art LRMs and demonstrate the difficulty of this benchmark: the
best model reaches only 54.57% accuracy on risk-rationale identification. Current
benchmarks are largely blind to this latent risk; to our knowledge, BSA is the
first benchmark designed to systematically diagnose SSA. We further explore the
efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse
decoding strategies in mitigating SSA. Our work aims to establish verifiably
robust safety reasoning in LRMs, moving beyond superficial compliance and
enabling practitioners to evaluate and improve safety-reasoning fidelity with
measurable evidence.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13570