Keywords: benchmarks, large reasoning model, safety, risk awareness
Abstract: Existing safety evaluations primarily assess response-level safety, leaving
reasoning-level risks unmeasured. Despite the remarkable proficiency of Large
Reasoning Models (LRMs) in handling complex reasoning tasks, their reliability in
safety-critical scenarios remains uncertain. We identify Superficial Safety
Alignment (SSA): a phenomenon in which models produce superficially safe outputs
while their internal reasoning fails to genuinely detect and mitigate underlying
risks, creating a dangerous illusion of safety and rendering systems prone to
catastrophic failure under minor perturbations. To systematically investigate SSA,
we introduce Beyond Safe Answers (BSA), a novel benchmark comprising 2,000
challenging instances organized into three distinct SSA scenarios and spanning
nine risk categories, each annotated with a risk rationale. We evaluate 23
state-of-the-art LRMs and demonstrate the difficulty of this benchmark: the
best model reaches only 54.57% accuracy on risk-rationale identification. Current
benchmarks are largely blind to this latent risk; to our knowledge, BSA is the
first benchmark designed to systematically diagnose SSA. We further explore the
efficacy of safety rules, specialized fine-tuning on safety reasoning data, and diverse
decoding strategies in mitigating SSA. Our work aims to establish verifiably
robust safety reasoning in LRMs, moving beyond superficial compliance and
enabling practitioners to evaluate and improve safety-reasoning fidelity with
measurable evidence.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 13570