Keywords: LLMs, AI Safety, Systematic Generalization, Evaluation
TL;DR: SAGE-Eval is the first benchmark to test whether frontier LLMs robustly generalize critical safety knowledge to novel situations; the strongest model we tested passed only 58% of the safety facts evaluated.
Abstract: Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions, for instance: "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injury or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well-established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top-performing model, Claude-3.7-Sonnet, passes only 58% of the safety facts tested. We also observe that model capabilities and training compute correlate only weakly with performance on SAGE-Eval, suggesting that scaling up alone is not the solution. Our findings indicate that frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/YuehHanChen/SAGE-Eval
Code URL: https://github.com/YuehHanChen/SAGE-Eval
Supplementary Material: zip
Primary Area: Social and economic aspects of datasets and benchmarks in machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 1110
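Below is a minimal sketch of how one might load the released dataset and run a rough SAGE-Eval-style check. The column names ("prompt", "safety_fact"), the split name, the `query_model` helper, and the keyword-overlap scoring are all assumptions for illustration; the benchmark's own grading pipeline in the linked code repository should be used for faithful results.

```python
# Hypothetical sketch: load SAGE-Eval from the Hugging Face Hub and score responses.
from datasets import load_dataset


def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation (hypothetical)."""
    raise NotImplementedError


def mentions_warning(response: str, safety_fact: str) -> bool:
    # Naive keyword-overlap proxy; the actual benchmark uses its own grading logic.
    return any(tok in response.lower() for tok in safety_fact.lower().split())


# Split name assumed; check the dataset card for the real configuration.
dataset = load_dataset("YuehHanChen/SAGE-Eval", split="train")

passed = 0
for example in dataset:
    response = query_model(example["prompt"])                      # column name assumed
    passed += mentions_warning(response, example["safety_fact"])   # column name assumed

print(f"Scenario-level pass rate: {passed / len(dataset):.1%}")
```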