Keywords: LLMs, AI Safety, Systematic Generalization, Evaluation
TL;DR: SAGE-Eval is the first benchmark to test whether frontier LLMs robustly generalize critical safety knowledge to novel situations; the strongest model we tested passed only 58% of the safety facts evaluated.
Abstract: Do LLMs robustly generalize critical safety facts to novel situations? Lacking this ability is dangerous when users ask naive questions, for instance: "I'm considering packing melon balls for my 10-month-old's lunch. What other foods would be good to include?" Before offering food options, the LLM should warn that melon balls pose a choking hazard to toddlers, as documented by the CDC. Failing to provide such warnings could result in serious injury or even death. To evaluate this, we introduce SAGE-Eval, SAfety-fact systematic GEneralization evaluation, the first benchmark that tests whether LLMs properly apply well-established safety facts to naive user queries. SAGE-Eval comprises 104 facts manually sourced from reputable organizations, systematically augmented to create 10,428 test scenarios across 7 common domains (e.g., Outdoor Activities, Medicine). We find that the top-performing model, Claude-3.7-Sonnet, passes only 58% of the safety facts tested. We also observe that model capabilities and training compute correlate only weakly with performance on SAGE-Eval, suggesting that scaling up alone is not the solution. Our findings indicate that frontier LLMs still lack robust generalization ability. We recommend developers use SAGE-Eval in pre-deployment evaluations to assess model reliability in addressing salient risks.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/YuehHanChen/SAGE-Eval
Code URL: https://github.com/YuehHanChen/SAGE-Eval
Supplementary Material: zip
Primary Area: Social and economic aspects of datasets and benchmarks in machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 1110
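Below is a minimal sketch of how one might load the released dataset and run a rough SAGE-Eval-style check. The column names ("prompt", "safety_fact"), the split name, the `query_model` helper, and the keyword-overlap scoring are all assumptions for illustration; the benchmark's own grading pipeline in the linked code repository should be used for faithful results.

```python
# Hypothetical sketch: load SAGE-Eval from the Hugging Face Hub and score responses.
from datasets import load_dataset


def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM under evaluation (hypothetical)."""
    raise NotImplementedError


def mentions_warning(response: str, safety_fact: str) -> bool:
    # Naive keyword-overlap proxy; the actual benchmark uses its own grading logic.
    return any(tok in response.lower() for tok in safety_fact.lower().split())


# Split name assumed; check the dataset card for the real configuration.
dataset = load_dataset("YuehHanChen/SAGE-Eval", split="train")

passed = 0
for example in dataset:
    response = query_model(example["prompt"])                      # column name assumed
    passed += mentions_warning(response, example["safety_fact"])   # column name assumed

print(f"Scenario-level pass rate: {passed / len(dataset):.1%}")
```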