Exposing Critical Safety Failures: A Comprehensive Safety-Weighted Evaluation of LLaMA Models for Biochemical Toxicity Screening
Keywords: LLaMA, Safety-weighted evaluation, Biochemical toxicity screening, Robustness, Generative AI safeguards, AI safety
TL;DR: We introduce SWES, a safety-weighted metric revealing hidden false negatives in LLaMA toxicity screening, showing that accuracy can be misleading and larger models don’t always ensure safer predictions.
Abstract: Large language models are increasingly deployed for biochemical safety screening, yet standard evaluation metrics can obscure asymmetric risks: a false negative (a hazardous compound passed as safe) is far more dangerous than a false positive. We present the first comprehensive safety-focused evaluation of LLaMA models across five critical biochemical datasets (Tox21, SIDER, BBBP, ClinTox, HIV) using our novel Safety-Weighted Error Score (SWES), which penalizes false negatives 5× more heavily than false positives. Our evaluation spans 30 experiments across 4 LLaMA model variants and 3 classical baselines, featuring multi-seed runs, bootstrap confidence intervals, and a comprehensive cost analysis. Our findings reveal that traditional accuracy metrics can be misleading: models achieving 90%+ accuracy on the HIV dataset exhibit poor SWES due to systematic false negatives that may allow hazardous compounds to pass safety screens. Surprisingly, classical baselines often outperform expensive LLaMA models, and larger models do not consistently provide better safety performance. Our work identifies critical gaps in current evaluation practices and provides actionable insights for safer deployment of biochemical AI. We release complete code, data, and artifacts for one-command reproduction to enable immediate adoption by the community.
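The abstract does not give SWES's exact formula, only that false negatives are penalized 5× more heavily than false positives. The following is a minimal sketch under that assumption, treating SWES as a weighted per-sample error rate; the function name, normalization, and weights' exact use are hypothetical, not the authors' definition.

```python
# Hypothetical sketch of a Safety-Weighted Error Score (SWES).
# Assumption: SWES is a weighted error rate in which each false negative
# (a hazardous compound predicted safe) counts 5x a false positive.
# The paper's exact normalization is not stated in the abstract.

def swes(y_true, y_pred, fn_weight=5.0, fp_weight=1.0):
    """Return a weighted error score; higher values mean worse safety."""
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # missed hazards
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false alarms
    return (fn_weight * fn + fp_weight * fp) / len(y_true)

# Example: a single missed hazard dominates the score even at 60% accuracy.
y_true = [1, 1, 0, 0, 1]
y_pred = [0, 1, 0, 1, 1]
print(swes(y_true, y_pred))  # (5*1 + 1*1) / 5 = 1.2
```

Under this reading, a screener that quietly misses hazardous compounds scores far worse than one that over-flags, which is the asymmetry the abstract argues plain accuracy hides.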
Submission Number: 43