N-GLARE: A Non-Generative Latent Representation-Efficient LLM Safety Evaluator

ACL ARR 2026 January Submission5522 Authors

05 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0
Keywords: N-GLARE, LLM safety evaluation, latent representations, trajectory geometry, Angular-Probabilistic Trajectory (APT), Jensen–Shannon Separability (JSS), red teaming, safety ranking, evaluation
Abstract: Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream red teaming methods rely on online generation and black-box output analysis. These approaches are costly and suffer from feedback latency, which makes them unsuitable for agile diagnostics after training a new model. To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden-layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations, and it introduces the JSS (Jensen–Shannon Separability) metric. Experiments on over 40 models and 20 red teaming strategies demonstrate that the JSS metric exhibits high consistency with red teaming safety rankings at less than 1% of the token and runtime cost.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: security and privacy, red teaming, adversarial attacks/examples/training, probing, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5522
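
The abstract names a Jensen–Shannon-based separability metric computed over angular trajectories of latent representations, but it does not give the paper's definitions of APT or JSS. The sketch below is therefore only a generic illustration of the underlying idea, not the authors' method: it measures the angle between two hidden-state vectors, bins angles into a histogram, and computes the standard Jensen–Shannon divergence (base 2, bounded in [0, 1]) between two such histograms. The function names, the binning scheme, and the comparison of "benign" versus "harmful" prompt sets are all assumptions made for illustration.

```python
import math


def angle_between(u, v):
    """Angle in radians between two vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("angle is undefined for a zero vector")
    # Clamp to guard against floating-point drift outside [-1, 1].
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos)


def histogram(values, bins, lo, hi):
    """Count values into equal-width bins over [lo, hi]."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))  # keep edge values in range
        counts[idx] += 1
    return counts


def js_divergence(p, q):
    """Standard Jensen-Shannon divergence (base 2) of two discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    sum_p, sum_q = sum(p), sum(q)
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("each distribution needs positive total mass")
    p = [x / sum_p for x in p]
    q = [x / sum_q for x in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Assumed toy data: angles (radians) observed for two prompt sets.
benign_angles = [0.10, 0.15, 0.20, 0.25]
harmful_angles = [1.20, 1.30, 1.40, 1.50]
p = histogram(benign_angles, bins=4, lo=0.0, hi=math.pi)
q = histogram(harmful_angles, bins=4, lo=0.0, hi=math.pi)
separability = js_divergence(p, q)
```

Under these assumptions, a value near 1 means the two angle distributions barely overlap and a value near 0 means they are nearly identical. How the paper aggregates such a quantity across layers or prompts is not stated in the abstract and is left open here.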