Keywords: N-GLARE, LLM safety evaluation, latent representations, trajectory geometry, Angular-Probabilistic Trajectory (APT), Jensen–Shannon Separability (JSS), red teaming, safety ranking, evaluation
Abstract: Evaluating the safety robustness of LLMs is critical for their deployment. However, mainstream red-teaming methods rely on online generation and black-box output analysis. These approaches are costly and suffer from feedback latency, which makes them unsuitable for agile diagnostics after training a new model.
To address this, we propose N-GLARE (A Non-Generative, Latent Representation-Efficient LLM Safety Evaluator). N-GLARE operates entirely on the model's latent representations, bypassing the need for full text generation. It characterizes hidden-layer dynamics by analyzing the APT (Angular-Probabilistic Trajectory) of latent representations, and it introduces the JSS (Jensen–Shannon Separability) metric.
Experiments on over 40 models and 20 red-teaming strategies demonstrate that the JSS metric is highly consistent with safety rankings derived from red teaming, at less than 1% of the token and runtime cost.
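The abstract does not define APT or JSS precisely, so the following is only a minimal sketch of the standard ingredients the names suggest: an angle between hidden-state vectors and a Jensen–Shannon divergence between two distributions of such angles. The function names, the histogram binning, and the benign/harmful split are assumptions made for illustration, not the authors' method.

```python
import math

def angle_between(u, v):
    """Angle in radians between two vectors (e.g. hidden states of adjacent layers)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp for numerical safety
    return math.acos(cos)

def histogram(values, bins, lo, hi):
    """Normalized histogram of values over [lo, hi] with equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in values:
        idx = max(0, min(int((x - lo) / width), bins - 1))
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between discrete distributions; lies in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical angle samples for benign and harmful prompts (made-up numbers).
benign_angles = [0.10, 0.15, 0.20, 0.22, 0.30]
harmful_angles = [0.90, 1.00, 1.10, 1.20, 1.25]
p = histogram(benign_angles, bins=4, lo=0.0, hi=math.pi / 2)
q = histogram(harmful_angles, bins=4, lo=0.0, hi=math.pi / 2)
separability = js_divergence(p, q)
```

Under these assumptions, a value of `separability` near 1 would mean the two prompt groups produce clearly different angular distributions, and a value near 0 would mean they are indistinguishable. No text generation is involved, only forward-pass hidden states.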
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: security and privacy, red teaming, adversarial attacks/examples/training, probing, robustness
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 5522