PRISM: Multidimensional Safety Risk Detection and Measurement Beyond Superficial Compliance for LLMs

ACL ARR 2026 January Submission 1586 Authors

30 Dec 2025 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Large Language Models, LLM Safety, Mechanistic Interpretability, Severity Assessment, Value Alignment, Representation Probing, Prototypical Learning
Abstract: Safety alignment in Large Language Models (LLMs) often relies on "surface compliance" measures, treating refusal mechanisms as black boxes. We propose PRISM (Prototypical Representation for Internal Safety Mapping), a mechanistic framework that probes internal model states to map a two-stage Safety Spectrum. First, utilizing spectral analysis of model weights, we localize a distinct functional Safety Center where categorical risk taxonomies are distinguished through prototypical interactions within the latent manifold. Second, PRISM facilitates Rationalized Induction, calibrating violation magnitude into a transparent, multidimensional ordinal spectrum via a structured protocol. Our results demonstrate that by probing the model's internal judgment, a lightweight 8B backbone achieves superior calibration reliability and mechanistic transparency compared to massive closed-source judges like GPT-4o, providing a scalable foundation for fine-grained value alignment governance.
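The abstract describes classifying risk categories by prototypical interactions in the model's latent space. As an illustrative sketch only (not the paper's implementation), the core idea of nearest-prototype classification over hidden-state representations can be shown with synthetic data; the cluster names, dimensions, and distance metric below are all assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden states extracted at one layer: each risk
# category occupies a distinct region of the latent space.
# (Synthetic data; a real probe would use actual LLM activations.)
def make_cluster(center, n=50, dim=16, scale=0.1):
    return center + scale * rng.standard_normal((n, dim))

dim = 16
centers = {c: rng.standard_normal(dim) for c in ["safe", "borderline", "harmful"]}
train = {c: make_cluster(mu, dim=dim) for c, mu in centers.items()}

# Prototypical classification: each category's prototype is the mean of
# its training representations; a new hidden state is assigned to the
# nearest prototype (cosine distance here, chosen for illustration).
prototypes = {c: X.mean(axis=0) for c, X in train.items()}

def cos_dist(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def classify(h):
    return min(prototypes, key=lambda c: cos_dist(h, prototypes[c]))

# A held-out activation drawn near the "harmful" center maps back
# to the "harmful" prototype.
query = centers["harmful"] + 0.1 * rng.standard_normal(dim)
print(classify(query))
```

This captures only the categorical stage; the paper's second stage (calibrating violation magnitude onto an ordinal severity spectrum) would additionally order distances or scores rather than taking a hard argmin.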
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: safety and alignment, probing, calibration/uncertainty
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 1586