LatentShield: Leveraging Safety Patterns in Latent Space

ACL ARR 2025 July Submission 952 Authors

29 Jul 2025 (modified: 08 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: While LLMs undergo extensive safety training, whether they internally encode the distinction between safe and unsafe inputs remains an open question. This paper investigates intrinsic safety patterns in the latent space of large language models (LLMs), examining how safety-aligned models separate safe from unsafe inputs at the representation level. We perform a comprehensive analysis across 10 models and multiple datasets spanning safe, unsafe, and adversarial prompts. We show that LLMs implicitly encode safety-related patterns in their activation space, and that these patterns can be leveraged for proactive safety detection. We introduce LatentShield, a mechanism for early unsafe-input detection based on internal representations. LatentShield outperforms state-of-the-art safety shields, LlamaGuard 2 and LlamaGuard 3, by up to 42% on the most challenging unsafe dataset, Q-Harm. On adversarial attacks, the LlamaGuards' performance collapses to 25%, compared with 58.5% for LatentShield. Our findings strongly suggest that a model's own representations can be leveraged to build a high-performing, lightweight, model-specific safety shield.
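To illustrate the general idea of representation-based safety detection described in the abstract (this is not the paper's LatentShield implementation), the following is a minimal sketch assuming a linear probe trained on last-token hidden states of a causal LM; the model name, layer index, pooling choice, and classifier here are all assumptions made for illustration.

```python
# Minimal sketch: probe hidden-state activations to flag unsafe prompts before generation.
# All specifics (model, layer, classifier) are illustrative assumptions, not the paper's method.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed stand-in model, not from the paper
LAYER = 12                                 # assumed intermediate layer, not from the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

@torch.no_grad()
def prompt_activation(prompt: str) -> torch.Tensor:
    """Return the last-token hidden state at LAYER for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    hidden_states = model(**inputs).hidden_states  # tuple of (num_layers + 1) x [1, seq, dim]
    return hidden_states[LAYER][0, -1].float().cpu()

def fit_probe(safe_prompts, unsafe_prompts):
    """Fit a logistic-regression probe on activations of labeled safe/unsafe prompts."""
    feats = torch.stack([prompt_activation(p) for p in safe_prompts + unsafe_prompts]).numpy()
    labels = [0] * len(safe_prompts) + [1] * len(unsafe_prompts)
    return LogisticRegression(max_iter=1000).fit(feats, labels)

def flag_unsafe(probe, prompt: str, threshold: float = 0.5) -> bool:
    """Flag a prompt as unsafe if the probe's predicted probability exceeds the threshold."""
    score = probe.predict_proba(prompt_activation(prompt).numpy()[None, :])[0, 1]
    return score >= threshold
```

In such a setup, the probe runs on activations from a single forward pass over the prompt, so the check happens before any tokens are generated; how closely this matches LatentShield's actual mechanism is not specified in this metadata page.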
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Models, Interpretability, AI Alignment, AI Safety, Latent Space Analysis
Contribution Types: Model analysis & interpretability
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Ethics Statement
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3 / Section 5
B2 Discuss The License For Artifacts: No
B2 Elaboration: We did not discuss licenses because all artifacts were sourced from publicly released research datasets/models or publicly accessible web content, and our focus was on empirical analysis of latent-space representations rather than redistribution or modification of these artifacts.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We did not explicitly discuss intended use. While most datasets were publicly released for research, a small portion of prompts was crawled from publicly accessible websites. These were used strictly for non-commercial research purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The collected web prompts were manually reviewed to ensure they did not contain any personally identifiable information. Some datasets (e.g., HarmBench, MaliciousInstruct) intentionally include offensive content by design to test model safety. These were used as-is, strictly for research and evaluation of safety detection systems. No private user data was included.
B5 Documentation Of Artifacts: No
B5 Elaboration: While we provided dataset names, sample counts, and categories (safe, unsafe, adversarially safe), we did not include detailed documentation on linguistic properties, domain coverage, or demographic representation. This is because most datasets were sourced from prior work that did not include such metadata, and our primary focus was on model activation patterns rather than linguistic or demographic analysis.
B6 Statistics For Data: Yes
B6 Elaboration: Section 3
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: We used standard models whose parameter sizes are publicly known and well documented.
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: We did not perform any hyperparameter search and used standard settings.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5
C4 Parameters For Packages: Yes
C4 Elaboration: Section 5
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We did not include information about AI assistant use, as it was limited to minor support tasks such as grammar correction and code writing assistance.
Author Submission Checklist: yes
Submission Number: 952