LatentShield: Leveraging Safety Patterns in Latent Space

ACL ARR 2025 July Submission 952 Authors

29 Jul 2025 (modified: 08 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: While LLMs undergo extensive safety training, whether they internally encode the distinction between safe and unsafe inputs remains an open question. This paper investigates intrinsic safety patterns in the latent space of large language models (LLMs), examining how safety-aligned models separate safe from unsafe inputs at the representation level. We perform a comprehensive analysis across 10 models and multiple datasets spanning safe, unsafe, and adversarial prompts. We show that LLMs implicitly encode safety-related patterns in their activation space, and that these patterns can be leveraged for proactive safety detection. We introduce LatentShield, a mechanism for early unsafe-input detection based on internal representations. LatentShield outperforms state-of-the-art safety shields, LlamaGuard 2 and LlamaGuard 3, by up to 42% on the most challenging unsafe dataset, Q-Harm. On adversarial attacks, the LlamaGuards' performance collapses to 25%, compared with 58.5% for LatentShield. Our findings strongly suggest that a model's own representations can be leveraged to build a high-performing, lightweight, model-specific safety shield.
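To illustrate the general idea of representation-based safety detection described in the abstract (this is not the paper's LatentShield implementation), the following is a minimal sketch assuming a linear probe trained on last-token hidden states of a causal LM; the model name, layer index, pooling choice, and classifier here are all assumptions made for illustration.

```python
# Minimal sketch: probe hidden-state activations to flag unsafe prompts before generation.
# All specifics (model, layer, classifier) are illustrative assumptions, not the paper's method.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-0.5B-Instruct"  # assumed stand-in model, not from the paper
LAYER = 12                                 # assumed intermediate layer, not from the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True).eval()

@torch.no_grad()
def prompt_activation(prompt: str) -> torch.Tensor:
    """Return the last-token hidden state at LAYER for a single prompt."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    hidden_states = model(**inputs).hidden_states  # tuple of (num_layers + 1) x [1, seq, dim]
    return hidden_states[LAYER][0, -1].float().cpu()

def fit_probe(safe_prompts, unsafe_prompts):
    """Fit a logistic-regression probe on activations of labeled safe/unsafe prompts."""
    feats = torch.stack([prompt_activation(p) for p in safe_prompts + unsafe_prompts]).numpy()
    labels = [0] * len(safe_prompts) + [1] * len(unsafe_prompts)
    return LogisticRegression(max_iter=1000).fit(feats, labels)

def flag_unsafe(probe, prompt: str, threshold: float = 0.5) -> bool:
    """Flag a prompt as unsafe if the probe's predicted probability exceeds the threshold."""
    score = probe.predict_proba(prompt_activation(prompt).numpy()[None, :])[0, 1]
    return score >= threshold
```

In such a setup, the probe runs on activations from a single forward pass over the prompt, so the check happens before any tokens are generated; how closely this matches LatentShield's actual mechanism is not specified in this metadata page.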
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Models, Interpretability, AI Alignment, AI Safety, Latent Space Analysis
Contribution Types: Model analysis & interpretability
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Ethics Statement
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 3 / Section 5
B2 Discuss The License For Artifacts: No
B2 Elaboration: We did not discuss licenses because all artifacts were sourced from publicly released research datasets/models or publicly accessible web content, and our focus was on empirical analysis of latent-space representations rather than redistribution or modification of these artifacts.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: We did not explicitly discuss intended use. While most datasets were publicly released for research, a small portion of prompts was crawled from publicly accessible websites. These were used strictly for non-commercial research purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: The collected web prompts were manually reviewed to ensure they did not contain any personally identifiable information. Some datasets (e.g., HarmBench, MaliciousInstruct) intentionally include offensive content by design to test model safety. These were used as-is, strictly for research and evaluation of safety detection systems. No private user data was included.
B5 Documentation Of Artifacts: No
B5 Elaboration: While we provided dataset names, sample counts, and categories (safe, unsafe, adversarially safe), we did not include detailed documentation on linguistic properties, domain coverage, or demographic representation. This is because most datasets were sourced from prior work that did not include such metadata, and our primary focus was on model activation patterns rather than linguistic or demographic analysis.
B6 Statistics For Data: Yes
B6 Elaboration: Section 3
C Computational Experiments: Yes
C1 Model Size And Budget: No
C1 Elaboration: We used standard models whose parameter sizes are publicly known and well documented.
C2 Experimental Setup And Hyperparameters: No
C2 Elaboration: We did not perform any hyperparameter search and used standard settings.
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5
C4 Parameters For Packages: Yes
C4 Elaboration: Section 5
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We did not include information about AI assistant use, as it was limited to minor support tasks such as grammar correction and code writing assistance.
Author Submission Checklist: yes
Submission Number: 952