Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad; Boyi Wei; Kaden Zheng; Martin Wattenberg; Peter Henderson; Seraphina Goldfarb-Tarrant; Yonatan Belinkov

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

Hadas Orgad, Boyi Wei, Kaden Zheng, Martin Wattenberg, Peter Henderson, Seraphina Goldfarb-Tarrant, Yonatan Belinkov

Published: 11 Jun 2026, Last Modified: 11 Jun 2026Mech Interp Workshop ICML 2026 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Interpretability for AI Safety, Methods (probing, steering, causal interventions)

Other Keywords: harmfulness,weight analysis,pruning,jailbreaks

TL;DR: Harmful content generation in LLMs relies on a compact, generalizable set of weights distinct from benign capabilities; it helps explain emergent misalignment and suggests a route to more principled safety interventions.

Abstract: Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally—despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. Notably, LLMs harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.

Submission Number: 399

Loading