Large Language Models Generate Harmful Content Using a Unified Mechanism

Published: 05 Mar 2026, Last Modified: 25 Apr 2026
Venue: ICLR 2026 Workshop LLM Reasoning
License: CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: safety, harmfulness, interpretability
TL;DR: We use pruning as a causal analysis tool and find that harmful generation is compressed into a small set of weights in aligned LLMs.
Abstract: Large language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce "emergent misalignment" that generalizes broadly. It remains unclear whether this brittleness reflects fundamental limits of alignment or merely our failure to leverage model capabilities. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We demonstrate that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models show greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, even as safety guardrails remain brittle at the surface. This explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm-generation weights in a narrow domain substantially reduces emergent misalignment. Notably, harm generation and reasoning are dissociable: models can lose the ability to produce harmful content while retaining the ability to recognize and explain it. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.
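The submission page does not describe the method in detail, but the following is a minimal sketch of what "targeted weight pruning as a causal intervention" could look like in practice. Everything here is an assumption for illustration: the checkpoint name, the helper functions (`importance_scores`, `prune_topk`), and the gradient-times-weight saliency criterion are hypothetical, not the paper's actual procedure.

```python
# Illustrative sketch (NOT the paper's method): localize weights whose
# saliency is high on harmful prompts but low on benign ones, then zero them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def importance_scores(model, tokenizer, prompts):
    """Accumulate per-weight |w * grad| saliency over a prompt set
    (a common saliency heuristic; the paper's criterion is unknown)."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for text in prompts:
        batch = tokenizer(text, return_tensors="pt")
        model.zero_grad()
        loss = model(**batch, labels=batch["input_ids"]).loss
        loss.backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                scores[n] += (p.grad * p.detach()).abs()
    return scores

def prune_topk(model, harm_scores, benign_scores, frac=1e-4):
    """Zero the small fraction of weights most important for harmful
    prompts relative to benign ones."""
    for n, p in model.named_parameters():
        diff = harm_scores[n] - benign_scores[n]
        k = max(1, int(frac * diff.numel()))
        thresh = diff.flatten().topk(k).values.min()
        with torch.no_grad():
            p[diff >= thresh] = 0.0

model_name = "my-aligned-llm"  # hypothetical checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

harm = importance_scores(model, tokenizer, ["<harmful prompt set>"])
benign = importance_scores(model, tokenizer, ["<benign prompt set>"])
prune_topk(model, harm, benign, frac=1e-4)
```

Contrasting harmful-prompt saliency against benign-prompt saliency mirrors the abstract's claim that harm-generation weights are compact and distinct from benign capabilities: if the claim holds, zeroing only the differential top fraction should ablate harmful generation while leaving benign behavior largely intact.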
Presenter: ~Hadas_Orgad1
Format: Yes, the presenting author will attend in person if this work is accepted to the workshop.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: No, the presenting author of this submission does *not* fall under ICLR’s funding aims, or has sufficient alternate funding.
Submission Number: 105