Keywords: Harmful chain-of-thought, jailbreak attacks, compromised language models, emergent misalignment, refusal ablation, reasoning transfer, black-box attacks, LLooM concept mining, model safety, output-side safeguards.
Abstract: We investigate whether harmful chain-of-thought (CoT) traces from
compromised language models can transfer unsafe behaviour and be
distilled into reusable jailbreak attacks. Using an emergent-misalignment
organism and a refusal-ablated jailbroken organism, we transplant
harmful CoTs into $29$ open-source and $5$ closed-source targets.
Transferred traces raise harmful-response rates above $80\%$ on the most
vulnerable open-source models, while semantically mismatched CoTs fail
entirely. LLooM concept mining identifies four recurring components of
harmful reasoning: proceduralisation, ethical decoupling, evasion, and
target--vulnerability framing. Distilling these patterns into reusable
system prompts produces effective black-box jailbreaks, outperforming
direct CoT transplantation on strongly aligned models by up to an order
of magnitude, including a $10\times$ improvement for GPT-4.1 on AdvBench.
Reasoning-enabled models are more than twice as vulnerable, and
output-side safeguards such as Llama-Guard~3 frequently miss harmful
generations. Our results show that harmful reasoning transfers at both
the trace and pattern levels, motivating defences that evaluate
reasoning context in addition to final outputs.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks, examples/training, calibration, contrastive explanation faithfulness, feature attribution, hardness of samples, hierarchical & concept explanations, knowledge tracing/discovering/inducing, model editing, probing
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 15690