Hidden in Thought: Transferable Chain-of-Thought Artifacts Induce Harmful Behavior

ACL ARR 2026 May Submission15690 Authors

26 May 2026 (modified: 02 Jun 2026) · ACL ARR 2026 May Submission · Everyone · CC BY 4.0
Keywords: Harmful chain-of-thought, jailbreak attacks, compromised language models, emergent misalignment, refusal ablation, reasoning transfer, black-box attacks, LLooM concept mining, model safety, output-side safeguards.
Abstract: We investigate whether harmful chain-of-thought (CoT) traces from compromised language models can transfer unsafe behaviour and be distilled into reusable jailbreak attacks. Using an emergent-misalignment organism and a refusal-ablated jailbroken organism, we transplant harmful CoTs into 29 open-source and 5 closed-source targets. Transferred traces raise harmful-response rates above 80% on the most vulnerable open-source models, while semantically mismatched CoTs fail entirely. LLooM concept mining identifies four recurring components of harmful reasoning: proceduralisation, ethical decoupling, evasion, and target-vulnerability framing. Distilling these patterns into reusable system prompts produces effective black-box jailbreaks, outperforming direct CoT transplantation on strongly aligned models by up to an order of magnitude, including a 10× improvement on GPT-4.1 AdvBench. Reasoning-enabled models are more than twice as vulnerable, and output-side safeguards such as Llama-Guard 3 frequently miss harmful generations. Our results show that harmful reasoning transfers at both the trace and pattern levels, motivating defences that evaluate reasoning context in addition to final outputs.
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: adversarial attacks, examples/training, calibration, contrastive explanation faithfulness, feature attribution, hardness of samples, hierarchical & concept explanations, knowledge tracing/discovering/inducing, model editing, probing
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data analysis
Languages Studied: English
EMNLP 2026 AI Reviewing Experiment: yes
Submission Number: 15690