Probing Reasoning Flaws and Safety Hierarchies with Chain-of-Thought Difference Amplification

Published: 24 Sept 2025, Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: Large Language Models, LLM Safety, AI Alignment, Red Teaming, Model Probing, Interpretability, Chain-of-Thought, Logit Steering, Latent Behavior, Failure Detection, Safety Guardrails
TL;DR: This paper introduces a logit-steering technique that amplifies differences in chain-of-thought reasoning to systematically bypass LLM safety guardrails and reveal a hierarchy of hidden vulnerabilities.
Abstract: Detecting rare but critical failures in Large Language Models (LLMs) is a pressing challenge for safe deployment, as vulnerabilities introduced during alignment are often missed by standard benchmarks. We introduce Chain-of-Thought Difference (CoT Diff) Amplification, a logit-steering technique that systematically probes model reasoning. The method steers inference by amplifying the difference between outputs conditioned on two contrastive reasoning paths, allowing for targeted pressure-testing of a model’s behavioral tendencies. We apply this technique to a base model and a domain-adapted variant across a suite of safety and factual-coherence benchmarks. Our primary finding is a clear hierarchy in the model’s safety guardrails: while the model refuses to provide unethical advice or pseudoscience at baseline, it readily generates detailed misinformation when prompted with a specific persona, revealing a critical vulnerability even without amplification.
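To make the steering idea concrete, below is a minimal sketch of what amplifying the logit difference between two contrastive reasoning paths could look like. It is not the authors' released implementation: the model choice, the function name cot_diff_generate, the amplification formula steered = logits_A + alpha * (logits_A - logits_B), and the greedy decoding loop are all illustrative assumptions based only on the abstract's description.

```python
# Hypothetical sketch of CoT-difference logit steering (assumptions noted above).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def cot_diff_generate(question, cot_a, cot_b, alpha=1.5, max_new_tokens=64):
    """Greedy decoding steered by the difference between two reasoning paths.

    At each step, next-token logits conditioned on reasoning path A are pushed
    further away from those conditioned on path B:
        steered = logits_A + alpha * (logits_A - logits_B)
    """
    ids_a = tok(question + cot_a, return_tensors="pt").input_ids
    ids_b = tok(question + cot_b, return_tensors="pt").input_ids
    generated = []
    for _ in range(max_new_tokens):
        with torch.no_grad():
            logits_a = model(ids_a).logits[:, -1, :]
            logits_b = model(ids_b).logits[:, -1, :]
        steered = logits_a + alpha * (logits_a - logits_b)
        next_id = steered.argmax(dim=-1, keepdim=True)
        if next_id.item() == tok.eos_token_id:
            break
        generated.append(next_id.item())
        # Append the chosen token to both contexts so the two paths stay in sync.
        ids_a = torch.cat([ids_a, next_id], dim=-1)
        ids_b = torch.cat([ids_b, next_id], dim=-1)
    return tok.decode(generated)
```

In this reading, alpha controls how strongly the continuation is pushed toward the behavior implied by path A and away from path B; setting alpha = 0 recovers ordinary decoding conditioned on path A.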
Submission Number: 180