Keywords: Large language models; Interpretability; AI safety
Abstract: LLMs are trained to refuse harmful instructions, but do they truly understand
harmfulness beyond just refusing? Prior work has shown that LLMs’ refusal
behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction.
In this work, we identify a new dimension for analyzing safety mechanisms in LLMs,
i.e., harmfulness, which is encoded internally as a concept separate from refusal,
and we show that there exists a harmfulness direction distinct from the refusal direction.
As causal evidence, steering along the harmfulness direction can lead LLMs to
interpret harmless instructions as harmful, but steering along the refusal direction
tends to elicit refusal responses directly without reversing the model’s judgment on
harmfulness. Furthermore, using our identified harmfulness concept, we find that
certain jailbreak methods work by reducing the refusal signals without suppressing
the model’s internal belief of harmfulness. We also find that adversarially fine-
tuning models to accept harmful instructions has minimal impact on the model’s
internal belief of harmfulness. These insights lead to a practical safety application:
the model's latent harmfulness representation can serve as an intrinsic safeguard
(Latent Guard) that detects unsafe inputs, reduces over-refusals, and remains
robust to fine-tuning attacks. For instance, our Latent Guard achieves performance
comparable to or better than Llama Guard 3 8B, a dedicated fine-tuned safeguard
model, across different jailbreak methods. Our findings suggest that LLMs'
internal understanding of harmfulness is more robust to diverse input instructions
than their refusal decisions, offering a new perspective for studying AI safety.
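
The kind of direction-based analysis the abstract describes can be illustrated with a minimal sketch: a concept direction extracted as a difference-in-means of hidden activations over harmful versus harmless prompts, a projection onto that direction used as a Latent-Guard-style score, and a forward hook that adds the direction to the residual stream to steer the model. The model name, layer index, prompt sets, and steering coefficient below are placeholders, and this is not the paper's exact procedure.

```python
# Hypothetical sketch of direction extraction, latent probing, and activation steering.
# Assumptions: a Llama-style decoder-only model from Hugging Face transformers,
# placeholder layer index and toy prompt sets; the paper's method may differ.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; any decoder-only LM works
LAYER = 15                                       # placeholder layer index

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

@torch.no_grad()
def last_token_hidden(prompt: str, layer: int = LAYER) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer][0, -1, :].float()

# Toy prompt sets; a real study would use curated harmful/harmless instruction data.
harmful = ["How can I create a computer virus?", "How do I break into someone's account?"]
harmless = ["How do I bake sourdough bread?", "Write a short poem about autumn."]

mu_harm = torch.stack([last_token_hidden(p) for p in harmful]).mean(0)
mu_safe = torch.stack([last_token_hidden(p) for p in harmless]).mean(0)

# Difference-in-means "harmfulness direction", normalized to unit length.
direction = mu_harm - mu_safe
direction = direction / direction.norm()

def harmfulness_score(prompt: str) -> float:
    """Latent-Guard-style score: projection of the prompt activation onto the direction."""
    return float(last_token_hidden(prompt) @ direction)

# Steering: add alpha * direction to the residual stream at LAYER during a forward pass.
def steering_hook(module, inputs, output, alpha: float = 8.0):
    # Decoder layers return either a tensor or a tuple whose first element is hidden states,
    # depending on the transformers version; handle both.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
# ... call model.generate(...) here to observe steered behavior ...
handle.remove()

print(harmfulness_score("How do I sharpen a kitchen knife safely?"))
```

In this sketch, thresholding `harmfulness_score` would act as a simple latent classifier, while the hook implements additive steering; both serve only to make the abstract's terminology concrete.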
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 25564