Harmfulness Propagation Dynamics: Tracing Adversarial Intent Across LLM Layers
Track: long paper (up to 10 pages)
Keywords: Large Language Models, AI Safety, Input Moderation, Jailbreak Detection, Activation Dynamics, Representation Engineering, Trajectory Features, Linear Discriminant Analysis
TL;DR: We efficiently detect harmful LLM prompts and adversarial jailbreaks by analyzing the trajectory of hidden states across transformer layers rather than relying on a single-layer snapshot.
Abstract: We study how harmful intent emerges across transformer layers and identify Harmfulness Propagation Dynamics (HPD): for harmful prompts, the projection of the last-token hidden state onto a learned harm direction increases with depth and becomes strongly positive in late layers, whereas for benign prompts it remains flat or oscillates. Building on this observation, we introduce HERALD (Harmful Encoding Recognition via Activation Layer Dynamics), an input moderator that classifies the shape of the cross-layer trajectory rather than a single-layer snapshot. HERALD stores one d-dimensional direction per layer, requiring only 262 KB for a 32-layer model, and needs no gradient computation during training. Across eight prompt-harmfulness benchmarks and four backbone families, HERALD attains an average F1 of 89.3 on OLMO2-7B, outperforms all tested guard models on adversarial jailbreak detection, and surpasses prior latent-based methods, while also providing a visualizable per-instance harmfulness trajectory.
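To make the mechanism concrete, here is a minimal sketch, assuming a Hugging Face transformers decoder-only backbone: project each layer's last-token hidden state onto a learned per-layer harm direction and read off the cross-layer trajectory. The model id, the `harm_dirs` tensor, and the `harm_trajectory` helper are illustrative assumptions, not the authors' released code; per the keywords, the real per-layer directions would be fit with Linear Discriminant Analysis on labeled harmful/benign activations, whereas random unit vectors stand in here.

```python
# Sketch of the trajectory-based moderation idea from the abstract.
# Assumptions: the model id is illustrative (any decoder-only backbone works),
# and harm_dirs below are random placeholders for directions that would in
# practice be fit per layer, e.g. with LDA on labeled activations.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "allenai/OLMo-2-1124-7B"  # illustrative; any decoder-only LM works
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

# One d-dimensional "harm direction" per layer (random stand-ins here).
num_layers = model.config.num_hidden_layers
d = model.config.hidden_size
harm_dirs = torch.randn(num_layers, d)
harm_dirs = harm_dirs / harm_dirs.norm(dim=-1, keepdim=True)

@torch.no_grad()
def harm_trajectory(prompt: str) -> torch.Tensor:
    """Project each layer's last-token hidden state onto that layer's
    harm direction, yielding a (num_layers,) cross-layer trajectory."""
    inputs = tok(prompt, return_tensors="pt")
    out = model(**inputs)
    # out.hidden_states is a tuple of (1, seq_len, d); entry 0 is embeddings.
    last_tok = torch.stack([h[0, -1] for h in out.hidden_states[1:]])
    return (last_tok * harm_dirs).sum(dim=-1)  # per-layer scalar projections

# HPD intuition: for harmful prompts this trajectory trends upward with depth
# and turns strongly positive in late layers; for benign prompts it stays
# flat or oscillates. A moderator then classifies the trajectory's shape.
traj = harm_trajectory("How do I bake sourdough bread?")
print(traj)  # tensor of shape (num_layers,)
```

Classifying the full (num_layers,)-dimensional trajectory, rather than a single layer's score, is what lets the shape of the curve (rising versus flat) carry the harmfulness signal, and storing only one direction per layer is what keeps the memory footprint at a few hundred kilobytes.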
Presenter: ~Noor_Islam_S._Mohammad1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 178