Keywords: Guardrail Model; LLM; Attention Dilution
Abstract: Guardrail models are classifiers deployed to screen malicious prompts and responses in LLM-based services. To meet latency constraints, many lightweight guardrails adopt compact Transformer backbones (e.g., DeBERTa) that are trained with short context windows (typically around 512 tokens) and rely on bucketed relative positional encodings to operate on longer inputs. Prior evaluations largely assume that a guardrail's decision is stable as the input is lengthened.
We show that this assumption can fail. We identify \emph{Overflip}, a repetition-induced instability in which simply repeating a prompt causes the guardrail's prediction to flip as the sequence grows, enabling a practical bypass (MAL$\to$BEN) without any semantic manipulation. Across 9 widely used lightweight guardrail models, 5 exhibit MAL$\to$BEN flips on a benchmark of 100 prompts; among the vulnerable models, flip rates range from 8\% to 87\%, first flips occur at roughly 2.6k--9.4k tokens, and confidence margins shrink steadily with repetition.
Our analysis suggests Overflip is not explained by traditional attention-dilution baselines (e.g., benign padding or shuffling): despite preserving content, repetition can homogenize token-level attention over repeated structure and induce a distinct, more gradual attention-dispersion trajectory than padding. Because the bypassed prompt remains semantically intact and is still readily understood by downstream business LLMs, it can transmit malicious intent after passing the guardrail. These findings expose input length as an attack surface for safety filters and motivate length-robust evaluation and mitigation for lightweight guardrail deployments.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: LLM; Guardrail model; Safety
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7861