Keywords: Guardrail Model; LLM; Attention Dilution
Abstract: Guardrail models are classifiers deployed to screen malicious prompts and responses in LLM-based services. To meet latency constraints, many lightweight guardrails adopt compact Transformer backbones (e.g., DeBERTa) that are trained with short context windows (typically around 512 tokens) and rely on bucketed relative positional encodings to operate on longer inputs. Prior evaluations largely assume that a guardrail's decision is stable as the input is lengthened.
We show that this assumption can fail. We identify \emph{Overflip}, a repetition-induced instability in which simply repeating a prompt causes the guardrail's prediction to flip as the sequence grows, enabling a practical bypass (MAL$\to$BEN) without any semantic manipulation. Across 9 widely used lightweight guardrail models, 5 exhibit MAL$\to$BEN flips on a benchmark of 100 prompts; among the vulnerable models, flip rates range from 8\% to 87\%, first flips occur at roughly 2.6k--9.4k tokens, and confidence margins shrink steadily with repetition.
Our analysis suggests Overflip is not explained by traditional attention-dilution baselines (e.g., benign padding or shuffling): despite preserving content, repetition can homogenize token-level attention over repeated structure and induce a distinct, more gradual attention-dispersion trajectory than padding. Because the bypassed prompt remains semantically intact and is still readily understood by downstream business LLMs, it can transmit malicious intent after passing the guardrail. These findings expose input length as an attack surface for safety filters and motivate length-robust evaluation and mitigation for lightweight guardrail deployments.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: LLM; Guardrail model; Safety
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 7861