Keywords: Large Language Model, Prompt Engineering, Adversarial Control, Non-Monotonicity, Tail Risk
Abstract: Adding more rules to LLM prompts does not make them safer---it often makes them worse.
Prior work has shown that longer prompts degrade LLM performance on standard benchmarks, but the effect under adversarial pressure---where opponents actively exploit weaknesses---remains unexplored.
We address this gap by analyzing over 1,000 prompt expansion events in a competitive red-teaming contest (29,084 matches, 247 participants) and find that the median score change from adding text is zero.
Furthermore, 11--17\% of expansions trigger severe performance collapse, with the worst cases losing more than half of their potential score.
The pattern is counterintuitive and non-monotonic: for attacks, the risk of degradation falls under moderate expansion but spikes at large expansion magnitudes; for defenses, prompts of medium baseline length suffer the worst outcomes.
These findings expose a hidden danger in the ``more constraints is better'' heuristic widely used in LLM safety practice.
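Illustration: a minimal sketch (not the authors' code) of the two tail-risk statistics the abstract reports, assuming hypothetical per-event records with a score before and after a prompt expansion and a known maximum attainable score; the event data and the "severe collapse" threshold below are illustrative assumptions.

```python
# Sketch of the abstract's statistics over hypothetical expansion events.
from statistics import median

# Hypothetical events: (score_before, score_after, max_score)
events = [
    (0.62, 0.62, 1.0),  # no change -- the median case reported
    (0.55, 0.70, 1.0),  # expansion helps
    (0.80, 0.25, 1.0),  # severe collapse: loses > half of potential score
]

deltas = [after - before for before, after, _ in events]
print("median score change:", median(deltas))

# "Severe collapse" here: the expansion costs more than half of the
# maximum attainable score (one plausible reading of the abstract).
severe = [d for d, (_, _, mx) in zip(deltas, events) if d < -0.5 * mx]
print("severe collapse rate:", len(severe) / len(events))
```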
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 40