Keywords: Large Language Model, Prompt Engineering, Adversarial Control, Non-Monotonicity, Tail Risk
Abstract: Adding more rules to LLM prompts does not make them safer---it often makes them worse.
Prior work has shown that longer prompts degrade LLM performance on standard benchmarks, but the effect under adversarial pressure---where opponents actively exploit weaknesses---remains unexplored.
We address this gap by analyzing over 1,000 prompt expansion events in a competitive red-teaming contest (29,084 matches, 247 participants) and find that the median score change from adding text is zero.
Furthermore, 11--17\% of expansions trigger severe performance collapse, with the worst cases losing more than half of their potential score.
The pattern is counterintuitive and non-monotonic: for attacks, the risk of degradation falls under moderate expansion but spikes at large expansion magnitudes; for defenses, prompts of medium baseline length suffer the worst outcomes.
These findings expose a hidden danger in the ``more constraints is better'' heuristic widely used in LLM safety practice.
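Illustration: a minimal sketch (not the authors' code) of the two tail-risk statistics the abstract reports, assuming hypothetical per-event records with a score before and after a prompt expansion and a known maximum attainable score; the event data and the "severe collapse" threshold below are illustrative assumptions.

```python
# Sketch of the abstract's statistics over hypothetical expansion events.
from statistics import median

# Hypothetical events: (score_before, score_after, max_score)
events = [
    (0.62, 0.62, 1.0),  # no change -- the median case reported
    (0.55, 0.70, 1.0),  # expansion helps
    (0.80, 0.25, 1.0),  # severe collapse: loses > half of potential score
]

deltas = [after - before for before, after, _ in events]
print("median score change:", median(deltas))

# "Severe collapse" here: the expansion costs more than half of the
# maximum attainable score (one plausible reading of the abstract).
severe = [d for d, (_, _, mx) in zip(deltas, events) if d < -0.5 * mx]
print("severe collapse rate:", len(severe) / len(events))
```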
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 40