Large Language Models Can Follow Instructions, But Not Many at Once: Phase Transitions in Compositional Constraint Satisfaction

Published: 27 May 2026, Last Modified: 27 May 2026 | CompLearn 2026 Poster | Everyone | CC BY 4.0
Keywords: instruction following, compositional evaluation, constraint satisfaction, phase transitions, deterministic verification, LLM benchmarks, vision-language models
TL;DR: LLM accuracy decays exponentially under compositional constraint load and plateaus at ~16% beyond 5–6 constraints; the collapse arises from multiplicative accumulation of independent failures. In vision, decay is model-dependent and governed by perceptual accuracy.
Abstract: Large language models are increasingly deployed in settings that require simultaneous adherence to multiple explicit constraints. While individual constraints are often handled proficiently, the compositional regime -- where many must hold jointly -- remains poorly characterized: how rapidly does performance degrade, what governs the degradation, and is there a predictable ceiling? We introduce Constraint Saturation Evaluation (CSE), a procedurally generated stress-testing benchmark that systematically varies the number of simultaneous constraints ($k$), each scored by fully deterministic, rule-based verifiers with zero LLM-judge involvement. We apply CSE to two structurally different tasks: constrained text generation (30 constraints, 14 models, $k$=1-12, 82,000+ samples) and constrained visual selection from procedurally generated scenes (18 constraints, 10 VLMs, $k$=1-8, 70,410 samples verified against hidden scene graphs). In text, three principal findings emerge. First, every model obeys \emph{a two-regime decay pattern}: performance drops exponentially with constraint count before plateauing at an asymptotic floor ($\sim$16\%). A three-parameter model captures this trajectory and predicts held-out performance within 1.4 percentage points. Second, constraint failures are \emph{approximately independent} (mean $\varphi$=0.017, 87\% of pairs within $|\varphi|{\leq}$0.05), implying that performance collapse arises from multiplicative accumulation of individual failures rather than pairwise interference. Third, a \emph{depth-of-processing hierarchy} governs which constraints fail first: relational constraints degrade $2.4{\times}$ faster than lexical ones under compositional load---yet this ordering is uncorrelated with intrinsic single-constraint difficulty ($\rho$=-0.03), pointing to a general capacity limit rather than constraint-specific bottlenecks. 
Across all models, reliable compositional instruction following breaks down beyond 5--6 simultaneous constraints. In vision, the pattern is qualitatively different: the decay is \emph{model-dependent}. Four of 10 VLMs show compositional decay comparable to text ($-17$ to $-28$pp), while 6 plateau or invert, and constraint failures are massively correlated ($\varphi$=0.89) rather than independent---a single perception factor explains 87.9\% of co-failure variance. Whether a VLM degrades under compositional load depends on its perceptual accuracy: models that perceive the scene accurately (>60\%) experience near-zero compositional cost, while weaker perceivers (<50\%) show decay comparable to text. All verifiers, probes, scenes, and evaluation code will be released upon publication.
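The quantities named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the functional form of the three-parameter model, so the form below (exponential decay from a starting accuracy toward a floor) is an assumption, as are the function names and parameter names `start`, `rate`, and `floor`. The $\varphi$ computation is the standard phi coefficient for two binary pass/fail vectors.

```python
import math


def joint_pass_rate(per_constraint_pass):
    """Probability that every constraint passes, assuming failures are independent.

    Under independence the joint rate is the product of the individual rates,
    which is the multiplicative accumulation described in the abstract.
    """
    return math.prod(per_constraint_pass)


def two_regime_accuracy(k, start, rate, floor):
    """ASSUMED three-parameter form (not taken from the paper):
    exponential decay from `start` at k=1 toward an asymptotic `floor`.
    """
    return floor + (start - floor) * math.exp(-rate * (k - 1))


def phi_coefficient(a, b):
    """Phi coefficient between two equal-length binary (0/1) outcome lists."""
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = sum(1 for x, y in zip(a, b) if not x and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    # Return 0.0 when a margin is empty and phi is undefined.
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0


if __name__ == "__main__":
    # Illustrative inputs only; these are not results from the paper.
    print(joint_pass_rate([0.9] * 6))
    print(two_regime_accuracy(6, start=0.9, rate=0.5, floor=0.16))
    print(phi_coefficient([1, 1, 0, 0], [1, 0, 1, 0]))
```

With independent failures, a $\varphi$ near zero between constraint pairs is what one would expect, whereas a value near 1 (as reported for vision) indicates that constraints tend to fail together.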
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 113