Abstract: Large Language Models (LLMs) are commonly evaluated on instruction-following tasks with explicitly specified constraints, where difficulty typically arises from complex reasoning or long input contexts. In this work, we identify and study a distinct and largely unexplored challenge: the identification of constraint relevance. We introduce Large-Scale Constraint Generation (LSCG), a setting in which an LLM is given a large pool of valid constraints that apply in general, but must autonomously determine which subset of them is binding for a specific task instance, and then complete the task while satisfying only those constraints.
Unlike in existing benchmarks, the applicable constraints are not explicitly enumerated, isolating a fundamental capability required for realistic, autonomous LLM deployment. To investigate this problem, we propose two tasks: Word Checker, which isolates constraint identification by requiring models to detect violations from long lists of forbidden words, and Language Moderator, which extends this setting to constrained generation. We evaluate multiple model families (LLaMA and R1), model sizes (8B and 70B), and inference strategies (simple prompting, Chain-of-Thought, and Best-of-N) under increasing numbers of candidate constraints. Our results show that performance degrades sharply as constraint lists grow, dropping to near-random accuracy even for large models, indicating that scaling alone does not resolve this failure mode. Inspired by retrieval-augmented methods, we introduce FoCusNet (Focused Constraints Net), a lightweight preprocessing model that filters constraint lists to likely relevant candidates, yielding consistent accuracy improvements. Our findings highlight the identification of constraint relevance as a fundamental and under-explored dimension of instruction following.
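To make the Word Checker setting and the filter-then-solve idea concrete, the following is a minimal illustrative sketch, not the method evaluated in this work. It assumes that a violation means an exact, case-insensitive word match, and it uses a crude shared-prefix heuristic as a stand-in for a relevance filter; FoCusNet itself is a learned model whose architecture is not described here. The names `tokenize`, `violated_words`, and `filter_constraints` are hypothetical.

```python
import re
from typing import List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


def violated_words(sentence: str, forbidden: List[str]) -> List[str]:
    """Word Checker ground truth (assumed): forbidden words that occur
    in the sentence as exact, case-insensitive tokens, in list order."""
    tokens = tokenize(sentence)
    return [w for w in forbidden if w.lower() in tokens]


def filter_constraints(sentence: str, forbidden: List[str],
                       prefix_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a relevance filter: keep only forbidden
    words that share a prefix of length prefix_len with some token of the
    sentence. A learned filter would replace this heuristic."""
    prefixes = {t[:prefix_len] for t in tokenize(sentence)}
    return [w for w in forbidden if w.lower()[:prefix_len] in prefixes]


forbidden = ["apple", "running", "zebra", "market"]
sentence = "She runs to the market every morning."

# Stage 1: shrink the constraint pool to likely relevant candidates.
candidates = filter_constraints(sentence, forbidden)

# Stage 2: check violations against the reduced pool only.
print(candidates)
print(violated_words(sentence, candidates))
```

In this toy run the filter keeps `running` and `market` and discards the other two words, so the downstream check (here a string match, in the setting above an LLM) sees a shorter list. The filter is deliberately permissive: it may keep irrelevant candidates, but under the exact-match assumption it cannot drop a word that is actually violated.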
Submission Length: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Alessandro_Sperduti1
Submission Number: 6031