Keywords: Large Language Models, Operations Research, Optimization, Benchmarking, Synthetic Data Generation, Mathematical Reasoning
Abstract: Operations research (OR)-style modeling poses challenges for large language models (LLMs): it requires long-context consistency, precise mathematical formulations, and the ability to infer implicit constraints. To study these challenges under controlled conditions, we build a verifiable synthetic pipeline that generates large-scale certified optimization problem instances.
Using this pipeline, we obtain several insights. First, direct natural language translation of optimization problems runs into an \emph{effective context limit}, beyond which frontier models abruptly fail to maintain global variable–constraint consistency, even though the input remains within the nominal context window. Second, naive divide-and-conquer scaling strategies struggle due to context explosion and semantic fragmentation. Third, while frontier models can reliably infer high-level optimization structure, they struggle to correctly bind large, dense numerical data to variables at scale. Taken together, these findings identify important limitations of current LLM-based optimization approaches. For example, we synthesize an OR task on which GPT-5 nano has an effective reasoning context limit of only $\sim$2,000 tokens and suffers a performance drop of more than 50\%.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 47