NLMOptimizer: A neurosymbolic framework and benchmark for operations research optimization problems from natural language
Keywords: Optimization, Symbolic Representation, NLP, Operations Research, Modeling
TL;DR: NLMOptimizer introduces a dataset generator and a framework for representing optimization problems, addressing the inadequacies of existing datasets for training and evaluating model performance.
Abstract: Large Language Models (LLMs) are increasingly applied to structured reasoning tasks, but remain prone to generating outputs that are both syntactically coherent and semantically invalid, posing a serious challenge for the domain of mathematical optimization.
In particular, applications to operations research (OR) problems, where problem descriptions are often ambiguous, context-rich, and semantically dense, are compromised by these issues and by a dearth of publicly available datasets designed for both training and benchmarking model performance. In this paper, we address these issues by first introducing \textbf{NLMOptimizer}, a neurosymbolic framework built on two classes: (i) the \textbf{Problem} class, which systematically generates optimization problems; and (ii) the \textbf{SymInterchange} class, an exploratory suite of neurosymbolic methods for mapping word problems into structured, solver-executable forms. We then address the scarcity of plausibly complex OR problems with the associated NLMOptimizer dataset, generated using \textbf{Problem}, which pairs structured natural-language descriptions with solver-checked mathematical programs, spanning 1000 linear (LP) and quadratic (QP) programs of integer, mixed-integer, and continuous types. We evaluate four instruction-tuned LLMs (LLaMa-3.3, LLaMa-4-Scout, Gemini-1.5-Pro, GPT-OSS-120B) under zero-shot prompting and observe substantial degradation on our dataset, with the strongest model dropping from 66.6\% end-to-end accuracy on the NL4OPT benchmark to 14.6\% on NLMOptimizer. Our results indicate that (i) widely used benchmarks understate the difficulty of mapping natural language to formal optimization structure, (ii) current LLMs struggle to represent even modestly more complex problems than three-variable LPs, and (iii) progress will require methods that directly target representational fidelity without training models to fit fixed examples.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 7886