OptiBench: Benchmarking Large Language Models in Optimization Modeling with Equivalence-Detection Evaluation

ICLR 2025 Conference Submission 13862 Authors

28 Sept 2024 (modified: 13 Oct 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: LLM, benchmark, AI for OR, optimization modeling, autonomous mathematical formulation
TL;DR: We introduce a comprehensive benchmark for assessing LLMs' optimization modeling ability, together with an evaluation method that carries theoretical correctness guarantees.
Abstract: In operations research (OR), formulating optimization problems in industrial applications is often time-consuming and requires specialized expertise. Recently, large language models (LLMs) have shown remarkable potential for automating this process. However, evaluating the performance of LLMs in optimization modeling remains challenging due to the scarcity of suitable datasets and rigorous evaluation methodologies. To narrow this gap, we introduce OptiBench, a new benchmark designed to assess LLMs' ability to formulate linear programming (LP) and mixed-integer linear programming (MILP) models. OptiBench provides a diverse dataset of 816 optimization modeling word problems spanning 16 problem classes and over 80 practical domains, and it adopts a model-data separation format with two levels of description abstraction. Compared with traditional textbook examples, the dataset better reflects the complexity of real-world optimization problems. OptiBench also incorporates a new evaluation method based on a modified Weisfeiler-Lehman graph isomorphism test (WL-test). We theoretically prove that this method correctly judges whether two models are equivalent, setting a new standard for automatically validating the correctness of optimization modeling. We benchmark various LLMs using OptiBench and observe significant performance differences. GPT-4o with direct prompting achieves 49.39% overall accuracy, outperforming other models and LLM-based agents, including OpenAI o1 (preview and mini). Notably, GPT-4o's performance varies widely across problem classes, exceeding 90% accuracy on the knapsack class but falling below 5% on the traveling salesman class. These findings provide new insights into the strengths and limitations of LLMs in optimization modeling.
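To give a flavor of the evaluation idea, here is a minimal sketch of WL-style equivalence checking: each model is encoded as a labeled bipartite variable-constraint graph, and two models are compared via color-refinement fingerprints. Everything below is an illustrative assumption, not the authors' algorithm: the graph encoding, the label choices, and the `wl_fingerprint` helper are hypothetical, and plain 1-WL refinement is only a necessary condition for isomorphism, whereas the paper describes a modified test with a proven correctness guarantee.

```python
# Hypothetical sketch of WL-based model-equivalence checking.
# NOT the paper's modified WL-test; encoding and labels are assumptions.
from collections import Counter
import hashlib

def wl_fingerprint(var_labels, con_labels, edges, rounds=3):
    """var_labels: {var: label}, con_labels: {constraint: label},
    edges: {(var, constraint): coefficient}. Returns a hashable fingerprint."""
    # Initial colors: node kind plus its label (e.g. objective coefficient and
    # variable type for variables; sense and right-hand side for constraints).
    colors = {**{v: ("V", lab) for v, lab in var_labels.items()},
              **{c: ("C", lab) for c, lab in con_labels.items()}}
    adj = {n: [] for n in colors}
    for (v, c), w in edges.items():
        adj[v].append((c, w))
        adj[c].append((v, w))
    for _ in range(rounds):
        new = {}
        for n, col in colors.items():
            # Refine each node's color from its own color and the sorted
            # multiset of (neighbor color, edge coefficient) pairs.
            neigh = sorted((colors[m], w) for m, w in adj[n])
            sig = repr((col, neigh)).encode()
            new[n] = hashlib.sha256(sig).hexdigest()[:16]
        colors = new
    # A model's fingerprint is the multiset of final node colors.
    return tuple(sorted(Counter(colors.values()).items()))

# Two formulations of "max x + 2y s.t. x + y <= 1" with variables renamed
# are judged equivalent, since renaming does not change the fingerprint.
m1 = ({"x": (1.0, "cont"), "y": (2.0, "cont")},
      {"c": ("<=", 1.0)},
      {("x", "c"): 1.0, ("y", "c"): 1.0})
m2 = ({"a": (2.0, "cont"), "b": (1.0, "cont")},
      {"k": ("<=", 1.0)},
      {("a", "k"): 1.0, ("b", "k"): 1.0})
assert wl_fingerprint(*m1) == wl_fingerprint(*m2)
```

The appeal of a fingerprint-style check over solver-based spot checks is that it compares model structure directly, so two formulations can be matched without enumerating feasible points or solving either model.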
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13862