TL;DR: We present a bidirectional data synthesis framework, termed OptMATH, for constructing datasets for optimization modeling tasks. Experimental results show that models trained on OptMATH achieve better performance on various benchmarks.
Abstract: Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs' robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods.
To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), the framework automatically generates problem data (PD) of controllable complexity. A back-translation step is then employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. A collection of rejected pairs is then identified and further filtered; this collection serves as a new benchmark for optimization modeling, containing difficult instances whose NL descriptions are much longer than those of NL4OPT and MAMO.
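The generate, back-translate, forward-model, and rejection-sampling steps can be sketched as a simple loop. This is a minimal illustration, not the paper's implementation: all function names below (`generate_instance`, `back_translate`, `forward_model`, `solutions_match`) are hypothetical stand-ins, and the toy bodies replace what would be LLM calls and solver checks in the real pipeline.

```python
# Hedged sketch of the OptMATH-style synthesis loop with toy stand-ins.

def generate_instance(mf):
    """Stand-in: attach problem data (PD) of chosen complexity to a formulation."""
    return {"formulation": mf, "data": [1, 2, 3]}

def back_translate(mf, pd):
    """Stand-in: produce a natural-language description (NL) from MF and PD."""
    return f"Maximize the objective of {mf} with coefficients {pd['data']}"

def forward_model(nl):
    """Stand-in: re-derive problem data from the NL description."""
    return {"data": [1, 2, 3]}  # a real pipeline would parse NL with an LLM

def solutions_match(pd, pd_hat):
    """Stand-in: in practice, compare solver results on both formulations."""
    return pd["data"] == pd_hat["data"]

def synthesize(seed_formulations, max_attempts=3):
    """Rejection sampling: accept an (NL, PD) pair only if forward modeling
    of the NL reproduces the original PD within max_attempts tries."""
    accepted, rejected = [], []
    for mf in seed_formulations:
        pd = generate_instance(mf)
        nl = back_translate(mf, pd)
        ok = any(solutions_match(pd, forward_model(nl))
                 for _ in range(max_attempts))
        (accepted if ok else rejected).append((nl, pd))
    return accepted, rejected

accepted, rejected = synthesize(["LP seed"])
```

Pairs that land in `rejected` are not discarded outright; after further filtering, they form the harder benchmark split described above.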
Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach. The OptMATH dataset and related resources are available at \url{https://github.com/optsuite/OptMATH}.
Lay Summary: Imagine you want to find the best way to do something, like the most efficient route for deliveries or the most profitable way to schedule tasks. Describing these problems in everyday language is easy for us, but computers need precise mathematical instructions to solve them. Teaching computers to make this translation from words to math is a big challenge because they need many good examples to learn from, and these are often hard to come by.
Our work introduces a new system called OptMATH that cleverly creates large numbers of high-quality examples for this purpose. OptMATH starts with the core mathematical structure of a problem and then generates a relatable word description for it. Afterwards, it carefully checks if this word description accurately reflects the underlying math.
The correctly matched examples become excellent training material for artificial intelligence (AI) systems. Interestingly, problems that were initially mismatched but still valid are collected and refined to create a new set of tough test questions for these AIs. Our experiments show that AI systems trained with the examples generated by OptMATH become significantly better at understanding everyday problem descriptions and setting them up for solution.
Primary Area: Optimization
Keywords: Optimization, LLM, Optimization Modeling, Synthetic Data
Submission Number: 9731