Keywords: LLM evaluation, LLM Agent, Large-Scale Search Space Optimization
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in solving a wide range of tasks. However, their ability to iteratively optimize complex solutions by learning from previous feedback remains underexplored. To address this gap, we introduce \textbf{OPT-BENCH}, a comprehensive benchmark designed to evaluate LLM agents on large-scale search space optimization problems. OPT-BENCH includes 20 real-world machine learning tasks sourced from Kaggle and 10 classical NP problems, providing a diverse and challenging environment for assessing LLMs on iterative reasoning and solution refinement.
To facilitate rigorous evaluation, we present \textbf{OPT-Agent}, an end-to-end optimization framework that emulates human reasoning by generating, validating, and iteratively improving solutions using historical feedback. Extensive experiments on 17 state-of-the-art LLMs from 7 model families, covering reasoning models, general-purpose models, and open-source models ranging from 3B to 72B parameters, show that incorporating historical context significantly improves optimization performance on both ML and NP tasks. However, this benefit remains limited: even the latest models still fall short of human expert performance.
All datasets, code, and evaluation tools will be open-sourced to foster further research on LLM-driven optimization and iterative reasoning.
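For readers unfamiliar with this style of agent, the following is a minimal sketch of the generate-validate-refine loop described in the abstract. It is an illustration under our own assumptions, not OPT-Agent's actual implementation or API: the names propose, evaluate, Attempt, and optimize are hypothetical; in practice, propose would be backed by an LLM prompted with the task and its attempt history, and evaluate by the ML or NP task's validator and metric.

from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Attempt:
    solution: str   # candidate solution (e.g. a training script or an NP-problem assignment); hypothetical field names
    valid: bool     # whether the candidate passed validation
    score: float    # task metric (higher is better in this sketch)
    feedback: str   # validator output / error message, replayed as history

def optimize(propose: Callable[[str, List[Attempt]], str],
             evaluate: Callable[[str], Tuple[bool, float, str]],
             task_description: str,
             max_iters: int = 10) -> Optional[Attempt]:
    """Generate -> validate -> refine loop conditioned on accumulated feedback (illustrative sketch)."""
    history: List[Attempt] = []
    best: Optional[Attempt] = None
    for _ in range(max_iters):
        # In a real agent, `propose` would prompt an LLM with the task and the history of attempts.
        solution = propose(task_description, history)
        # In a real agent, `evaluate` would run the task validator or score the resulting ML pipeline.
        valid, score, feedback = evaluate(solution)
        attempt = Attempt(solution, valid, score, feedback)
        history.append(attempt)  # keep every past attempt as context for the next proposal
        if valid and (best is None or score > best.score):
            best = attempt       # track the best valid solution found so far
    return best

Under these assumptions, dropping the history argument from the propose call reduces the loop to repeated single-shot generation, which corresponds to the no-history condition that the abstract's comparison of historical versus non-historical context would contrast against.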
Primary Area: datasets and benchmarks
Submission Number: 9242