Do LLMs Understand Constraint Programming? Zero-Shot Constraint Programming Model Generation Using LLMs

Published: 04 Apr 2025, Last Modified: 09 Jun 2025 · LION19 2025 · CC BY 4.0
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Tracks: Main Track
Keywords: Constraint Programming, Large Language Models, Benchmark, MiniZinc
Abstract: Large language models (LLMs) have gained significant attention for their ability to solve complex tasks such as coding and reasoning. In this work, we aim to evaluate their ability to generate constraint programming (CP) models in a zero-shot setting, emphasizing model correctness and conformity to user-specified output formats. We propose a novel, iterative approach for zero-shot CP modeling that translates natural language problem descriptions into valid CP models and supports solution extraction to pre-defined output formats to facilitate effective adoption by domain experts and enable automated performance evaluation. To evaluate our approach, we introduce the Constraint Programming Evaluation (CPEVAL) benchmark, derived from a diverse set of CP problems in CSPLib, coupled with an automated evaluation suite for large-scale assessment. We augment CPEVAL with paraphrased variants to assess robustness across linguistic variation and mitigate bias in the evaluation due to data memorization. Our extensive experiments across eight prominent LLMs and two CP modeling languages, MiniZinc and PyCSP3, show that our proposed iterative Two-Step method significantly enhances model correctness and conformity to user-specified output formats. Furthermore, we observe that larger LLMs demonstrate superior performance, with DeepSeek-R1 emerging as the top performer across both CP languages. We also observe that LLMs generally perform better in MiniZinc than in PyCSP3.
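Illustration (not taken from the submission): the kind of artifact the abstract describes is a complete CP model, generated from a natural-language description, whose output obeys a pre-specified format so that solutions can be checked automatically. A minimal MiniZinc sketch for the classic N-Queens problem from CSPLib, with a fixed one-line output format chosen here purely as a hypothetical example, might look as follows.

```minizinc
% Hypothetical example of a generated MiniZinc model (N-Queens).
% The output format below stands in for a user-specified format;
% the paper's actual formats and prompts are not reproduced here.
include "alldifferent.mzn";

int: n = 8;                          % board size
array[1..n] of var 1..n: q;          % q[i] = column of the queen in row i

constraint alldifferent(q);                          % distinct columns
constraint alldifferent([q[i] + i | i in 1..n]);     % distinct / diagonals
constraint alldifferent([q[i] - i | i in 1..n]);     % distinct \ diagonals

solve satisfy;

% Fixed, machine-checkable output line, e.g. "queens = [4, 2, 7, 3, 6, 8, 5, 1]"
output ["queens = ", show(q), "\n"];
```

Conformity to such a format is what allows the automated evaluation suite described in the abstract to parse and verify solutions at scale, independently of how each LLM phrases the rest of its response.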
Submission Number: 75