Keywords: Large Language Models, Formal Methods, Benchmark Generation, Evaluation Framework
Abstract: Recent advances in large language models (LLMs) have led to impressive performance on a wide range of mathematical benchmarks.
Yet a critical challenge remains: systematic generalization, the ability to reason correctly about novel combinations and unseen contexts.
We present a formal methods-based evaluation framework called FORGE for rigorously probing the systematic generalization abilities of LLMs.
Our approach automatically synthesizes formal benchmarks from traditional datasets, ensuring that evaluation instances are both novel and valid.
Each formal benchmark is verified for correctness and well-posedness through formal methods.
We further introduce a formally grounded difficulty metric and a stepwise prompting method to support rigorous evaluation.
Finally, we generate multiple benchmarks and perform online evaluation across repeated runs, ensuring novel combinations and unseen contexts every time.
Experimental results reveal a dramatic accuracy drop in top-performing LLMs, highlighting critical weaknesses in their systematic generalization.
Moreover, our analysis shows that this decline persists after controlling for problem hardness and repeated randomization, indicating that our framework not only mitigates contamination but also provides a principled scale for reasoning difficulty.
Primary Area: datasets and benchmarks
Submission Number: 5607