Abstract: Test-time scaling techniques offer a promising direction for improving the reasoning abilities of LLMs by searching the reasoning space with a score function. Although test-time scaling methods have been widely studied for math reasoning tasks, the inference scaling capabilities of LLMs for commonsense reasoning remain largely underexplored. In this work, we examine the scalability of inference scaling techniques for commonsense reasoning by using a pretrained entailment verifier model as the score function. We also propose a new inference scaling method, called CORE-EVO, which integrates evolutionary search with LLMs. CORE-EVO addresses a shortcoming of best-of-N and self-consistency, namely their tendency to settle on reasoning paths stuck in local optima of the reasoning space, by performing evolutionary operations and population refinement guided by the entailment verification score. Experimental results on the CommonsenseQA, PIQA, and SocialIQA benchmarks show that our method scales inference compute more effectively than other test-time scaling techniques at high inference budgets. Notably, our method outperforms best-of-N and self-consistency by significant margins, about $4\%$ and $3\%$ respectively, in average performance with the Llama3.1-8B language model.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: commonsense QA, reasoning, inference scaling
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2242
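The abstract describes CORE-EVO only at a high level (evolutionary search over reasoning chains, scored by an entailment verifier, with population refinement). Below is a minimal, hedged sketch of such a loop under assumed interfaces; the callables `generate`, `mutate`, and `verify` are hypothetical placeholders, not the paper's actual components or hyperparameters.

```python
# Sketch of evolutionary test-time search over reasoning chains, scored by an
# entailment verifier. Interfaces are assumed for illustration, not the paper's.
import random
from typing import Callable, Tuple


def evolutionary_search(
    question: str,
    generate: Callable[[str], str],       # LLM: question -> candidate reasoning chain
    mutate: Callable[[str, str], str],    # LLM: (question, chain) -> revised chain
    verify: Callable[[str, str], float],  # entailment verifier: (question, chain) -> score
    population_size: int = 8,
    generations: int = 4,
    keep_top: int = 4,
) -> Tuple[str, float]:
    # Initialize the population with independently sampled reasoning chains.
    population = [generate(question) for _ in range(population_size)]

    for _ in range(generations):
        # Score each chain with the entailment verifier and rank the population.
        ranked = sorted(population, key=lambda c: verify(question, c), reverse=True)
        # Population refinement: keep the top-scoring chains ...
        survivors = ranked[:keep_top]
        # ... and refill the population by mutating randomly chosen survivors.
        children = [
            mutate(question, random.choice(survivors))
            for _ in range(population_size - keep_top)
        ]
        population = survivors + children

    best = max(population, key=lambda c: verify(question, c))
    return best, verify(question, best)
```

This generic loop only illustrates how a verifier score can drive selection and refinement at inference time; the concrete evolutionary operations and refinement strategy used by CORE-EVO are defined in the paper itself.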