Abstract: Test-time scaling techniques offer a promising direction for improving the reasoning abilities of LLMs by searching the reasoning space with a score function. Although test-time scaling methods have been widely studied for math reasoning tasks, the inference scaling capabilities of LLMs for commonsense reasoning remain largely underexplored. In this work, we examine the scalability of inference scaling techniques for commonsense reasoning by using a pretrained entailment verifier model as the score function. We also propose a new inference scaling method, called CORE-EVO, which integrates evolutionary search with LLMs. CORE-EVO addresses a shortcoming of best-of-N and self-consistency, namely their tendency to settle on reasoning paths stuck in local optima of the reasoning space, by performing evolutionary operations and population refinement guided by the entailment verification score. Experimental results on the CommonsenseQA, PIQA, and SocialIQA benchmarks show that our method scales inference compute more effectively than other test-time scaling techniques at high inference budgets. Notably, our method outperforms best-of-N and self-consistency by significant margins, about $4\%$ and $3\%$ respectively, in average performance with the Llama3.1-8B language model.
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: commonsense QA, reasoning, inference scaling
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2242
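The abstract describes CORE-EVO only at a high level (evolutionary search over reasoning chains, scored by an entailment verifier, with population refinement). Below is a minimal, hedged sketch of such a loop under assumed interfaces; the callables `generate`, `mutate`, and `verify` are hypothetical placeholders, not the paper's actual components or hyperparameters.

```python
# Sketch of evolutionary test-time search over reasoning chains, scored by an
# entailment verifier. Interfaces are assumed for illustration, not the paper's.
import random
from typing import Callable, Tuple


def evolutionary_search(
    question: str,
    generate: Callable[[str], str],       # LLM: question -> candidate reasoning chain
    mutate: Callable[[str, str], str],    # LLM: (question, chain) -> revised chain
    verify: Callable[[str, str], float],  # entailment verifier: (question, chain) -> score
    population_size: int = 8,
    generations: int = 4,
    keep_top: int = 4,
) -> Tuple[str, float]:
    # Initialize the population with independently sampled reasoning chains.
    population = [generate(question) for _ in range(population_size)]

    for _ in range(generations):
        # Score each chain with the entailment verifier and rank the population.
        ranked = sorted(population, key=lambda c: verify(question, c), reverse=True)
        # Population refinement: keep the top-scoring chains ...
        survivors = ranked[:keep_top]
        # ... and refill the population by mutating randomly chosen survivors.
        children = [
            mutate(question, random.choice(survivors))
            for _ in range(population_size - keep_top)
        ]
        population = survivors + children

    best = max(population, key=lambda c: verify(question, c))
    return best, verify(question, best)
```

This generic loop only illustrates how a verifier score can drive selection and refinement at inference time; the concrete evolutionary operations and refinement strategy used by CORE-EVO are defined in the paper itself.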