Keywords: Chain-of-Thought, entropy-guided, fine-tuning, reasoning, filtering
Abstract: Chain-of-Thought (CoT) prompting has significantly enhanced the mathematical reasoning capabilities of Large Language Models.
We find that existing fine-tuning datasets frequently suffer from the "answer right but reasoning wrong" problem, where correct final answers are derived from hallucinated, redundant, or logically invalid intermediate steps.
This paper proposes EntroCoT, a unified framework for automatically identifying and refining low-quality CoT supervision traces.
EntroCoT first uses an entropy-based mechanism to segment each reasoning trace into steps at uncertain junctures, and then applies a Monte Carlo rollout-based mechanism to evaluate the marginal contribution of each step.
By accurately filtering deceptive reasoning samples, EntroCoT constructs a high-quality dataset where every intermediate step in each reasoning trace facilitates the final answer.
Extensive experiments on mathematical benchmarks demonstrate that fine-tuning on the subset constructed by EntroCoT consistently outperforms the full-dataset supervision baseline.
Our code is available in the "software" appendix.
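The two mechanisms in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the entropy threshold, the rollout count, and the `solve_prob` stand-in (which abstracts "roll out the model from a prefix and check the final answer") are all hypothetical placeholders.

```python
def segment_by_entropy(token_entropies, threshold=2.0):
    """Split a reasoning trace into steps at high-entropy junctures.

    `token_entropies` is a list of per-token predictive entropies
    (assumed to come from the model's output distribution). A step
    boundary is placed after any token whose entropy exceeds
    `threshold` -- a hypothetical cutoff for illustration only.
    Returns a list of steps, each a list of token indices.
    """
    steps, current = [], []
    for i, h in enumerate(token_entropies):
        current.append(i)
        if h > threshold:
            steps.append(current)
            current = []
    if current:
        steps.append(current)
    return steps


def marginal_contribution(solve_prob, prefix_steps, step, n_rollouts=16):
    """Estimate a step's marginal value via Monte Carlo rollouts.

    `solve_prob(steps)` is a stand-in for completing the solution from
    the given step prefix and scoring whether the final answer is
    correct (1) or not (0). The step's contribution is the change in
    estimated success rate when it is appended to the prefix.
    """
    with_step = sum(solve_prob(prefix_steps + [step]) for _ in range(n_rollouts))
    without = sum(solve_prob(prefix_steps) for _ in range(n_rollouts))
    return (with_step - without) / n_rollouts
```

A step whose removal does not lower the estimated success rate contributes nothing and would be a candidate for filtering.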
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Modeling, Efficient/Low-Resource Methods for NLP, Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 4757