Breaking the Chain: Direct Prompting’s Unexpected Advantage in CommonSense Reasoning

ACL ARR 2026 January Submission3145 Authors

04 Jan 2026 (modified: 20 Mar 2026) · License: CC BY 4.0
Keywords: Chain-of-Thought, prompting strategies, uncertainty quantification, entropy, calibration, commonsense reasoning, cognitive QA, self-correction, large language models, computational efficiency
Abstract: We investigate entropy-based uncertainty quantification in large language models on TinyCommonSenseQA tasks, comparing direct (non-CoT) answers with Chain-of-Thought (CoT) reasoning. Through experiments on 52 carefully curated TinyCommonSenseQA questions using GPT-4o and fine-tuning studies on Qwen 2.5-7B, we find that: (1) Direct (non-CoT) answers exhibit superior calibration, with significantly stronger entropy separability between correct and incorrect answers than CoT, indicating better uncertainty quantification for primitive reasoning tasks. (2) Direct answers achieve 94% of CoT's accuracy while using 3.7× fewer tokens, a substantial computational advantage. (3) Both strategies converge to identical post-training performance, exposing the absence of specialized primitive-reasoning optimization in current RL-based post-training pipelines. (4) Sequential scaling reveals self-correction capabilities in direct answers (+3.8% improvement), signaled by entropy increases during corrections, while CoT remains static. (5) Supervised fine-tuning followed by reinforcement learning (SFT→RL) significantly outperforms RL cold-start approaches, highlighting the importance of format familiarization in reasoning enhancement. These findings challenge CoT's universal superiority and motivate task-dependent strategy selection for large language models.
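The entropy separability the abstract describes can be illustrated with a minimal sketch, assuming access to the model's probability distribution over the multiple-choice answer options (the per-question records and thresholds below are hypothetical, not taken from the paper): compute the Shannon entropy of each answer distribution, then compare the mean entropy of correct versus incorrect answers. A large positive gap means the model's uncertainty separates right from wrong answers, which is the calibration signal the paper measures.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical per-question records: the model's probability distribution
# over five answer options, plus whether its top choice was correct.
records = [
    {"probs": [0.90, 0.04, 0.03, 0.02, 0.01], "correct": True},
    {"probs": [0.85, 0.08, 0.04, 0.02, 0.01], "correct": True},
    {"probs": [0.30, 0.28, 0.22, 0.12, 0.08], "correct": False},
    {"probs": [0.35, 0.30, 0.20, 0.10, 0.05], "correct": False},
]

def mean_entropy(recs, correct):
    """Mean entropy over the subset of records with the given correctness."""
    vals = [entropy(r["probs"]) for r in recs if r["correct"] == correct]
    return sum(vals) / len(vals)

# A positive gap means incorrect answers carry higher entropy, i.e. the
# model's own uncertainty discriminates correct from incorrect answers.
gap = mean_entropy(records, False) - mean_entropy(records, True)
print(f"entropy gap (incorrect - correct): {gap:.3f} nats")
```

In practice the distribution can be read off the model's log-probabilities for the answer-option tokens; the paper's finding is that this gap is markedly larger for direct answers than for CoT outputs.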
Paper Type: Short
Research Area: LLM Efficiency
Research Area Keywords: uncertainty quantification, entropy-based calibration, cognitive reasoning, self-correction, test-time compute, model evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Approaches low compute settings-efficiency, Data resources
Languages Studied: English
Submission Number: 3145