Lexical Hints of Accuracy in LLM Reasoning Chains

TMLR Paper 5694 Authors

21 Aug 2025 (modified: 21 Nov 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Fine-tuning Large Language Models (LLMs) with reinforcement learning to produce an explicit Chain-of-Thought (CoT) before answering yields models that consistently improve overall performance on code, math, and general-knowledge benchmarks. However, on benchmarks where LLMs currently achieve low accuracy, such as Humanity's Last Exam (HLE), they often report high self-confidence, reflecting poor calibration. Here, we test whether measurable properties of the CoT provide reliable signals of an LLM's internal confidence in its answers. We analyze three feature classes: (i) CoT length, (ii) intra-CoT sentiment volatility, and (iii) lexical hints, including hedging words. Using DeepSeek-R1, Claude 3.7 Sonnet, and Qwen-235B-Think on HLE, a frontier benchmark with very low accuracy, Omni-MATH, a saturated benchmark of moderate difficulty, and GPQA-Diamond, a graduate-level scientific reasoning benchmark, we find that lexical markers of uncertainty (e.g., \textit{guess}, \textit{stuck}, \textit{hard}) in the CoT are the strongest indicators of an incorrect response, while shifts in CoT sentiment provide a weaker but complementary signal. CoT length is informative only on Omni-MATH and GPQA, where accuracy is already high ($\approx 70\%$), and carries no signal on the harder HLE ($\approx 9\%$), indicating that CoT length predicts correctness only on intermediate-difficulty benchmarks, i.e., within the model's demonstrated capability but below saturation. Finally, we find that uncertainty indicators in the CoT are consistently more salient than high-confidence markers, making errors easier to predict than correct responses. Our findings support a lightweight post-hoc calibration signal that complements unreliable self-reported probabilities and could support safer deployment of LLMs.
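To make the abstract's lexical signal concrete, the sketch below extracts simple hedging features from a chain of thought. It is illustrative only: the hedge-word list, the function name `hedge_features`, and the three features are placeholders chosen for the example, not the paper's lexicon or feature set.

```python
# Minimal sketch: count hedging markers in a CoT and report a hedge rate.
# The word list below is an illustrative placeholder, not the paper's lexicon.
import re
from collections import Counter

HEDGE_WORDS = {"guess", "stuck", "hard", "unsure", "maybe", "probably"}  # illustrative

def hedge_features(cot_text: str) -> dict:
    """Return simple lexical-uncertainty features for one chain of thought."""
    tokens = re.findall(r"[a-z']+", cot_text.lower())
    n_hedges = sum(Counter(t for t in tokens if t in HEDGE_WORDS).values())
    return {
        "n_tokens": len(tokens),                      # CoT length in tokens
        "n_hedges": n_hedges,                         # total hedging markers
        "hedge_rate": n_hedges / max(len(tokens), 1), # markers per token
    }

# Per the abstract, a higher hedge_rate would be read as a signal that the
# final answer is more likely to be wrong.
print(hedge_features("I'm stuck here, so I'll just guess that the answer is 7."))
```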
Submission Length: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=iSSthT3ZuI&noteId=iSSthT3ZuI
Changes Since Last Submission: Summary of manuscript revisions.
$\textbf{Experimental pipeline corrected for data leakage.}$ Re-ran all analyses with proper nested cross-validation so that feature discovery (the top 25 “harmful” words) uses only training folds (see the sketch after this list). Clarified this procedure in Section 3.4 (pp. 9–10).
$\textbf{Expanded benchmark–model scope.}$ Increased the evaluation matrix from 2×2 to 3×3 by adding a new benchmark, GPQA-Diamond (graduate-level, non-math reasoning), and a new model family, Qwen-235B-Think. This broadens domain and difficulty coverage and strengthens the generalization claims.
$\textbf{Added domain-sensitivity analysis (Table 5).}$ Compared math vs. non-math subsets within HLE while jointly controlling for CoT length. Lexical-only predictors perform worse on math than on non-math questions; adding CoT length removes this gap but reduces accuracy. Results and interpretation added on p. 10.
$\textbf{Clarified the role of CoT length across benchmarks.}$ Extended the analysis to GPQA to show that CoT length predicts correctness on Omni-MATH and GPQA (intermediate difficulty) but not on HLE. Clarified this result in Figure 1 and Table 5.
$\textbf{Improved calibration reporting.}$ Added full calibration breakdowns (ECE, MacroCE, Brier score) per model × benchmark in Appendix B (p. 17), with new histograms illustrating overconfidence patterns.
$\textbf{Added lexical frequency information.}$ Reported occurrence counts for selected “harmful” and “booster” words in the lexicon table (p. 22).
$\textbf{Clarified figure conventions.}$ Updated captions (e.g., Figure 3) to explain that black rectangles indicate 95% confidence intervals from Monte Carlo binomial resampling, and improved figure readability.
$\textbf{Broader Impact section expanded.}$ Now explicitly discusses risks of evasion/gameability and over-filtering, referencing Goodhart’s law, and outlines a mitigation plan (ensembles, multilingual checks, adversarial tests).
$\textbf{Enhanced connection to prior work.}$ Expanded the discussion of prior studies on linguistic uncertainty expressions ([8], [9]) and clarified how our approach differs by analyzing naturally occurring lexical markers within chains of thought, rather than studying the impact of lexical markers inserted into an LLM's prompt or answer (p. 6, Introduction).
$\textbf{Clarified calibration metric definitions.}$ Defined Expected Calibration Error (ECE) and alternatives explicitly in Section 2.1 and referenced complementary metrics in Appendix Figure 6.
$\textbf{Addressed experimental framing and limitations.}$ Added remarks acknowledging prompt sensitivity, scale limitations, and future directions (e.g., instance-level hardness and prompt-budget analyses).
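A minimal sketch of the leakage-free setup mentioned in the first item above, under the assumption that the data is a list of CoT strings with binary correctness labels. The names `select_harmful_words`, `featurize`, and `leakage_free_cv` are illustrative, not the authors' code, and the single lexicon-count feature stands in for the paper's full feature set. The key point is that the "harmful"-word lexicon is re-discovered inside each training fold, so no test-fold information leaks into feature selection.

```python
# Sketch of fold-local lexicon discovery (illustrative, not the authors' pipeline).
import re
from collections import Counter
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def select_harmful_words(cots, labels, k=25):
    """Pick the k words whose document frequency is highest in CoTs of wrong answers."""
    wrong, right = Counter(), Counter()
    for cot, ok in zip(cots, labels):
        (right if ok else wrong).update(set(tokenize(cot)))
    score = {w: wrong[w] / (right[w] + 1) for w in wrong}
    return set(sorted(score, key=score.get, reverse=True)[:k])

def featurize(cots, lexicon):
    # One feature per CoT: how many of its tokens fall in the "harmful" lexicon.
    return np.array([[sum(t in lexicon for t in tokenize(c))] for c in cots], dtype=float)

def leakage_free_cv(cots, labels, n_splits=5):
    labels = np.asarray(labels)
    accs = []
    for tr, te in KFold(n_splits=n_splits, shuffle=True, random_state=0).split(cots):
        # Lexicon is discovered on the training fold only, then frozen for evaluation.
        lexicon = select_harmful_words([cots[i] for i in tr], labels[tr])
        clf = LogisticRegression(max_iter=1000)
        clf.fit(featurize([cots[i] for i in tr], lexicon), labels[tr])
        accs.append(clf.score(featurize([cots[i] for i in te], lexicon), labels[te]))
    return float(np.mean(accs))
```

A fuller nested scheme would additionally tune any hyperparameters with an inner split of each training fold before scoring on the held-out fold.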
Assigned Action Editor: ~Hanie_Sedghi1
Submission Number: 5694