DeepSeek-R1 Thoughtology: Let’s think about LLM reasoning

TMLR Paper5408 Authors

17 Jul 2025 (modified: 02 Oct 2025) · Under review for TMLR · CC BY 4.0
Abstract: Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly “thinking” about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1’s basic building blocks of reasoning, our analyses of DeepSeek-R1 investigate the impact and controllability of thought length, the management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show that DeepSeek-R1 has a ‘sweet spot’ of reasoning, beyond which extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission:

### **Measuring rumination**
As suggested by reviewer Q4pz, we have carried out additional experiments to quantify rumination. In Section 3.3, we define the rumination rate using two measures: (i) n-gram repetition rate and (ii) lexical diversity. We show that rumination rises with task complexity (Fig. 3.6) and that it varies widely across tasks, independently of raw processing time or time spent in reconstruction cycles (Fig. 3.3b). In Section 4, we show that incorrect responses have a higher rumination rate than correct responses on math reasoning tasks (Fig. 4.4), indicating that rumination may hurt performance and might be one of the reasons why overly long CoTs fail to reach the correct answer. In Section 9, we show that rumination is higher for psycholinguistic stimuli than for controls (Fig. 9.3).

### **Note discussing LRM and human reasoning dissimilarity**
As suggested by reviewer yXpG, we have added a note at the end of Section 9.3 speculating about a potential reason for the divergence between human reasoning and LRM reasoning in the garden-path and comparative-illusion experiments.

### **More samples in long-context reasoning analysis**
As suggested by reviewer Q4pz, we increased the sample size for the long-context reasoning analysis on the CHASE dataset in Section 5.2, which strengthens our findings.
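For concreteness, here is a minimal sketch of the two rumination measures as they are commonly defined: an n-gram repetition rate (share of n-grams that repeat an earlier n-gram) and a type-token ratio as the lexical-diversity proxy. The exact tokenisation, n-gram order, and diversity formula used in the paper may differ, so this illustrates the idea rather than the paper's implementation:

```python
from collections import Counter

def ngram_repetition_rate(tokens, n=3):
    """Fraction of n-grams that are repeats of an earlier n-gram.
    Higher values suggest the reasoning chain is recycling the
    same phrasings (rumination). The choice n=3 is illustrative."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())  # occurrences beyond the first
    return repeated / len(ngrams)

def type_token_ratio(tokens):
    """Lexical diversity: unique tokens / total tokens.
    Lower values indicate more repetitive (ruminative) text."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Example: a deliberately repetitive trace scores high on repetition
# and low on lexical diversity.
trace = "wait let me try again wait let me try again maybe".split()
print(ngram_repetition_rate(trace, n=3))  # ~0.33
print(type_token_ratio(trace))            # ~0.55
```

On a repetitive trace, the repetition rate climbs while the type-token ratio falls, matching the intuition that ruminative CoTs revisit earlier problem formulations in near-identical wording.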
Assigned Action Editor: ~Mohammad_Emtiyaz_Khan1
Submission Number: 5408