Abstract: Recently, slow-thinking reasoning systems, such as o1, have demonstrated remarkable capabilities in solving complex reasoning tasks. These systems typically engage in an extended thinking process before responding to a query, allowing them to generate more thorough, accurate, and well-reasoned solutions.
These systems are primarily developed and maintained by industry, and their core techniques are not publicly disclosed. In response, a growing number of studies from the research community aim to explore the technical foundations underlying these powerful reasoning systems. To shed light on the reasoning mechanisms of LLMs, this paper presents an empirical study on implementing o1-like reasoning systems, focusing on two key questions: (1) \emph{How can an LLM learn this reasoning approach?} and (2) \emph{How can an LLM further improve its reasoning ability without additional demonstration data?}
Concretely, we first design an ``imitate, explore, and self-improve'' framework as our primary technical approach to training the reasoning model.
Then, we conduct experiments to analyze the influence of different training-instance selection strategies and backbone models, and to explore the effect of the self-improvement process.
Guided by these experimental findings, we finally train a powerful LLM that can perform complex reasoning processes and demonstrates strong performance on challenging reasoning problems. Our models and data will be publicly released.
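To make the three stages concrete, the sketch below shows one way an ``imitate, explore, and self-improve'' loop might be wired together. The abstract does not specify the training procedure, so this is purely illustrative: every helper (`finetune`, `sample_traces`, `is_correct`) is a hypothetical placeholder supplied by the caller, not an API from the paper's released code.

```python
def imitate_explore_self_improve(
    base_model,
    demo_data,            # human demonstrations of long reasoning traces
    problems,             # problems with verifiable final answers
    finetune,             # callable: (model, data) -> fine-tuned model
    sample_traces,        # callable: (model, problem, n) -> list of traces
    is_correct,           # callable: (trace, problem) -> bool
    rounds=3,
    samples_per_problem=8,
):
    """Hypothetical sketch of an imitate/explore/self-improve training loop."""
    # Imitate: supervised fine-tuning on long-form reasoning demonstrations.
    model = finetune(base_model, demo_data)

    for _ in range(rounds):
        # Explore: sample extended reasoning traces and keep verified ones.
        verified = []
        for problem in problems:
            for trace in sample_traces(model, problem, samples_per_problem):
                if is_correct(trace, problem):
                    verified.append(trace)
                    break  # one verified trace per problem suffices here

        # Self-improve: retrain on the model's own verified traces,
        # so no additional human demonstration data is required.
        model = finetune(model, demo_data + verified)

    return model
```

The key design point this sketch illustrates is that only the initial imitation stage consumes human demonstrations; subsequent rounds grow the training set from the model's own verified outputs.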
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: mathematical NLP
Contribution Types: Model analysis & interpretability, Reproduction study, Publicly available software and/or pre-trained models, Data resources, Data analysis
Languages Studied: English
Submission Number: 1759