T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present T1, a reasoning model that scales reinforcement learning by encouraging exploration, and we examine its inference scaling behavior.
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield only modest improvements in complex reasoning. In this paper, we present T1, which scales RL by encouraging exploration, and we examine its inference scaling behavior. We first initialize the LLM with synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote greater sampling diversity through over-sampling. We demonstrate that T1, built on open LLMs as its base, exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where larger inference budgets directly improve T1's performance without any additional verification. The model weights and training data are publicly available at https://github.com/THUDM/T1.
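To make the over-sampling idea in the abstract concrete, here is a minimal, hypothetical sketch: for each prompt we draw many candidate reasoning traces and normalize their rewards within the group, so that above-average samples receive positive advantage in the subsequent policy update. The sampling factor K, the stand-in sampler and verifier, and the group-normalized advantage are illustrative assumptions for exposition, not T1's published training recipe.

```python
# Sketch of over-sampling for RL exploration (hypothetical, not the authors' exact pipeline).
import random
import statistics

K = 16  # over-sampling factor per prompt (assumed hyperparameter)

def sample_response(prompt: str) -> str:
    """Stand-in for policy sampling; a real system would call the LLM."""
    return f"{prompt} -> candidate answer {random.randint(0, 3)}"

def reward(response: str, gold: str) -> float:
    """Stand-in rule-based verifier: 1.0 if the final answer matches the reference."""
    return 1.0 if response.endswith(gold) else 0.0

def group_advantages(prompt: str, gold: str):
    """Draw K candidates and normalize rewards within the group."""
    responses = [sample_response(prompt) for _ in range(K)]
    rewards = [reward(r, gold) for r in responses]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    # The policy-gradient step (not shown) would up-weight tokens of
    # above-average samples and down-weight below-average ones.
    return [(r, (rw - mean) / std) for r, rw in zip(responses, rewards)]

if __name__ == "__main__":
    for resp, adv in group_advantages("What is 1 + 2?", "candidate answer 3"):
        print(f"adv={adv:+.2f}  {resp}")
```

In this toy, increasing K yields more diverse candidate traces per prompt, which is the mechanism the abstract attributes to over-sampling for scaling RL training.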
Lay Summary: Current AI language models have limitations when tackling complex reasoning tasks like advanced mathematics. While these models can learn from examples, they typically don't improve their reasoning during the actual problem-solving process, much like a student who can't think more deeply about a problem even when given extra time. We developed T1, a training method that helps AI models explore different approaches and learn from trial and error. Our approach includes three key strategies: first, we train the model on examples that show both failed attempts and successful solutions; second, we encourage the model to generate many diverse reasoning paths during training; and third, we help it spend more time on harder problems. T1 shows an inference scaling pattern: the more time it spends "thinking" about a problem, the better its performance tends to become. On challenging mathematics competitions, T1 performs better than the previous methods we tested. While there's still much work to be done, this suggests that giving AI more reasoning time can lead to improved solutions on difficult problems.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Primary Area: Deep Learning->Large Language Models
Keywords: Language Model Reasoning, Reinforcement Learning
Submission Number: 4871