LongCoT: Benchmarking Long-Horizon Chain-of-Thought Reasoning

Sumeet Ramesh Motwani; Daniel Nichols; Charles London; Peggy Li; Fabio Pizzati; Acer Blake; Hasan Abed Al Kader Hammoud; Tavish Malcolm McDonald; Akshat Naik; Alesia Ivanova; Vignesh Baskaran; Ivan Laptev; Ruben Glatt; Tal Ben-Nun; Philip Torr; Ameya Prabhu; Brian R. Bartoldson; Bhavya Kailkhura; Christian Schroeder de Witt

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, Venue: ICML 2026 regular, Readers: Everyone, License: CC BY 4.0

TL;DR: We develop a benchmark to measure the long-horizon reasoning capabilities of frontier models

Abstract: As language models are increasingly deployed for complex autonomous tasks, their ability to reason accurately over longer horizons becomes critical. An essential component of this ability is planning and managing a long, complex chain-of-thought (CoT). We introduce LongCoT, a scalable benchmark of 2,500 expert-designed problems spanning chemistry, mathematics, computer science, chess, and logic to isolate and directly measure the long-horizon CoT reasoning capabilities of frontier models. Problems consist of a short input with a verifiable answer; solving them requires navigating a graph of interdependent steps that span tens to hundreds of thousands of reasoning tokens. Each local step is individually tractable for frontier models, so failures reflect long-horizon reasoning limitations. At release, the best models achieve <10% accuracy (GPT 5.2: 9.8%; Gemini 3 Pro: 6.1%) on LongCoT, revealing a substantial gap in current capabilities. Overall, LongCoT provides a rigorous measure of long-horizon reasoning, tracking the ability of frontier models to reason reliably over extended periods.

Lay Summary: As AI models get used for longer and more complicated tasks, a key question is whether they can keep their reasoning coherent over very long stretches of thinking, not just on short hard problems. We built LongCoT, a benchmark of 2,500 problems across chemistry, math, computer science, chess, and logic, where each question is short to read but forces the model to work through a long chain of interconnected steps, often tens or hundreds of thousands of words of reasoning. We deliberately made every individual step easy for today's best models, so when a model gets the final answer wrong it is because it lost the thread over a long stretch rather than because any single piece was too hard. The strongest model we tested solves under 10 percent of the harder problems, and most models score close to zero. They tend to plan poorly early on, let small mistakes compound, start guessing, or give up too soon, and they rarely go back to fix a wrong turn. LongCoT gives a clean and verifiable way to measure this kind of long-horizon reasoning, which matters because the most ambitious uses of AI, from scientific discovery to complex engineering, will need models that can think reliably over very long periods.
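
Illustrative note: the remark that small mistakes compound can be made concrete with a toy calculation. The sketch below is not the benchmark's scoring method or the authors' analysis. It assumes, purely for illustration, that every step in a chain succeeds independently with the same probability; the function name `chain_success_probability` and the 0.99 per-step figure are hypothetical.

```python
def chain_success_probability(per_step_accuracy: float, num_steps: int) -> float:
    """Probability that all steps in a chain are correct.

    Toy model only: assumes each step succeeds independently with the
    same probability, which real reasoning chains need not satisfy.
    """
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_accuracy ** num_steps


if __name__ == "__main__":
    # Hypothetical per-step accuracy of 99% over chains of growing length.
    for steps in (10, 100, 1000):
        print(steps, chain_success_probability(0.99, steps))
```

Under these assumptions, a step that is right 99% of the time still yields a fully correct 100-step chain only about 37% of the time, which is one way a model can be strong on every local step and still fail on the full problem.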

Originally Submitted Supplementary Material: zip

Link To Code: https://github.com/LongHorizonReasoning/longcot

Primary Area: Deep Learning->Large Language Models

Keywords: long-horizon, reasoning, language models, benchmark, chain-of-thought

Originally Submitted PDF: pdf

Submission Number: 32520
