GSM-$\infty$: How Do Your LLMs Behave over Infinitely Increasing Reasoning Complexity and Context Length?
TL;DR: We study how LLM reasoning ability decays as problems become harder and as context length increases, using a synthetic dataset generator that produces GSM8K-like problems under fine-grained control.
Abstract: Recently, long-context large language models (LLMs) have shown strong performance in information retrieval and long-document QA. However, to tackle the most challenging intellectual problems, LLMs must reason effectively in long and complex contexts (e.g., frontier mathematical research). Studying how LLMs handle increasing reasoning complexity and context length is essential, yet existing benchmarks lack a solid basis for quantitative evaluation. Inspired by the abstraction of GSM-8K problems as computational graphs—and the ability to introduce noise by adding unnecessary nodes and edges—we develop a grade-school math problem generator capable of producing arithmetic problems with infinite difficulty and context length under fine-grained control. Using our newly synthesized GSM-$\infty$ benchmark, we comprehensively evaluate existing LLMs. We find a consistent sigmoid decline in reasoning performance as complexity increases, along with a systematic inference scaling trend: exponentially increasing inference computation yields only linear performance gains. These findings underscore the fundamental limitations of current long-context LLMs and the key challenges in scaling reasoning capabilities. Our GSM-$\infty$ benchmark provides a scalable and controllable testbed for systematically studying and advancing LLM reasoning in long and complex contexts.
Lay Summary: Do we really need LLMs to have long-context ability? Retrieval-Augmented Generation (RAG) is a cheap-to-build alternative that seems powerful for general long-context tasks, whereas training a long context window can cost large companies millions of dollars. Through comprehensive evaluation, we find that on most existing long-context benchmarks, RAG matches or even surpasses the performance of long-context LLMs. However, context-level methods alone are not sufficient for agents that will one day be capable of contributing to frontier scientific discovery. We clearly need a much more difficult long-context benchmark.
In the paper, we first develop a mapping between abstract computation graphs and natural-language problems. By randomly perturbing the generation of the computation graphs, we obtain different natural-language math problems. The difficulty of a problem is defined as the number of essential steps required to solve it. In addition, the computation graph can be extended with unnecessary nodes. Because these noise nodes are semantically tied to the essential core graph, they cannot simply be retrieved away; empirically, we show that RAG struggles to filter them out. A rough sketch of this pipeline is given below.
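As a self-contained, hedged illustration of this idea (not the released generator; names such as Node, build_core_graph, and add_noise are hypothetical), the Python sketch below chains a few arithmetic operations into a core graph, uses the number of essential steps as the difficulty, attaches semantically related but unnecessary facts as noise, and verbalizes everything into one shuffled problem statement:

```python
import random

# Hedged sketch of the computation-graph-to-problem idea; class and function
# names here are hypothetical and do not come from the released GSM-Infinite code.

class Node:
    """One quantity in the computation graph."""
    def __init__(self, name, value, parents=(), op=None):
        self.name = name        # e.g. "the number of apples Alice has"
        self.value = value      # ground-truth integer, so labels are correct by construction
        self.parents = parents  # nodes this quantity is derived from (empty for given facts)
        self.op = op            # "+" or "*" for derived quantities, None for given facts

def build_core_graph(num_steps, entities):
    """Chain `num_steps` arithmetic operations; difficulty = number of essential steps."""
    nodes = [Node(f"{entities[0]}'s starting amount", random.randint(2, 9))]
    for i in range(1, num_steps + 1):
        op = random.choice(["+", "*"])
        prev = nodes[-1]
        extra = Node(f"{entities[i % len(entities)]}'s amount {i}", random.randint(2, 9))
        value = prev.value + extra.value if op == "+" else prev.value * extra.value
        nodes += [extra, Node(f"derived quantity {i}", value, parents=(prev, extra), op=op)]
    return nodes

def add_noise(core_nodes, num_noise):
    """Add unnecessary facts that reuse core entities, so retrieval alone cannot discard them."""
    noise = []
    for j in range(num_noise):
        anchor = random.choice(core_nodes)
        noise.append(Node(f"an unrelated detail {j} about {anchor.name}", random.randint(2, 9)))
    return noise

def render(nodes):
    """Verbalize every node as a sentence and shuffle, mixing core facts with noise."""
    sentences = []
    for n in nodes:
        if n.op is None:
            sentences.append(f"{n.name} is {n.value}.")
        else:
            word = "sum" if n.op == "+" else "product"
            sentences.append(f"{n.name} is the {word} of {n.parents[0].name} and {n.parents[1].name}.")
    random.shuffle(sentences)
    return " ".join(sentences)

core = build_core_graph(num_steps=5, entities=["Alice", "Bob", "Carol"])
problem_text = render(core + add_noise(core, num_noise=8))
answer = core[-1].value  # the question asks for the final derived quantity
```

In this sketch, growing num_steps raises the reasoning complexity while growing num_noise lengthens the context without adding essential information, which is the control the benchmark relies on.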
Using the generator, we can produce a large quantity of problems with guaranteed-correct labels and controllable reasoning complexity and context length. We name the resulting suite GSM-Infinite. Evaluating LLMs comprehensively on GSM-Infinite, we find that their performance decays following a sigmoid pattern as reasoning complexity increases, alongside other insights revealed by our studies.
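To make the sigmoid trend concrete, one hedged way to read it (the fit form is illustrative; the parameters $A$, $k$, and $d_0$ are not values reported in the paper) is that accuracy as a function of reasoning complexity $d$, the number of essential steps, follows a logistic curve:
\[
  \mathrm{Acc}(d) \;\approx\; \frac{A}{1 + e^{\,k\,(d - d_0)}}, \qquad k > 0,
\]
where $A$ is the near-ceiling accuracy on easy problems, $d_0$ is the complexity at which accuracy has fallen to roughly half of $A$, and $k$ controls how sharply performance collapses beyond that point.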
Link To Code: https://github.com/Infini-AI-Lab/gsm_infinite
Primary Area: Deep Learning->Large Language Models
Keywords: Long Context, Reasoning, Understanding, Benchmarks
Submission Number: 13525