The Two-Hump Problem: Bridging the Difficulty Gap in Mathematical Reinforcement Learning

Lucas Fagan; Michele Tarquini; Ali Shehper; Maksymilian Manko; Angus Gruen; Coco Huang; Giorgi Butbaia; Davide Passaro; Sergei Gukov

Published: 30 Apr 2026, Last Modified: 24 Jun 2026, ICML 2026 regular, CC BY 4.0

TL;DR: We identify a structural barrier in the 60-year-old Andrews-Curtis conjecture and use supermoves, a custom transformer architecture, and novel data generation techniques to make concrete progress.

Abstract: Mathematical search problems present a unique challenge for Reinforcement Learning (RL) due to vast search spaces and sparse rewards. Previous work established the Andrews–Curtis (AC) conjecture as an illustrative example of such problems. In this work, we identify a critical structural barrier in the AC landscape: a "Two-Hump" distribution, in which problem instances are either trivially solvable or effectively impossible, and the intermediate "hard-but-solvable" instances required for effective learning are scarce. We tackle this challenge through two primary avenues: novel data generation techniques to populate the difficulty gap, and significant algorithmic enhancements, including the introduction of supermoves and Transformer-based architectures. We demonstrate substantial performance improvements over previous baselines and release new comprehensive benchmark datasets, including **AC-19** (125,192 AC-trivial presentations of varying difficulty with length at most 19) and **AC-1M** (1,136,154 hard AC-trivial presentations of length at most 30), the first large-scale, publicly available datasets of this kind.

Lay Summary: The Andrews–Curtis conjecture, formulated in 1965, asks whether every algebraic "recipe" for the trivial group can be reduced to the simplest possible recipe using a fixed set of moves. It is a deceptively simple question that has resisted resolution for 60 years and has deep connections to 4-dimensional topology. We applied reinforcement learning (the family of techniques behind AlphaGo) to search for these reductions, but hit a "Two-Hump" obstacle: almost every example is either trivially easy or effectively impossible, leaving the learner little to grip. We address this with a specialized neural architecture, a system that generates fresh intermediate-difficulty problems, and "supermoves" that bundle many primitive moves into one. The result: we solve over 150 more problems than the previous state of the art, reduce a famous family of 550 unsolved cases to 261, release two large datasets, and provide a template for using machine learning on other mathematical search problems with the same difficulty structure.
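To make the "fixed set of moves" and the idea of bundling them concrete, here is a minimal Python sketch. It assumes the standard textbook form of the AC moves on a two-generator, two-relator presentation (multiply one relator by the other, invert a relator, conjugate a relator by a generator). The encoding of words as integer tuples and the helper names (`free_reduce`, `multiply`, `invert`, `conjugate`, `supermove`) are illustrative assumptions, not the API of the released code, and the paper's actual supermoves may be defined differently.

```python
from typing import Callable, Sequence, Tuple

# Assumed encoding: generators x = 1, y = 2; inverses are -1, -2.
Word = Tuple[int, ...]             # e.g. (1, 2, -1) means x y x^-1
Presentation = Tuple[Word, Word]   # the two relators (r_0, r_1)
Move = Callable[[Presentation], Presentation]


def free_reduce(word: Sequence[int]) -> Word:
    """Cancel adjacent inverse pairs such as x x^-1 using a stack."""
    out = []
    for letter in word:
        if out and out[-1] == -letter:
            out.pop()
        else:
            out.append(letter)
    return tuple(out)


def inverse(word: Word) -> Word:
    """Inverse of a word: reverse it and invert every letter."""
    return tuple(-letter for letter in reversed(word))


def _replace(p: Presentation, i: int, new: Word) -> Presentation:
    return (new, p[1]) if i == 0 else (p[0], new)


def multiply(p: Presentation, i: int) -> Presentation:
    """AC move: r_i -> r_i r_j, where j is the other relator."""
    return _replace(p, i, free_reduce(p[i] + p[1 - i]))


def invert(p: Presentation, i: int) -> Presentation:
    """AC move: r_i -> r_i^-1."""
    return _replace(p, i, inverse(p[i]))


def conjugate(p: Presentation, i: int, g: int) -> Presentation:
    """AC move: r_i -> g r_i g^-1 for a generator (or inverse) g."""
    return _replace(p, i, free_reduce((g,) + p[i] + (-g,)))


def total_length(p: Presentation) -> int:
    """Sum of relator lengths; the trivial presentation has length 2."""
    return len(p[0]) + len(p[1])


def supermove(moves: Sequence[Move]) -> Move:
    """Bundle a sequence of primitive moves into one composite move."""
    def apply(p: Presentation) -> Presentation:
        for move in moves:
            p = move(p)
        return p
    return apply


# Hypothetical composite: r_0 -> r_0 r_1^-1, leaving r_1 unchanged.
right_divide = supermove([
    lambda p: invert(p, 1),
    lambda p: multiply(p, 0),
    lambda p: invert(p, 1),
])

start: Presentation = ((1, 2), (2,))   # <x, y | xy, y>
result = right_divide(start)           # <x, y | x, y>
```

In this toy case a single composite move takes the presentation to the trivial one, whereas a search over primitive moves would need three steps. Shortening the effective search depth in this way is the intuition behind bundling moves, under the assumptions stated above.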

Link To Code: https://github.com/Math-AI-Caltech/ACSolverX

Primary Area: Reinforcement Learning

Keywords: Reinforcement Learning, Mathematical Reasoning, AI for Math, Sparse Rewards, Benchmark Datasets, Andrews–Curtis Conjecture, Curriculum Learning

Submission Number: 12994
