ResearchArena-CayleyBench: RL/LLM Benchmark challenges which can advance mathematical research

19 Sept 2025 (modified: 12 Feb 2026) | ICLR 2026 Conference Desk Rejected Submission | Everyone | CC BY 4.0
Keywords: Benchmarking, LLMs, Cayley, RL
TL;DR: RL/LLM benchmark with Cayley graph challenges that can advance mathematical research
Abstract: The aim of this paper is to propose benchmark datasets, framed as Kaggle challenges, for testing ML/RL methods and LLMs on mathematical research problems in group and graph theory. Progress on these challenges would not only demonstrate advances in AI but could also resolve long-standing open problems and fundamental conjectures in the field, some of which have been studied by leading mathematicians such as T. Tao. Using the Kaggle platform gives the benchmarks public visibility, clearly defined rules, and gamification. Each challenge is formulated as a sorting task: given an input vector of integers, the goal is to reach the sorted order using only a prescribed set of moves. Classic examples include prefix (pancake) sorting, solving the Rubik’s cube, and simple bubble sorting. These tasks correspond precisely to path-finding in Cayley graphs, which are central objects in mathematics. From the RL point of view, graph nodes represent states, edges correspond to actions, and edge weights encode rewards. The value function is the shortest path length (the "word metric"). The challenge for RL methods is to handle graphs that are as large as possible, aiming at googol sizes, while the task for LLMs is to present algorithmic solutions. The second aim is to present CayleyPy, an AI-based open-source library for working with Cayley graphs, together with hundreds of mathematical conjectures generated with it. On several tasks CayleyPy outperforms the classical computer algebra systems GAP and SAGE by many orders of magnitude, and in some cases it allows path-finding on graphs of googol size. It supports arbitrary permutation or matrix groups given as input, and also maintains a predefined collection of hundreds of Cayley graphs, including dozens that originate from puzzles.
Primary Area: datasets and benchmarks
Submission Number: 19131
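
Illustration: the abstract frames each challenge as sorting a vector with a fixed set of moves, which is path-finding in a Cayley graph with the shortest path length as the value function. The following is a minimal sketch of that idea for prefix (pancake) sorting on very small inputs, using plain breadth-first search. It does not use CayleyPy and is not the method of the paper; the function names `prefix_reversals` and `word_metric` are hypothetical and chosen only for this example.

```python
from collections import deque


def prefix_reversals(n):
    """Pancake moves for vectors of length n: reverse the first k entries, k = 2..n."""
    def make(k):
        return lambda state: state[:k][::-1] + state[k:]
    return [make(k) for k in range(2, n + 1)]


def word_metric(n, moves):
    """Breadth-first search from the sorted vector.

    Returns a dict mapping every reachable state to the minimum number of
    moves between it and the sorted vector. Each prefix reversal is its own
    inverse, so the distance from sorted to a state equals the distance
    from that state back to sorted.
    """
    start = tuple(range(n))
    dist = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for move in moves:
            nxt = move(state)
            if nxt not in dist:  # first visit in BFS gives the shortest path
                dist[nxt] = dist[state] + 1
                queue.append(nxt)
    return dist


dist = word_metric(4, prefix_reversals(4))
print(len(dist), max(dist.values()))
```

For n = 4 the search visits all 24 permutations and the largest distance is 4. Exhaustive search of this kind grows factorially with n, which is why the challenges target methods that scale far beyond what enumeration can reach.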