TopoBench: Benchmarking LLMs on hard topological reasoning
Track: long paper (up to 10 pages)
Keywords: LLM, Benchmarks, Topology, ARC-AGI
TL;DR: We introduce TopoBench, a diagnostic framework for testing topological reasoning in LLMs.
Abstract: Solving topology grid puzzles requires maintaining global spatial invariants such as connectivity, loop closure, and region symmetry, and remains challenging for even the most powerful large language models (LLMs). To study these abilities under controlled settings, we introduce TopoBench, a benchmark of six puzzle families across three difficulty levels. We evaluate strong reasoning LLMs on TopoBench and find that even frontier models solve fewer than one quarter of hard instances, with two families nearly unsolved. To further understand the failure modes, we annotate 750 chain-of-thought traces with an error taxonomy that surfaces four candidate causal failure modes, then test them with targeted interventions simulating each error type. These interventions show that certain error patterns, such as premature commitment (going down a wrong solution path early on) and constraint forgetting (moves that violate rules), have a direct impact on the ability to solve the puzzle, while repeated reasoning (re-trying the same reasoning path without meaningful variation) is a benign effect of search. Finally, we study mitigation strategies aimed at reducing these error patterns, including prompt guidance, cell-aligned grid representations, and tool-based constraint checking, and find that the bottleneck lies in extracting constraints from spatial representations rather than in reasoning over them.
Presenter: ~Mayug_Maniparambil1
Format: Maybe: the presenting author will attend in person, contingent on other factors that still need to be determined (e.g., visa, funding).
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 90