LINGOLY-TOO: Disentangling Reasoning from Knowledge with Templatised Orthographic Obfuscation

Published: 26 Jan 2026, Last Modified: 02 Mar 2026 · ICLR 2026 Poster · CC BY 4.0
Keywords: Large Language Models, Reasoning, Benchmark, Linguistic reasoning, Permutation
TL;DR: An inductive reasoning benchmark about natural languages designed to minimise the ability to solve with knowledge or memory
Abstract: Frontier language models demonstrate increasing ability at solving reasoning problems, but their performance is often inflated by circumventing reasoning and instead relying on their expanding knowledge and memorisation capacity. We introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions and a total of 6,995 sub-questions that counters these shortcuts by applying expert-designed obfuscations to Linguistics Olympiad problems. These obfuscations preserve the underlying solution logic while reducing the likelihood that problems are solvable via knowledge or memorisation. Our experiments show that models exploit shortcuts on the original questions, as performance drops markedly upon obfuscation. Even the best reasoning models remain highly sensitive, with scores dropping from around 0.59 on original problems to 0.48 after obfuscation. LINGOLY-TOO disentangles reasoning from knowledge, offering a clearer measure of true reasoning capabilities.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19790