Keywords: Large Language Models, Reasoning, Benchmark, Linguistic reasoning, Permutation
TL;DR: An inductive reasoning benchmark about natural languages designed to minimise the ability to solve with knowledge or memory
Abstract: Frontier language models demonstrate increasing ability at solving reasoning
problems, but their performance is often inflated by circumventing reasoning and
instead relying on their expanding knowledge and memorisation capacity. We
introduce LINGOLY-TOO, a challenging reasoning benchmark of 1,203 questions
and a total of 6,995 sub-questions that counters these shortcuts by applying expert-
designed obfuscations to Linguistics Olympiad problems. These obfuscations
preserve the underlying solution logic while reducing the likelihood that problems
are solvable via knowledge or memorisation. Our experiments show that
models exploit shortcuts on the original questions, as performance drops markedly
upon obfuscation. Even the best reasoning models remain highly sensitive, with
scores dropping from around 0.59 on original problems to 0.48 after obfuscation.
LINGOLY-TOO disentangles reasoning from knowledge, offering a clearer measure
of true reasoning capabilities.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19790