Could language models win the International Linguistics Olympiad?

Jamie Garnham; Ehsan Shareghi

Published: 18 May 2026, Last Modified: 18 May 2026 | CoNLL 2026 Archival | CC BY 4.0

Keywords: linguistic reasoning, linguistics olympiad, linguistic puzzles, in-context learning, LLMs, large language models, inference-time scaling

TL;DR: By developing the first inference-time scaling framework for linguistic puzzles, we show that linguistic reasoning remains significantly harder for LLMs than math or commonsense tasks.

Abstract: Linguistic puzzles, in which the solver must deduce the rules of an unfamiliar language purely in-context, remain a uniquely perplexing problem format even for state-of-the-art large language models. By exploring various inference-time scaling methods, we demonstrate that language models’ performance on these problems can be improved without fine-tuning and without supplementary linguistic context. To this end, this paper introduces the first domain-specific inference-time scaling framework for linguistic puzzles, which we use to improve the performance of three models, R1 (DeepSeek), Gemini 2.5 Flash (Google), and Llama 3.3 70B Instruct (Meta), on a challenging Linguistics Olympiad-based benchmark by 4.9, 13.1, and 4.9 percentage points, respectively. Nonetheless, even when multiple optimisations are applied, we find that LLMs’ performance on linguistic puzzles remains well below their performance on comparable mathematical and commonsense benchmarks, and we speculate as to why linguistic reasoning continues to pose a distinctive challenge for even the most capable large language models.

Scope Confirmation: To the best of my judgment, this submission falls within the scope of CoNLL.

Primary Area Selection: Theoretical Analysis and Interpretation of ML Models for NLP

Secondary Area Selection: Syntax and Morphology, Other (specify below!)

Other Secondary Area: Language 'Acquisition' by LLMs

Use Of Generative Artificial Intelligence Tools: Yes, for editing/proofreading the manuscript, Yes, for writing code

Data Collection From Human Subjects: No

Submission Type: Archival: I certify that the submission has not been previously published, nor is the material in it under review by another journal or conference. Further, no material in it will be submitted for review at another conference or journal while under review by CoNLL 2026.

Submission Number: 161
