EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages

Published: 08 Mar 2026, Last Modified: 25 Apr 2026
Venue: ICLR 2026 Workshop LLM Reasoning
License: CC BY 4.0
Track: long paper (up to 10 pages)
Keywords: Large language models, code generation, out-of-distribution evaluation, esoteric programming languages, benchmark contamination, in-context learning, reasoning vs retrieval, agentic systems
TL;DR: Frontier LLMs achieving 85-95% on standard code benchmarks score only 0-11% on equivalent tasks in esoteric languages, revealing reliance on pattern matching over genuine reasoning.
Abstract: Large language models achieve near-ceiling performance on code generation benchmarks, yet these results increasingly reflect memorization rather than genuine reasoning. We introduce EsoLang-Bench, a benchmark using five esoteric programming languages—Brainfuck, Befunge-98, Whitespace, Unlambda, and Shakespeare—that lack benchmark gaming incentives due to their economic irrationality for pre-training. These languages require the same computational primitives as mainstream programming but have 1,000–100,000× less training data than Python. We evaluate five frontier models across five prompting strategies and find a dramatic capability gap: models achieving 85–95% on standard benchmarks score only 0–11% on equivalent esoteric tasks, with 0% accuracy beyond the Easy tier. Few-shot learning and self-reflection fail to improve performance, suggesting these techniques exploit training priors rather than enabling genuine learning. EsoLang-Bench provides the first benchmark designed to mimic human learning—acquiring new languages through documentation, interpreter feedback, and iterative experimentation—measuring transferable reasoning skills resistant to data contamination.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 13