EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
Keywords: large language models, in-context learning, code generation, esoteric programming languages, out-of-distribution evaluation, reasoning, benchmark
TL;DR: Frontier LLMs score 85–95% on standard code benchmarks but only 0–11% on esoteric programming languages. Few-shot prompting gives no statistically significant improvement over zero-shot, suggesting that models rely on pattern matching rather than genuine reasoning.
Abstract: We present EsoLang-Bench, a benchmark revealing fundamental limitations in how large language models (LLMs) leverage in-context learning (ICL). Frontier models achieving 85–95% accuracy on standard code benchmarks (HumanEval, MBPP) score only 0–11% on esoteric programming languages with scarce training data. Notably, few-shot prompting yields no statistically significant improvement over zero-shot (p = 0.505), contradicting assumptions about ICL enabling adaptation to novel domains. Our analysis indicates that ICL primarily activates training priors rather than enabling genuine learning. Despite this limitation, self-scaffolding with direct interpreter feedback outperforms multi-agent approaches, and agentic systems achieve 2–3× improvement through interpreter feedback loops with efficient context management. These findings have implications for understanding LLM generalization to out-of-distribution domains.
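The interpreter feedback loop mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the benchmark itself: Brainfuck stands in for "an esoteric language", `run_bf` is a toy interpreter, `solve_with_feedback` is a hypothetical single-model loop, and `stub_model` is a placeholder for a real LLM call. Passing only the latest interpreter message back to the generator is one possible reading of "efficient context management", not the authors' stated method.

```python
from typing import Callable, Optional, Tuple


def run_bf(program: str, stdin: str = "", max_steps: int = 100_000) -> str:
    """Toy Brainfuck interpreter (illustrative stand-in for a real one)."""
    # Precompute matching bracket positions so loops can jump in O(1).
    jumps, stack = {}, []
    for i, ch in enumerate(program):
        if ch == "[":
            stack.append(i)
        elif ch == "]":
            if not stack:
                raise SyntaxError(f"unmatched ']' at {i}")
            j = stack.pop()
            jumps[i], jumps[j] = j, i
    if stack:
        raise SyntaxError(f"unmatched '[' at {stack[-1]}")

    tape, ptr, pc, steps = [0] * 30_000, 0, 0, 0
    out, inp = [], iter(stdin)
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise TimeoutError("step limit exceeded")
        ch = program[pc]
        if ch == ">":
            ptr += 1
        elif ch == "<":
            ptr -= 1
        elif ch == "+":
            tape[ptr] = (tape[ptr] + 1) % 256
        elif ch == "-":
            tape[ptr] = (tape[ptr] - 1) % 256
        elif ch == ".":
            out.append(chr(tape[ptr]))
        elif ch == ",":
            tape[ptr] = ord(next(inp, "\0")) % 256  # EOF reads as 0
        elif ch == "[" and tape[ptr] == 0:
            pc = jumps[pc]  # skip the loop body
        elif ch == "]" and tape[ptr] != 0:
            pc = jumps[pc]  # repeat the loop body
        pc += 1
    return "".join(out)


def solve_with_feedback(
    generate: Callable[[Optional[str]], str],
    stdin: str,
    expected: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], int]:
    """Hypothetical loop: generate, run, feed the interpreter result back."""
    feedback: Optional[str] = None
    for attempt in range(1, max_attempts + 1):
        program = generate(feedback)
        try:
            actual = run_bf(program, stdin)
        except Exception as exc:  # syntax error, timeout, tape overrun
            feedback = f"error: {exc}"
            continue
        if actual == expected:
            return program, attempt
        # Keep only the latest result rather than the full history.
        feedback = f"expected {expected!r}, got {actual!r}"
    return None, max_attempts


def stub_model(feedback: Optional[str]) -> str:
    """Placeholder for an LLM: off by one first, corrected after feedback."""
    return "+" * 64 + "." if feedback is None else "+" * 65 + "."


program, attempts = solve_with_feedback(stub_model, stdin="", expected="A")
```

With the stub above, the first attempt prints the wrong character, the interpreter output is returned as feedback, and the second attempt succeeds. A real evaluation would replace `stub_model` with a model call and `run_bf` with the interpreter for the language under test.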
Submission Number: 45