In-Context Learning for Esoteric Programming Languages: Evaluating and Enhancing LLM Reasoning Without Fine-Tuning

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: LLMs, In-Context Learning, Code Generation, AI Benchmarking, Agentic AI
TL;DR: Current LLMs demonstrate a limited ability to utilize documentation and examples to program in esoteric languages, and this can be used for in-context self-improvement.
Abstract: Large Language Models (LLMs) have revolutionized mainstream software development, yet their ability to generalize to esoteric languages, which may have little or no representation in the training corpus, remains poor. Programming in esoteric languages tests a model's capacity to infer novel grammar and to reason nontrivially over documentation. To quantify these effects, we evaluate both open- and closed-source LLMs on code generation and language identification tasks across four esoteric languages (Minipy, Pyth, Rhokell, and 0815), and compare traditional prompt-based methods to agentic coding IDEs. Our findings reveal that LLMs can now generate some correct code in these languages when provided with documentation and sparse examples; however, performance remains far below that of comparable models on common programming languages. Furthermore, we introduce a novel in-context augmentation strategy in which LLMs first generate solutions, which are then manually verified and re-inserted as examples into subsequent prompts. Our results indicate that strategically embedding just a few analogous problems can yield large accuracy improvements without any model retraining. This "self-scaffolding" approach boosts performance on coding benchmarks: inserting DeepSeek's verified EsoEval solutions raised EsoEval accuracy on Pyth from 16.67% to 30.82%, while HumanEval accuracy on Minipy jumped from 51% to 65%. We offer this as a flexible alternative to costly fine-tuning, paving the way for rapid adaptation of LLMs to highly specialized, emerging, or otherwise low-data domains.
Submission Number: 88
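The "self-scaffolding" loop described in the abstract (generate solutions, manually verify them, then re-insert the verified ones as few-shot examples in later prompts) can be sketched as follows. This is a minimal illustration, not the authors' implementation: all function names (`build_prompt`, `self_scaffold`) and the dummy model/verifier are hypothetical stand-ins.

```python
# Hedged sketch of the in-context "self-scaffolding" strategy:
# candidate solutions that pass (manual) verification are accumulated
# and reused as few-shot examples in subsequent prompts.
from typing import Callable, List, Tuple


def build_prompt(docs: str, examples: List[Tuple[str, str]], task: str) -> str:
    """Assemble language documentation, verified examples, and the new task."""
    shots = "\n\n".join(f"Problem: {p}\nSolution:\n{s}" for p, s in examples)
    return f"{docs}\n\n{shots}\n\nProblem: {task}\nSolution:\n"


def self_scaffold(
    model: Callable[[str], str],         # stand-in for an LLM call
    verify: Callable[[str, str], bool],  # stand-in for manual verification
    docs: str,
    tasks: List[str],
) -> List[Tuple[str, str]]:
    """Accumulate verified (problem, solution) pairs as in-context examples."""
    verified: List[Tuple[str, str]] = []
    for task in tasks:
        candidate = model(build_prompt(docs, verified, task))
        if verify(task, candidate):
            # Verified solutions scaffold every later prompt.
            verified.append((task, candidate))
    return verified


# Toy run with a dummy "model" that tags the last task in the prompt,
# and a verifier that checks the tag, just to show the control flow.
pairs = self_scaffold(
    model=lambda p: "solution-for:" + p.rsplit("Problem: ", 1)[1].split("\n")[0],
    verify=lambda task, sol: sol.endswith(task),
    docs="(esoteric-language documentation here)",
    tasks=["t1", "t2"],
)
print(pairs)
```

Because examples accumulate across iterations, each later task is prompted with a strictly richer context, which is the mechanism the paper credits for the accuracy gains without retraining.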