Keywords: In-context learning, transformers, linear regression, Markov chains, Bayesian Occam's razor
TL;DR: Transformers exhibit Bayesian Occam's razor in-context
Abstract: In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates. While existing research has typically studied ICL in fixed-complexity setups, real-world language models encounter tasks of diverse complexity levels. This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones.
We design testbeds based on Markov chains and linear regression that reveal that transformers not only identify the correct complexity level for each task but also accurately infer the corresponding parameters—even when the in-context examples are consistent with multiple complexity hypotheses. Notably, when presented with data generated by simpler processes, transformers consistently favor the least complex sufficient explanation. We theoretically explain this behavior through a Bayesian framework, demonstrating that transformers effectively implement an in-context Bayesian Occam's razor by balancing model fit against complexity penalties.
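To make the "fit versus complexity" trade-off concrete, here is a minimal illustrative sketch (not code from the paper) of Bayesian Occam's razor on the Markov-chain testbed described above: comparing the log marginal likelihood (evidence) of an order-0 (i.i.d.) model against an order-1 Markov model on a binary sequence, using Dirichlet-multinomial conjugacy. The function names, prior choice, and sequence length are illustrative assumptions; the point is only that when the data comes from the simpler process, the extra parameters of the order-1 model are penalized and the simpler hypothesis receives higher evidence.

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_multinomial_log_evidence(counts, alpha=1.0):
    """Log marginal likelihood of categorical counts under a symmetric Dirichlet(alpha) prior."""
    counts = np.asarray(counts, dtype=float)
    k = counts.size
    return (gammaln(k * alpha) - gammaln(counts.sum() + k * alpha)
            + np.sum(gammaln(counts + alpha)) - k * gammaln(alpha))

def log_evidence_markov(seq, order, n_states=2, alpha=1.0):
    """Log evidence of a sequence under an order-0 (i.i.d.) or order-1 Markov model.

    The first symbol is dropped in both cases so the two models score the same
    199 observations (a simplification for the sketch, not the paper's setup).
    """
    if order == 0:
        counts = np.bincount(seq[1:], minlength=n_states)
        return dirichlet_multinomial_log_evidence(counts, alpha)
    # order == 1: independent Dirichlet prior on each row of the transition matrix
    trans = np.zeros((n_states, n_states))
    for prev, nxt in zip(seq[:-1], seq[1:]):
        trans[prev, nxt] += 1
    return sum(dirichlet_multinomial_log_evidence(row, alpha) for row in trans)

rng = np.random.default_rng(0)
iid_seq = rng.integers(0, 2, size=200)        # data generated by the simpler (order-0) process
print(log_evidence_markov(iid_seq, order=0))  # higher evidence: simpler sufficient model wins
print(log_evidence_markov(iid_seq, order=1))  # lower evidence: extra parameters are penalized
```

In this toy setting the Occam penalty arises automatically from marginalizing over parameters; the abstract's claim is that transformers' in-context behavior tracks this same evidence-based preference for the least complex sufficient explanation.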
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1707