Higher Embedding Dimension Creates a Stronger World Model for a Simple Sorting Task

Published: 23 Sept 2025, Last Modified: 22 Nov 2025, LAW, CC BY 4.0
Keywords: embedding dimension, world model, transformer, reinforcement learning, sorting, PPO, interpretability
TL;DR: We study how minimal language-model-based agents—transformers trained via reinforcement learning—develop internal world models when solving a structured sorting task.
Abstract: We study how minimal language-model-based agents—transformers trained via reinforcement learning—develop internal world models when solving a structured sorting task. While even very small embedding dimensions are sufficient for models to achieve high accuracy, larger dimensions yield representations that are more faithful, consistent, and robust. In particular, higher embedding dimensions strengthen the formation of structured internal representations and lead to better interpretability. Across hundreds of experiments, we observe two consistent mechanisms: (1) the last row of the attention weight matrix monotonically encodes the global ordering of tokens; and (2) the selected transposition aligns with the largest adjacent difference of these encoded values. Together, these findings provide quantitative evidence that transformers form interpretable internal world models, and that model size improves interpretability in addition to performance. We release metrics and analyses that can be reused to probe similar tasks.
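The two mechanisms in the abstract suggest a simple probe: test whether the final attention row is rank-correlated with the ground-truth token ordering, and whether the agent's chosen adjacent transposition coincides with the largest adjacent gap in those encoded values. The sketch below is illustrative only and is not the authors' released code; the names `attn_last_row`, `token_ranks`, and `chosen_swap` are hypothetical placeholders for quantities a reader would extract from their own model.

```python
# Hypothetical probe sketch for the two mechanisms described in the abstract.
# All variable and function names here are illustrative assumptions.
import numpy as np
from scipy.stats import spearmanr

def probe_world_model(attn_last_row: np.ndarray,
                      token_ranks: np.ndarray,
                      chosen_swap: int) -> dict:
    """Quantify the two abstract-level claims for a single decision step.

    attn_last_row : attention weights from the final query position to each
                    sequence token (one value per token).
    token_ranks   : ground-truth rank of each token in the sorted order.
    chosen_swap   : index i of the adjacent transposition (i, i+1) the agent
                    selected at this step.
    """
    # (1) Monotone encoding: the last attention row should order the tokens
    #     the same way the ground-truth ranks do (Spearman rho near 1).
    rho, _ = spearmanr(attn_last_row, token_ranks)

    # (2) Action alignment: the selected transposition should sit at the
    #     largest adjacent difference of the encoded values.
    adjacent_diff = np.abs(np.diff(attn_last_row))
    aligned = int(np.argmax(adjacent_diff)) == chosen_swap

    return {"rank_correlation": float(rho), "action_aligned": aligned}

# Toy usage: a 5-token example whose encoded values reproduce the true ranks,
# so the probe reports perfect correlation and an aligned action choice.
row = np.array([0.05, 0.10, 0.30, 0.35, 0.20])
ranks = np.array([0, 1, 3, 4, 2])
print(probe_world_model(row, ranks, chosen_swap=1))
```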
Submission Type: Research Paper (4-9 Pages)
Submission Number: 29