Mechanistic Interpretability analysis of a single-layer transformer on 0-1 knapsack

ICLR 2026 Conference Submission 25431 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Mechanistic Interpretability, Machine Learning, grokking, knapsack problem
TL;DR: A mechanistic interpretability study of a single-layer transformer on 0-1 knapsack, showing the inability of transformers to solve NP-complete tasks.
Abstract: Small language models have been shown to generalise on toy problems when trained on algorithmically generated datasets. It is poorly understood whether this phenomenon extends to harder problems, such as NP-complete problems. In this work, we show the inability of a single-layer transformer to "grok" the 0-1 knapsack problem. We analyse the model's internals using visualisations and interpretability techniques and show why it fails to form a robust internal circuit. This demonstrates how transformer-based models struggle to generalise on NP-complete problems, and more broadly on problems requiring large amounts of computation. This work showcases why LLM-based AI agents should not be deployed in high-impact settings where extensive planning and computation are required.
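The abstract does not specify how the algorithmically generated 0-1 knapsack dataset is constructed. Below is a minimal, hypothetical sketch of one way such training examples could be produced, using the standard dynamic-programming solver to label each instance with its optimal value; all names and parameters (e.g. `n_items`, `capacity`, the flat token layout) are illustrative assumptions, not the authors' actual setup.

```python
# Hypothetical sketch: generating 0-1 knapsack training examples, with the
# classic DP solver providing ground-truth labels. Parameters are assumptions.
import random


def knapsack_optimal_value(weights, values, capacity):
    """Classic O(n * capacity) DP for the 0-1 knapsack optimum."""
    dp = [0] * (capacity + 1)
    for w, v in zip(weights, values):
        # Iterate capacities downwards so each item is used at most once.
        for c in range(capacity, w - 1, -1):
            dp[c] = max(dp[c], dp[c - w] + v)
    return dp[capacity]


def make_example(n_items=6, max_weight=10, max_value=10, capacity=20, seed=None):
    rng = random.Random(seed)
    weights = [rng.randint(1, max_weight) for _ in range(n_items)]
    values = [rng.randint(1, max_value) for _ in range(n_items)]
    target = knapsack_optimal_value(weights, values, capacity)
    # One possible flat token sequence a transformer could be trained on:
    # item weights, item values, the capacity, then the optimal value as label.
    return {"input": weights + values + [capacity], "label": target}


if __name__ == "__main__":
    print(make_example(seed=0))
```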
Primary Area: interpretability and explainable AI
Submission Number: 25431