Beneath the Surface: Exposing and Mitigating Surface Learning in Large Language Models

ICLR 2026 Conference Submission 17380 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Surface Learning, Shortcut Learning, Large Language Models
TL;DR: We identify a surface learning behavior in LLMs and propose a novel approach to mitigate it.
Abstract: As Large Language Models (LLMs) continue to evolve, assessing their genuine comprehension of underlying knowledge is crucial to ensuring their reliability in real-world applications. To evaluate what LLMs actually learn, we first introduce the ME-Test suite, comprising Mathematical and English grammar examinations in which each question is accompanied by the relevant knowledge needed to guide the model. Building on this, we construct sequences of questions of increasing difficulty based on Cognitive Load Theory, enabling the model to solve problems continuously using the dialogue history. Through a comprehensive evaluation, we uncover a **Surface Learning** behavior in LLMs that mirrors student learning behavior described in educational psychology: although the models appear to know the formulas and strategies required to solve specific types of problems, they do not truly comprehend the essence of these concepts, yielding surface-level, short-term gains rather than in-depth learning. To mitigate the surface learning behavior of LLMs, we further propose a long-term strategy for both training-free and post-training scenarios. In the training-free scenario, inspired by Self-Concept theory, LLMs are prompted with goal-setting and planning beforehand, as well as feedback afterward, to improve their reasoning process. To better activate the underlying knowledge during post-training, we propose a behavior correction strategy that re-ranks samples based on designed self-cognition indicators of LLMs. This strategy prevents models from relying on easy-to-find paradigms to maximize rewards or minimize losses in the initial training stage instead of undertaking actual reasoning. Extensive Supervised Fine-Tuning (SFT) and Reinforcement Fine-Tuning (RFT) experiments on LLMs demonstrate the effectiveness of the strategy.
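The behavior correction idea in the abstract can be illustrated with a minimal sketch. The code below is not the paper's implementation; it assumes a hypothetical scalar self-cognition indicator derived from a confidence/correctness probe, and simply re-orders fine-tuning samples so that items the model likely answers via surface cues are deferred past the initial training stage. The names `Sample`, `self_cognition_score`, `rerank_for_training`, and `probe` are illustrative, not from the submission.

```python
# Minimal illustrative sketch (assumed, not the authors' method).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    question: str
    answer: str


def self_cognition_score(confidence: float, correct: bool) -> float:
    """Hypothetical indicator of surface learning: high confidence paired with a
    wrong answer suggests the model is matching a familiar paradigm rather than
    applying the underlying knowledge."""
    return confidence if not correct else 1.0 - confidence


def rerank_for_training(samples: List[Sample],
                        probe: Callable[[Sample], Tuple[float, bool]]) -> List[Sample]:
    """Re-order samples so that likely surface-learned ones (high score) appear
    later, keeping easy-to-shortcut items out of the initial training stage."""
    scored = [(self_cognition_score(*probe(s)), s) for s in samples]
    scored.sort(key=lambda pair: pair[0])  # ascending: low-score samples first
    return [s for _, s in scored]
```

In practice, the probe might sample the model's answer and a verbalized confidence for each question before training begins; the paper's actual indicators and ordering criterion may differ.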
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17380