Abstract: Large Language Models (LLMs) demonstrate remarkable multilingual capabilities and broad knowledge.
However, the internal mechanisms underlying the development of these capabilities remain poorly understood.
To investigate this, we analyze how the information encoded in LLMs' internal representations evolves during the training process.
Specifically, we train sparse autoencoders at multiple checkpoints of the model and systematically compare the resulting interpretations across these training stages.
Our findings suggest that LLMs initially acquire knowledge in each language independently, and only later establish cross-linguistic correspondences.
Moreover, we observe that after mastering token-level knowledge, the model transitions to learning higher-level, abstract concepts, indicating a shift toward more conceptual understanding.
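As a rough illustration of the checkpoint-wise analysis described above, the following is a minimal sketch of training a sparse autoencoder on cached activations from one checkpoint and repeating it per checkpoint. It assumes PyTorch; the dimensions, L1 coefficient, and the placeholder activation tensors are illustrative assumptions, not values or code from the paper.

```python
# Hypothetical sketch: a sparse autoencoder (SAE) fit to residual-stream
# activations cached from one training checkpoint. All hyperparameters and
# the random stand-in activations are assumptions for illustration only.
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps the latent code non-negative; the L1 penalty below
        # encourages it to be sparse.
        z = torch.relu(self.encoder(x))
        return self.decoder(z), z


def train_sae(activations: torch.Tensor, d_hidden: int = 4096,
              l1_coeff: float = 1e-3, epochs: int = 10) -> SparseAutoencoder:
    """Fit an SAE to a batch of activations cached from one checkpoint."""
    sae = SparseAutoencoder(activations.shape[-1], d_hidden)
    opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
    for _ in range(epochs):
        recon, z = sae(activations)
        # Reconstruction error plus an L1 sparsity penalty on the latent code.
        loss = ((recon - activations) ** 2).mean() + l1_coeff * z.abs().mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return sae


if __name__ == "__main__":
    # Stand-ins for activations cached at an early and a late checkpoint;
    # in practice these would be hidden states collected from the LLM.
    early_acts = torch.randn(1024, 768)
    late_acts = torch.randn(1024, 768)
    sae_early = train_sae(early_acts)
    sae_late = train_sae(late_acts)
    # The features each SAE learns can then be interpreted and compared
    # across checkpoints to trace how encoded knowledge evolves.
```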
Paper Type: Short
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: knowledge tracing/discovering/inducing, multilingualism
Contribution Types: Model analysis & interpretability
Languages Studied: English, Japanese
Submission Number: 6786