The role of Mechanistic Interpretability in 'unveiling' the emergent representations of Large Language Models

TMLR Paper3438 Authors

04 Oct 2024 (modified: 21 Jan 2025) · Rejected by TMLR · CC BY 4.0
Abstract: Despite the significant progress achieved by Large Language Models (LLMs), the internal mechanisms that enable their generalization and reasoning abilities remain largely unexplored. This gap, compounded by phenomena such as hallucinations, adversarial perturbations, and misalignment with human expectations, fuels distrust that hinders the safe and beneficial use of LLMs. This paper provides a comprehensive overview of current explainability approaches for investigating the underlying mechanisms of LLMs. We explore the strategic components expected to lay the foundation for generalization capabilities by studying how to quantify the knowledge acquired and delivered by LLMs and, in particular, how knowledge is composed and encoded within model parameters, drawing on mechanistic interpretability, probing techniques, and representation engineering. Finally, we adopt a mechanistic perspective to explain emergent phenomena arising from training dynamics, best exemplified by memorisation and generalisation.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Dmitry_Kangin1
Submission Number: 3438