The role of Mechanistic Interpretability in 'unveiling' the emergent representations of Large Language Models
Abstract: Despite the significant progress achieved by Large Language Models (LLMs), the internal mechanisms that enable their generalization and reasoning abilities remain insufficiently explored. This gap, compounded by phenomena such as hallucinations, adversarial perturbations, and misalignment with human expectations, fuels concerns that hinder the safe and beneficial use of LLMs. This paper provides a comprehensive overview of the current state of explainability approaches for investigating the underlying mechanisms of LLMs. We explore the strategic components expected to lay the foundation for generalization capabilities by studying how to quantify the knowledge acquired and delivered by LLMs and, in particular, how that knowledge is composed and encoded within model parameters, drawing on mechanistic interpretability, probing techniques, and representation engineering. Finally, we adopt a mechanistic perspective to explain emergent phenomena that arise during training dynamics, best exemplified by memorization and generalization.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Dmitry_Kangin1
Submission Number: 3438