Towards Uncovering How Large Language Models Work: An Interpretability Perspective

ACL ARR 2024 June Submission242 Authors

08 Jun 2024 (modified: 02 Jul 2024) · ACL ARR 2024 June Submission · CC BY 4.0
Abstract: Large language models (LLMs) have led to breakthroughs in language tasks, yet the internal mechanisms that enable their remarkable generalization and reasoning abilities remain opaque. This lack of transparency makes it difficult to address challenges such as hallucinations, toxicity, and misalignment with human values, hindering the safe and beneficial deployment of LLMs. This survey paper aims to uncover the internal working mechanisms underlying LLM functionality through the lens of explainability. First, we review how knowledge is encoded within LLMs using mechanistic interpretability techniques. Then, we summarize how knowledge is embedded in LLM representations by leveraging probing techniques and representation engineering. Additionally, we investigate training dynamics from a mechanistic perspective to explain phenomena such as grokking and memorization. Lastly, we explore how the insights gained from these explanations can enhance LLM performance through model editing, improve efficiency through pruning, and better align LLMs with human values.
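To illustrate the probing techniques mentioned in the abstract, the sketch below trains a linear classifier on frozen hidden representations to test whether a property is linearly decodable from them. This is a generic illustration, not the survey's method: the hidden states are synthetic stand-ins (random features with the property weakly injected), and the sizes, layer, and probed property are hypothetical.

```python
# Minimal linear-probe sketch (illustrative only): test whether a binary
# property is linearly decodable from frozen hidden representations.
# The "hidden states" below are synthetic stand-ins for activations that
# would normally be extracted from a chosen LLM layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

n_examples, hidden_dim = 2000, 768            # hypothetical sizes
labels = rng.integers(0, 2, size=n_examples)  # the property being probed

# Synthetic activations: the labeled property is weakly encoded along one
# random direction, mimicking information embedded in a representation.
direction = rng.normal(size=hidden_dim)
hidden_states = rng.normal(size=(n_examples, hidden_dim)) \
    + np.outer(labels * 2.0 - 1.0, direction) * 0.5

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, labels, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)  # the linear probe itself
probe.fit(X_train, y_train)

acc = accuracy_score(y_test, probe.predict(X_test))
# Accuracy well above chance suggests the property is linearly decodable
# from the representation (subject to the usual probing caveats).
print(f"probe accuracy: {acc:.3f}")
```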
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: knowledge tracing/discovering/inducing, probing
Contribution Types: Surveys
Languages Studied: English
Submission Number: 242