Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

Haiyan Zhao; Heng Zhao; Bo Shen; Ali Payani; Fan Yang; Mengnan Du

Beyond Single Concept Vector: Modeling Concept Subspace in LLMs with Gaussian Distribution

Haiyan Zhao, Heng Zhao, Bo Shen, Ali Payani, Fan Yang, Mengnan Du

Published: 22 Jan 2025, Last Modified: 01 May 2025ICLR 2025 PosterEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Large language model, Explainability, Probing, Gaussian distribution

Abstract: Probing learned concepts in large language models (LLMs) is crucial for understanding how semantic knowledge is encoded internally. Training linear classifiers on probing tasks is a principle approach to denote the vector of a certain concept in the representation space. However, the single vector identified for a concept varies with both data and training, making it less robust and weakening its effectiveness in real-world applications. To address this challenge, we propose an approach to approximate the subspace representing a specific concept. Built on linear probing classifiers, we extend the concept vectors into Gaussian Concept Subspace (GCS). We demonstrate GCS's effectiveness through measuring its faithfulness and plausibility across multiple LLMs with different sizes and architectures. Additionally, we use representation intervention tasks to showcase its efficacy in real-world applications such as emotion steering. Experimental results indicate that GCS concept vectors have the potential to balance steering performance and maintaining the fluency in natural language generation tasks.

Primary Area: interpretability and explainable AI

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Submission Number: 4838

Loading