Rethinking The Reliability of Representation Engineering in Large Language Models

26 Sept 2024 (modified: 05 Feb 2025), Submitted to ICLR 2025, CC BY 4.0
Keywords: transparency, interpretability, causality, AI safety
Abstract: Inspired by cognitive neuroscience, representation engineering (RepE) seeks to connect the neural activities within large language models (LLMs) to their behaviors, providing a promising pathway towards transparent AI. Despite its successful application in many contexts, the connection established by RepE is not always reliable, because it implicitly assumes that LLMs consistently follow the roles assigned in the instructions while neural activities are being collected. When this assumption is violated, the observed correlations between the collected neural activities and model behaviors may not be causal because of potential confounding biases, which compromises the reliability of RepE. We identify this key limitation and propose CAusal Representation Engineering (CARE), a principled framework that employs a matched-pair trial design to control for confounders. By isolating the impact of confounders on neural activities and model behaviors, CARE grounds the connection in causality, allowing for more reliable interpretation and control of LLMs. Extensive empirical evaluations across various aspects of safety demonstrate the effectiveness of CARE compared to the original RepE implementation, particularly in controlling model behaviors, highlighting the importance of causality in developing transparent and trustworthy AI systems.
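To make the confounding argument concrete, here is a minimal toy sketch of why a matched-pair design can help. It is not the CARE implementation, which the abstract does not specify. It assumes synthetic 4-dimensional "activations" built from one behavior direction and one confounder direction, and every name in it (`activation`, `naive_direction`, `matched_pair_direction`, `demo`) is hypothetical. The naive estimate takes the difference of group means when the confounder happens to correlate with the behavior label. The matched estimate averages within-pair differences where both members share the same confounder value, so the confounder term cancels.

```python
import math
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def activation(rng, behavior, confounder, behavior_dir, confounder_dir, noise=0.1):
    """Synthetic activation: behavior and confounder contributions plus Gaussian noise."""
    return [
        behavior * b + confounder * c + rng.gauss(0.0, noise)
        for b, c in zip(behavior_dir, confounder_dir)
    ]


def naive_direction(pos, neg):
    """Difference of group means; absorbs any confounder that differs between groups."""
    return subtract(mean_vector(pos), mean_vector(neg))


def matched_pair_direction(pairs):
    """Mean of within-pair differences; a confounder shared within a pair cancels."""
    return mean_vector([subtract(p, n) for p, n in pairs])


def demo(seed=0, n=200):
    rng = random.Random(seed)
    behavior_dir = [1.0, 0.0, 0.0, 0.0]    # assumed ground-truth behavior axis
    confounder_dir = [0.0, 1.0, 0.0, 0.0]  # assumed confounder axis

    # Unmatched collection: the confounder is higher on average for positive samples.
    pos = [activation(rng, 1.0, rng.gauss(1.0, 0.3), behavior_dir, confounder_dir)
           for _ in range(n)]
    neg = [activation(rng, 0.0, rng.gauss(0.0, 0.3), behavior_dir, confounder_dir)
           for _ in range(n)]
    naive = naive_direction(pos, neg)

    # Matched collection: both members of each pair share one confounder value.
    pairs = []
    for _ in range(n):
        c = rng.gauss(0.5, 0.5)
        pairs.append((
            activation(rng, 1.0, c, behavior_dir, confounder_dir),
            activation(rng, 0.0, c, behavior_dir, confounder_dir),
        ))
    matched = matched_pair_direction(pairs)

    return cosine(naive, behavior_dir), cosine(matched, behavior_dir)


if __name__ == "__main__":
    naive_cos, matched_cos = demo()
    print(f"naive alignment with behavior axis:   {naive_cos:.3f}")
    print(f"matched alignment with behavior axis: {matched_cos:.3f}")
```

In this toy setup the naive direction is pulled toward the confounder axis, while the matched-pair direction stays close to the behavior axis. Real LLM activations are far higher-dimensional and the confounders are not directly observed, so this only illustrates the cancellation idea.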
Supplementary Material: pdf
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6408