How does controllability emerge in language models during pretraining?

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: controllability, ability emergence, pre-training models, dimensionality reduction, representations
TL;DR: We discover and analyse why a model's controllability over different concepts can emerge at different stages of pre-training, and we introduce a method to detect this property.
Abstract: Language models can be intervened upon by steering their internal representations, which alters the degree to which concepts such as emotional tone, style, truthfulness, and safety are expressed in their generative outputs. This paper demonstrates that intervention efficacy, measured by linear steerability (the ability to adjust outputs via linear transformations of hidden states), emerges abruptly during pre-training, and furthermore, that even closely related concepts (e.g., anger and sadness) can emerge at different stages of pre-training. To understand how the steerability of internal representations changes during pre-training, we introduce the "Intervention Detector" (ID), which applies unsupervised learning techniques to hidden states collected under different stimuli and generates concept representations that can be used to steer the text generation of language models. The extracted concept representations are used to compute an ID score, which measures their alignment with the model's hidden states. This ID score approximately predicts when intervention by steering a given concept becomes effective, and the degree to which each concept can be used to intervene. By analyzing ID scores across a longitudinal series of model checkpoints taken at different stages of pre-training, we demonstrate that, as pre-training progresses, concepts become increasingly easier to extract via linear methods, which correlates with the emergence of steerability. For instance, in the CrystalCoder model, the linear steerability of the concept "anger" emerges at 68\% of pre-training, whereas the linear steerability of the concept "sadness" emerges at 93\% of the pre-training process. We use heatmap visualizations and other metrics (e.g., entropy, cosine similarity, t-SNE) to study these differences, and we validate the reliability and generalizability of ID scores through model interventions using the extracted concept representations.
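The pipeline described in the abstract can be illustrated with a minimal sketch (not the authors' implementation): hidden states collected under concept-positive and concept-negative stimuli are reduced to a single linear direction, an alignment score is computed against the model's hidden states, and the same direction is added back to the hidden states to steer generation. The function names, the PCA-based extraction, and the cosine-similarity scoring below are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch (not the authors' code): extract a linear concept
# direction from hidden states and score its alignment, in the spirit of
# the "Intervention Detector" described in the abstract.
import numpy as np

def concept_direction(pos_hidden, neg_hidden):
    """First principal component of paired hidden-state differences.

    pos_hidden, neg_hidden: (n_samples, hidden_dim) arrays of hidden states
    collected under concept-positive vs. concept-negative stimuli.
    """
    diffs = pos_hidden - neg_hidden                    # per-pair difference vectors
    diffs = diffs - diffs.mean(axis=0, keepdims=True)  # center before PCA
    # Leading right-singular vector = first PCA component of the differences.
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0]                                       # unit-norm concept direction

def id_score(hidden, direction):
    """Mean absolute cosine similarity between hidden states and the direction
    (a simple stand-in for the alignment measure the paper calls the ID score)."""
    h = hidden / np.linalg.norm(hidden, axis=1, keepdims=True)
    d = direction / np.linalg.norm(direction)
    return float(np.abs(h @ d).mean())

def steer(hidden, direction, alpha=4.0):
    """Shift hidden states along the concept direction to amplify the concept."""
    return hidden + alpha * direction
```

Under this sketch, tracking the alignment score across pre-training checkpoints would show when a concept becomes linearly extractable, which is the correlate of steerability that the paper reports.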
Primary Area: interpretability and explainable AI
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 10337