TL;DR: We show how to use linear concept vectors in the representation space of a transformer to build a form of logical implication into models.
Abstract: The field of mechanistic interpretability in pre-trained transformer models has produced substantial evidence for the "linear representation hypothesis": the idea that high-level concepts are encoded as vectors in a model's activation space. Studies also show that a model's generation behavior can be steered toward a given concept by adding the concept's vector to the corresponding activations. We show how to leverage these properties to build a form of logical implication into models, enabling transparent and interpretable adjustments that induce a chosen generation behavior in response to the presence of any given concept. Our method, Logical Implication Model Steering (LIMS), unlocks new hand-engineered reasoning capabilities by integrating neuro-symbolic logic into pre-trained transformer models.
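The steering primitive the abstract refers to, adding a concept's vector to a layer's activations, can be illustrated with a minimal sketch. The code below assumes a PyTorch transformer whose residual-stream output can be intercepted with a forward hook; the names `model`, the hooked module path, and `concept_vector` are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of activation steering via a PyTorch forward hook.
# `concept_vector` is a (d_model,) direction extracted elsewhere (illustrative).
import torch


def make_steering_hook(concept_vector: torch.Tensor, strength: float = 1.0):
    """Return a forward hook that adds a concept direction to a layer's output."""
    def hook(module, inputs, output):
        # Many transformer blocks return a tuple; the hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        vec = concept_vector.to(device=hidden.device, dtype=hidden.dtype)
        steered = hidden + strength * vec
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook


# Hypothetical usage on a GPT-2-style model (module path is an assumption):
# handle = model.transformer.h[12].register_forward_hook(
#     make_steering_hook(concept_vector, strength=4.0))
# model.generate(...)
# handle.remove()
```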
Lay Summary: The Logical Implication Model Steering (LIMS) method provides a simple and interpretable way to adjust the behavior of large language models without retraining them. Instead of relying on extensive examples or opaque fine-tuning, LIMS allows users to specify a logical rule such as “if condition $P$ is true, then the model should behave according to $Q$.” For example, one might want a model to respond cautiously when it detects uncertainty or avoid answering when insufficient information is present. LIMS works by identifying internal patterns, called concept vectors, associated with both the condition and the desired behavior. These are derived from a small set of labeled examples. The method then installs a lightweight, targeted circuit inside the model that activates the desired behavior only when the condition is met. This provides an efficient, interpretable mechanism for guiding model outputs in a structured and predictable way.
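The "if $P$ then $Q$" idea in the lay summary can be sketched as a conditional steering hook: estimate a condition direction and a behavior direction from a few labeled examples, then add the behavior direction only when the condition direction is detected in the activations. Everything below (the difference-of-means estimator, the last-token gating, the threshold) is an illustrative assumption, not the paper's exact circuit.

```python
# A minimal sketch of conditional ("P implies Q") steering with PyTorch hooks.
import torch


def difference_of_means(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a unit concept direction from activations of labeled examples.

    pos_acts, neg_acts: (n_examples, d_model) activations with/without the concept.
    """
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()


def make_conditional_hook(p_vec: torch.Tensor, q_vec: torch.Tensor,
                          threshold: float, strength: float = 1.0):
    """Steer toward q_vec only when activations project onto p_vec above a threshold."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output  # (batch, seq, d)
        p = p_vec.to(device=hidden.device, dtype=hidden.dtype)
        q = q_vec.to(device=hidden.device, dtype=hidden.dtype)
        # Detect condition P from the last token's activation (a simplification).
        score = hidden[:, -1, :] @ p                      # (batch,)
        gate = (score > threshold).to(hidden.dtype).view(-1, 1, 1)
        steered = hidden + gate * strength * q            # add Q only where P fires
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook
```

As a design note, the gate here is a hard threshold for readability; a smooth gate (e.g., a sigmoid of the projection) would make the installed circuit differentiable if further tuning were desired.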
Primary Area: Deep Learning->Other Representation Learning
Keywords: Machine Learning, ICML, Mechanistic Interpretability, Interpretability, Safety, LLM, Transformer, Neuro-Symbolic, Language Modelling, Steering
Submission Number: 13790