Learning Distribution-wise Control in Representation Space for Language Models

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC 4.0
TL;DR: We propose learning distribution-wise control in the latent space of language models, which outperforms existing PEFT and other intervention baselines.
Abstract: Interventions in language models (LMs) are applied strategically to steer model behavior during the forward pass. Learnable interventions, also known as representation fine-tuning, aim to apply pointwise control within the concept subspace and have proven effective in altering high-level behaviors. In this work, we extend this approach to the distribution level, enabling the model to learn not only pointwise transformations but also transformations over the surrounding regions of the concept subspace. We demonstrate that these methods perform effectively in early layers, with larger standard deviations correlating strongly with improved performance. Across eight commonsense reasoning and seven arithmetic reasoning benchmarks, our distribution-wise interventions consistently outperform pointwise interventions in controllability and robustness. These results illustrate that distribution-wise interventions provide a more comprehensive method for steering model behavior and enable finer-grained control over language models. The code is at: https://github.com/chili-lab/D-Intervention.
Lay Summary:
- **Problem**: Current methods for controlling AI language models work like adjusting a single point on a dial—they make precise changes but miss the surrounding area where similar beneficial effects might occur. This limits how effectively we can steer these models to behave the way we want them to.
- **Solution**: We developed a new approach that works more like adjusting a region rather than a single point. Instead of making one exact change to how the AI processes information, our method learns to make small variations around that change, exploring the "neighborhood" of possibilities. Think of it like the difference between hitting one specific note on a piano versus playing a gentle chord that includes nearby harmonious notes.
- **Impact**: Our method consistently outperformed existing techniques across 15 different reasoning tasks, showing improvements of 2-4% while using fewer computational resources. More importantly, it made AI models more robust—they maintained better performance even when faced with slightly altered or corrupted inputs. This advancement helps make AI language models more reliable and easier to control, which is crucial as these systems are increasingly used in real-world applications where consistent, predictable behavior matters.
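To make the distribution-wise idea concrete, here is a minimal, hypothetical sketch of how such an intervention on hidden states could look. It is not the authors' released implementation (see the repository linked below for that); the module name `DistributionWiseIntervention`, the Gaussian reparameterization, and the low-rank projection are illustrative assumptions. The key contrast with a pointwise intervention is that the edit applied to the representation is sampled from a learned distribution during training rather than being a single fixed vector-valued transformation.

```python
# Hypothetical sketch of a distribution-wise intervention on hidden states.
# A pointwise intervention adds one fixed learned edit to the representation;
# here we additionally learn a standard deviation and sample the edit from a
# Gaussian during training (reparameterization trick), so the model covers a
# region of the concept subspace rather than a single point.
import torch
import torch.nn as nn


class DistributionWiseIntervention(nn.Module):  # name is an assumption
    def __init__(self, hidden_size: int, rank: int = 8):
        super().__init__()
        # Low-rank projection into an assumed concept subspace.
        self.down = nn.Linear(hidden_size, rank, bias=False)
        self.up = nn.Linear(rank, hidden_size, bias=False)
        # Learned per-dimension log standard deviation of the sampled edit.
        self.log_std = nn.Parameter(torch.zeros(rank))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden_size) from an early transformer layer.
        mean_edit = self.down(hidden)                      # pointwise edit (mean)
        if self.training:
            noise = torch.randn_like(mean_edit)
            edit = mean_edit + noise * self.log_std.exp()  # sample around the mean
        else:
            edit = mean_edit                               # use the mean at inference
        return hidden + self.up(edit)                      # steered representation
```

In this sketch the sampled noise scale plays the role of the "surrounding region" described in the abstract, which reports that larger learned standard deviations correlate with better performance when the intervention is placed in early layers.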
Link To Code: https://github.com/chili-lab/D-Intervention
Primary Area: Deep Learning->Large Language Models
Keywords: Interpretability, Intervention, Model Steering, Representation Learning, Large Language Models
Submission Number: 1388