Multi-property Steering of Large Language Models with Dynamic Activation Composition

Published: 21 Sept 2024, Last Modified: 03 Oct 2024. Venue: BlackboxNLP 2024 ARR Submissions. License: CC BY 4.0
Keywords: Interpretability, Multilingualism, Generation, activation steering, steering vector, conditional generation
TL;DR: We introduce Dynamic Activation Composition, a new strategy for combining steering vectors for multi-property conditioning of LLM generations, achieving high conditioning accuracy while preserving generation fluency.
Abstract: Activation steering methods have been shown to effectively condition language model generation by additively intervening on models' intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.
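The core operations described in the abstract (additive intervention on hidden states, and an information-theoretic schedule for the steering intensity) can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function names, the `alpha_max` cap, and the specific choice of KL divergence between steered and unsteered next-token distributions as the intensity signal are assumptions made for this example.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def dynamic_alpha(base_logits, steered_logits, alpha_max=2.0):
    """Hypothetical per-step intensity schedule: steer harder when the
    steered next-token distribution still diverges from the unsteered
    one, and cap the intensity at alpha_max."""
    kl = kl_divergence(softmax(steered_logits), softmax(base_logits))
    return min(alpha_max, kl)

def steered_hidden(hidden, steering_vec, alpha):
    """Additive intervention on an intermediate representation."""
    return hidden + alpha * steering_vec
```

When the steered and unsteered distributions coincide, the KL term vanishes and the intervention is effectively switched off, which is one way to read "modulating steering intensity throughout generation" to protect fluency.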
Comment: Dear SACs,

We are committing our paper as it offers novel insights combining interpretability aspects with controlled generation, both thriving research areas. Indeed, two of the reviewers are very positive about our work (4.5 and 4). Reading the meta-review, we were surprised to see that **the significance and relevance of our work and the positive aspects the reviewers underscored, particularly in relation to the innovative techniques introduced for multi-property steering, were almost completely overlooked**. Indeed, **the meta-review disproportionately emphasizes the single lowest-score review, disregarding our rebuttal to it and the favorable reviews**. We would like to point out that the low-scoring reviewer did not engage in discussion or acknowledge our rebuttal after our response. We have already left a confidential comment with our concerns regarding the meta-review: https://openreview.net/forum?id=6qioVR5ecT. We encourage the SAC to read the discussion for details, but we summarise the core points here for convenience:

1. The paper focuses on multiple properties rather than multiple models because steering techniques are already known to work across models, which the rebuttal clarified but was not acknowledged.
2. The meta-review's suggestion to improve fluency evaluation lacks specific recommendations, and the fluency evaluation follows standard practices in a controlled environment.
3. The meta-review's characterization of the technique is incorrect, as the steering intensity is only reduced when fluency is impacted, as supported by the results.
4. We chose the interpretability track since the work aligns with techniques based on model internals, consistent with the interpretability community's recent findings.

We hope you'll consider these points during your review. Please feel free to contact us if you need any clarification or additional information.
Paper Link: https://openreview.net/forum?id=6qioVR5ecT
Submission Number: 2