Keywords: Model Editing, Steering, Activation, Interpretability
Abstract: Retrofitting large language models (LLMs) with new behaviors typically requires full finetuning or distillation—costly steps that must be repeated for every architecture. In this work, we introduce ⌘V (Command-V), a backpropagation-free behavior transfer method that copies an existing residual representation adapter from a donor model and pastes its effect into an architecturally different recipient model. ⌘V profiles layer activations on a small prompt set, derives linear converters between corresponding layers, and applies the donor intervention in the recipient’s activation space. This process does not require access to the original training data and needs minimal compute. In three case studies—safety-refusal enhancement, jailbreak facilitation, and automatic chain-of-thought reasoning—⌘V matches the performance of direct finetuning while using orders of magnitude fewer resources.
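A minimal sketch of the recipe the abstract describes, under assumptions: function names, the use of least squares for the linear converters, and all shapes are hypothetical; only the overall idea (fit per-layer linear maps on a small profiling prompt set, then apply the donor adapter's residual edit in the recipient's activation space) follows the text.

```python
# Hypothetical sketch of the Command-V style transfer described above.
# The specific fitting procedure (least squares) and all names are assumptions.
import numpy as np

def fit_converter(donor_acts, recipient_acts):
    """Fit a linear map from donor to recipient activation space.

    donor_acts:     (n_prompts, d_donor)  activations at a donor layer
    recipient_acts: (n_prompts, d_recip)  activations at the paired recipient layer
    """
    W, *_ = np.linalg.lstsq(donor_acts, recipient_acts, rcond=None)
    return W  # shape (d_donor, d_recip)

def transfer_intervention(delta_donor, W):
    """Map the donor adapter's residual-stream edit into the recipient's space."""
    return delta_donor @ W

# Usage with placeholder shapes; real activations would come from forward hooks
# on corresponding layers of the donor and recipient models.
rng = np.random.default_rng(0)
donor_acts = rng.normal(size=(64, 4096))      # 64 profiling prompts, donor width 4096
recipient_acts = rng.normal(size=(64, 3072))  # recipient width 3072
W = fit_converter(donor_acts, recipient_acts)

delta = rng.normal(size=(4096,))              # donor adapter's edit at this layer
delta_recipient = transfer_intervention(delta, W)
# delta_recipient would be added to the recipient's residual stream at the paired layer.
```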
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 20993