TL;DR: A general-purpose steering method to safely reduce errors in language models with calibrated constraints
Abstract: We introduce Mechanistic Error Reduction with Abstention (MERA), a principled framework for steering language models (LMs) to mitigate errors through selective, adaptive interventions. Existing methods rely on fixed, manually tuned steering strengths, which often results in under- or oversteering. MERA addresses these limitations by (i) optimising the intervention direction, and (ii) calibrating when and how much to steer, thereby provably improving performance or abstaining when no confident correction is possible. Experiments across diverse datasets and LM families demonstrate that MERA delivers safe, effective, non-degrading error correction and outperforms existing baselines. Moreover, MERA can be applied on top of existing steering techniques to further enhance their performance, establishing it as a general-purpose and efficient approach to mechanistic activation steering.
Lay Summary: Language models (LMs) often make mistakes, even on very simple tasks like choosing the right answer to a multiple-choice question. Fixing these errors is challenging: current methods often require re-training or rely on trial-and-error prompting, which can be costly and unreliable.
Our method, MERA, takes a different approach. It uses a lightweight helper model to first estimate how likely the LM is to be wrong, and then gently shifts the model’s internal activity to reduce the probability of an error being made. If the input is already predicted to be correct, MERA does nothing. But if the helper model is confident that the LM will make a mistake, MERA steers the LM.
We tested our approach on simple multiple-choice tasks across several LMs and found that it generally helps improve accuracy. This makes MERA an efficient and practical method to apply after training.
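The steer-or-abstain behaviour described in the lay summary can be illustrated with a minimal sketch. It is not taken from the MERA codebase: it assumes the helper model is a linear probe (weights `w`, bias `b`) that maps an activation vector `h` to an error probability, and that steering means the smallest shift along `-w` that brings the predicted error down to a threshold `alpha`. All function and parameter names here are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists of floats."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predicted_error(h, w, b):
    """Hypothetical linear error probe: probability that the LM answers wrongly."""
    return sigmoid(dot(w, h) + b)


def steer_or_abstain(h, w, b, alpha=0.5):
    """Return (activation, steered_flag).

    Assumption: abstain when the probe's predicted error is at most alpha;
    otherwise move h along -w just far enough that the predicted error
    equals alpha (the minimum-norm shift for a linear probe).
    """
    score = dot(w, h) + b
    target = math.log(alpha / (1.0 - alpha))  # logit of the threshold
    if score <= target:
        return list(h), False  # predicted to be fine: leave h untouched
    norm_sq = dot(w, w)
    if norm_sq == 0.0:
        return list(h), False  # degenerate probe: no direction to steer in
    strength = (score - target) / norm_sq  # computed per input, not hand-tuned
    return [x - strength * wi for x, wi in zip(h, w)], True


if __name__ == "__main__":
    w, b = [1.0, -2.0], 0.0
    steered, flag = steer_or_abstain([3.0, 0.5], w, b, alpha=0.5)
    print(flag, steered, predicted_error(steered, w, b))
```

In this sketch the steering strength depends on how far the probe's score sits above the threshold, which is one simple way to avoid a single fixed strength for every input; the actual calibration procedure is described in the paper and the linked repository.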
Link To Code: https://github.com/annahedstroem/MERA-steering
Primary Area: Deep Learning->Large Language Models
Keywords: mechanistic steering, intervention, error mitigation, language models
Submission Number: 4544