TL;DR: Explainability of reward models relies on attribute-specific treatment effects, which can be estimated with imperfect counterfactuals by rewriting twice.
Abstract: Reward models are widely used as proxies for human preferences when aligning or evaluating LLMs.
However, reward models are black boxes, and it is often unclear what, exactly, they are rewarding. In this paper we develop the Rewrite-based Attribute Treatment Estimator (RATE), an effective method for measuring the sensitivity of a reward model to high-level attributes of responses, such as sentiment, helpfulness, or complexity. Importantly, RATE measures the *causal* effect of an attribute on the reward. RATE uses LLMs to rewrite responses, producing imperfect counterfactual examples that can be used to measure causal effects. A key challenge is that these rewrites are imperfect in a manner that can induce substantial bias in the estimated sensitivity of the reward model to the attribute. The core idea of RATE is to adjust for this imperfect-rewrite effect by rewriting *twice*. We establish the validity of the RATE procedure and show empirically that it is an effective estimator.
Lay Summary: Reward models are used to align large language models (LLMs) with human preferences (for example, preferring responses that are more helpful, polite, or concise). But these models are often opaque: it's hard to tell what features of a response they are actually rewarding. Are they responding to helpfulness, or simply to shorter answers? We introduce RATE, which measures how much specific attributes, such as sentiment or length, influence a reward model's scores. RATE works by rewriting responses to isolate individual attributes and observing how the model's score changes. Because these rewrites are imperfect, RATE applies a correction based on rewriting twice, which reduces the resulting measurement error. This way, when attempts to make an LLM helpful, polite, or concise fail, we can determine whether the reward model was at fault or whether something else in the alignment process went wrong.
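To make the double-rewrite idea concrete, here is a minimal sketch in Python. The `rewrite` and `reward` helpers and the `treated_responses` input are hypothetical stand-ins for an LLM rewriter, the reward model under study, and a set of responses that have the attribute; this is an illustration of the idea, not the released implementation (see the repository linked below for the actual code).

```python
# Minimal sketch of the double-rewrite contrast (hypothetical helpers, not the released code).

def rate_att_sketch(treated_responses, rewrite, reward):
    """Estimate how much an attribute changes a reward model's score,
    averaged over responses that originally have the attribute.

    rewrite(text, target) -> LLM rewrite of `text` with the attribute set to `target`
    reward(text)          -> scalar score from the reward model under study
    """
    effects = []
    for x in treated_responses:                # each x has the attribute (W = 1)
        x_without = rewrite(x, target=0)       # first rewrite: remove the attribute
        x_with = rewrite(x_without, target=1)  # second rewrite: restore the attribute
        # Both sides of the contrast are rewritten texts, so the bias introduced by
        # the rewriting process itself appears in both terms and approximately cancels.
        effects.append(reward(x_with) - reward(x_without))
    return sum(effects) / len(effects)
```

Contrasting the original response with its single rewrite would conflate the attribute's effect with artifacts of rewriting; comparing two rewritten texts is what lets the rewrite bias cancel.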
Link To Code: https://github.com/toddnief/RATE
Primary Area: Deep Learning->Large Language Models
Keywords: Explainable AI, Counterfactual Explanations, Causal Inference, Alignment, Large Language Model, NLP
Submission Number: 7895