Keywords: Controllability, inference-time alignment, constraints, tractable probabilistic reasoning
TL;DR: We leverage the gradient information of a constraint verifier to efficiently perform probabilistic inference over *all* generations satisfying the constraint, enabling precise steering of LM generations.
Abstract: Semantic control entails steering LM generations towards satisfying subtle non-lexical
constraints, e.g., toxicity, sentiment, or politeness, attributes that
can be captured by a sequence-level *verifier*.
It can thus be viewed as sampling from the LM distribution conditioned on the target
attribute, a computationally intractable problem due to the non-decomposable nature
of the verifier.
Existing approaches to LM control either handle only syntactic constraints, which cannot capture the aforementioned attributes, or rely on sampling to explore the conditional LM distribution, an ineffective estimator for low-probability events.
In this work, we leverage a verifier's gradient information to efficiently reason
over *all* generations that satisfy the target attribute, enabling precise
steering of LM generations by reweighting the next-token distribution.
Starting from an initial sample, we create a local LM distribution favoring semantically
similar sentences.
This approximation enables the tractable computation of an *expected sentence embedding*.
We combine this expected embedding with the verifier's evaluation at the initial sample to estimate the probability of satisfying the constraint, which directly determines the update to the next-token distribution.
We evaluate our approach on controlling the toxicity, sentiment, and topic adherence of LM generations, yielding outputs that satisfy the constraint with high probability without degrading their quality.
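The following is a minimal sketch of the reweighting idea described above, not the paper's actual implementation: it assumes hypothetical stand-ins (`toy_embed`, a precomputed verifier value and gradient, and toy candidate continuations) and shows how a gradient-based linearization of the verifier, evaluated at an expected sentence embedding, could reweight the next-token distribution.

```python
# Sketch only: all names here are illustrative assumptions, not the paper's code.
import numpy as np

def expected_embedding(candidate_sentences, local_probs, embed):
    """Expected sentence embedding E_q[phi(y)] under a local distribution q
    over sentences semantically close to the initial sample."""
    embs = np.stack([embed(s) for s in candidate_sentences])   # (N, d)
    return (local_probs[:, None] * embs).sum(axis=0)           # (d,)

def constraint_prob_estimate(expected_emb, init_emb, verifier_value, verifier_grad):
    """First-order estimate of the probability of satisfying the constraint,
    obtained by linearizing the verifier around the initial sample's embedding:
      v(phi(y)) ~= v(phi(y0)) + grad_v(phi(y0)) . (phi(y) - phi(y0)),
    so only the *expected* embedding is needed."""
    return verifier_value + verifier_grad @ (expected_emb - init_emb)

def reweight_next_token(lm_probs, constraint_probs_per_token, temperature=1.0):
    """Reweight the LM next-token distribution by the estimated probability
    that each continuation satisfies the constraint."""
    scores = lm_probs * np.clip(constraint_probs_per_token, 1e-6, None) ** (1.0 / temperature)
    return scores / scores.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, vocab = 8, 5

    # Toy stand-in for a sentence embedder (hypothetical).
    def toy_embed(sentence):
        local = np.random.default_rng(abs(hash(sentence)) % (2 ** 32))
        return local.normal(size=d)

    init_emb = toy_embed("an initial sample sentence")
    verifier_value = 0.6                      # verifier's score at the initial sample
    verifier_grad = rng.normal(size=d) * 0.1  # its gradient w.r.t. the embedding

    lm_probs = rng.dirichlet(np.ones(vocab))
    constraint_probs = np.empty(vocab)
    for t in range(vocab):
        # Toy local distribution of continuations induced by candidate token t.
        candidates = [f"candidate continuation {t}-{j}" for j in range(4)]
        local_probs = rng.dirichlet(np.ones(len(candidates)))
        exp_emb = expected_embedding(candidates, local_probs, toy_embed)
        constraint_probs[t] = constraint_prob_estimate(
            exp_emb, init_emb, verifier_value, verifier_grad)

    print(reweight_next_token(lm_probs, constraint_probs))
```

The design point this illustrates is that, under the linearization, the intractable expectation over all constraint-satisfying generations reduces to a dot product with a single expected embedding, which is what makes the per-token reweighting cheap.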
Submission Number: 119