When Inference-Time Reward Steering Hacks the Reward
Keywords: Inference-time steering, Reward Hacking, Flow Map, Image Generation
Abstract: We characterise inference-time reward steering on a class-conditional flow-map sampler (Potaptchik et al., 2026) across 50 ImageNet (target, prompt) pairs spanning a range of CLIP text-distance. Two regimes emerge. When the prompt class is close to the conditioning class, the gradient moves the trajectory between classes legitimately (panda→lion: held-out ViT P(lion)=0.81). Past a threshold in CLIP text-distance, the steered samples retain a high reward score (HPSv2 0.34, near the matched-class value) but are rejected by two held-out classifiers (P(prompt) ≈ 0.14), independently flagged by ImageReward (−0.71 vs. matched +0.30), and lie far from either class manifold in DINOv2 feature space. The pattern replicates when the steering reward is PickScore. We further report a classifier-validity tilt r+λπ that trades steering-reward magnitude for class preservation along a well-behaved λ-Pareto curve; we frame it as a tradeoff, not a fix. All four audit signals share a CLIP-style pretraining ecosystem, so we treat them as proxies rather than ground truth and present the work as a case study under one generative setup.
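The tilt r+λπ mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not define π or say how the tilt enters the sampler, so everything here is an assumption for illustration: π is taken to be a per-sample classifier validity score in [0, 1], the tilt is applied as a simple best-of-N selection over pre-scored candidates rather than as a gradient through the flow map, and the function names and toy numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def tilted_score(reward, validity, lam):
    """Combined objective r + lam * pi for each candidate (pi is an assumed validity score)."""
    reward = np.asarray(reward, dtype=float)
    validity = np.asarray(validity, dtype=float)
    return reward + lam * validity

def pareto_sweep(reward, validity, lambdas):
    """For each lam, pick the candidate maximising r + lam * pi.

    Returns a list of (lam, reward_of_pick, validity_of_pick) tuples,
    tracing how the selected sample moves along the tradeoff as lam grows.
    """
    reward = np.asarray(reward, dtype=float)
    validity = np.asarray(validity, dtype=float)
    rows = []
    for lam in lambdas:
        best = int(np.argmax(tilted_score(reward, validity, lam)))
        rows.append((float(lam), float(reward[best]), float(validity[best])))
    return rows

# Toy, made-up scores for three candidate samples (not data from the paper).
toy_reward = [0.9, 0.7, 0.2]    # hypothetical steering-reward values
toy_validity = [0.1, 0.5, 0.9]  # hypothetical classifier-validity values
curve = pareto_sweep(toy_reward, toy_validity, [0.0, 1.0, 4.0])
for lam, r, pi in curve:
    print(f"lambda={lam:.1f}  reward={r:.2f}  validity={pi:.2f}")
```

In this toy setting, increasing λ moves the selected candidate from the highest-reward sample towards the highest-validity one, which is the kind of tradeoff the λ-Pareto curve describes; it gives up reward rather than removing the underlying failure mode.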
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 206