Reward hacking is a known challenge in reinforcement learning: continuing to optimize against the reward model after convergence can degrade generation quality. In our setting, it occurs when the model over-fits to semantic alignment with the prompt while neglecting realistic motion dynamics.
As illustrated in the videos below, the model misinterprets prompts by over-fitting to specific actions. For example, the instruction "lifts their right foot" may produce continuous, excessive lifting. Similarly, a sequential prompt such as "squats down, then stands up and moves forward" may be incorrectly generated as "squats down while moving forward."
Fortunately, this phenomenon can be effectively mitigated: combining our method with KL-divergence regularization, which penalizes the policy for drifting away from the reference model, robustly suppresses reward hacking.
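As a rough illustration of how such a penalty works, the sketch below shows the KL-regularized reward shaping commonly used in RLHF-style fine-tuning: the reward-model score is offset by an estimate of the divergence between the current policy and a frozen reference policy. The function name, the coefficient `beta`, and the per-token Monte-Carlo KL estimate are illustrative assumptions, not the exact formulation used here.

```python
def kl_regularized_reward(reward, logprobs, ref_logprobs, beta=0.1):
    """Shape a (possibly hackable) reward with a KL penalty.

    reward:        scalar score from the reward model for one sample
    logprobs:      log-probs of the sampled tokens under the current policy
    ref_logprobs:  log-probs of the same tokens under the frozen reference
    beta:          penalty coefficient (illustrative value, not tuned)
    """
    # Monte-Carlo estimate of KL(pi || pi_ref) over the sampled tokens:
    # E_pi[log pi - log pi_ref], summed across the generated sequence.
    kl = sum(lp - rlp for lp, rlp in zip(logprobs, ref_logprobs))
    # Subtracting the scaled KL discourages the policy from drifting far
    # from the reference model even when the reward model is exploitable.
    return reward - beta * kl
```

With `beta` large enough, exploiting the reward model (e.g., exaggerating a single action) costs more in KL penalty than it gains in reward, which is the mechanism behind the mitigation described above.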