Keywords: LLM-as-a-Judge, Reward Models, Reasoning Verification, Reward Hacking, Adversarial Evaluation
TL;DR: We introduce FluffInjector, a benchmark showing that LLM judges often accept logically broken reasoning, and we train SmartRM to robustly detect such failures.
Abstract: Large Language Models (LLMs) are increasingly used as judges and reward models in alignment pipelines, where their scores shape learned behavior. Prior work shows these judges can be manipulated by superficial openers (e.g., “Thought process:” or “Let’s solve this step by step.”), but vulnerabilities in verifying intermediate reasoning remain underexplored. We identify Fluff Injection, a failure mode in which a logically necessary step in a chain of reasoning is replaced with plausible-sounding commentary (e.g., “Let’s slow down and check our negatives here”). To measure this failure mode, we introduce FluffInjector, a benchmark of minimal pairs: for each problem, we generate a GOOD chain and a FLUFF chain that keeps the same step count and final answer while replacing 25–40% of steps with non-inferential filler. Evaluating frontier judges (GPT-4.1, DeepSeek-V3.1, Qwen2.5-7B-Instruct), we find they frequently validate FLUFF chains, indicating a strong reliance on surface coherence. Using FluffInjector, we fine-tune SmartRM, a verifier trained to emphasize step-to-step logical continuity. SmartRM reduces the false-positive rate from 37.43% (GPT-4.1) to 2.68% and achieves 97.27% overall verification accuracy.
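The FLUFF-chain construction described in the abstract can be sketched as follows. This is a minimal illustration, assuming a chain is a list of step strings whose final entry carries the answer; the function name `make_fluff_chain` and the filler templates are hypothetical, not the authors’ released code.

```python
import random

# Hypothetical non-inferential filler in the spirit of the paper's example.
FILLER_TEMPLATES = [
    "Let's slow down and check our negatives here.",
    "Good, everything looks consistent so far.",
    "Before moving on, it's worth pausing to double-check the setup.",
]

def make_fluff_chain(good_steps, frac_lo=0.25, frac_hi=0.40, seed=0):
    """Replace 25-40% of intermediate steps with filler, preserving the
    step count and the final (answer-bearing) step, as the benchmark
    description requires."""
    rng = random.Random(seed)
    steps = list(good_steps)
    # Only intermediate steps are eligible; the final answer stays verbatim.
    eligible = list(range(len(steps) - 1))
    n_replace = max(1, round(rng.uniform(frac_lo, frac_hi) * len(steps)))
    for i in rng.sample(eligible, min(n_replace, len(eligible))):
        steps[i] = rng.choice(FILLER_TEMPLATES)
    return steps
```

Because the FLUFF chain matches the GOOD chain in length and final answer, any gap in a judge’s score between the two isolates sensitivity to logical continuity rather than to surface features.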
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 95