Track: long paper (up to 10 pages)
Keywords: reasoning, benchmarking, steering
Abstract: Large Language Models (LLMs) are increasingly used for mathematical assistance and evaluation, yet they often exhibit sycophancy: bending reasoning or judgments toward a user’s stated beliefs or preferred answers at the expense of correctness.
While this effect has been studied thoroughly in conventional conversational uses of LLMs, its potential drawbacks in reasoning tasks have remained much less clear.
In this work, we propose benchmarks for this failure mode in two mathematical reasoning settings: multimodal solution grading and fake-task solving.
For the latter, we introduce a scalable construction of contradictory problems based on iGSM. For example, GPT 5.2 (High) exhibited sycophantic behavior on 36.03% of synthetic fake tasks (70.24% when excluding samples on which the model was not competent enough).
Leveraging this benchmark, we find that sycophancy in reasoning models is common and, importantly, is amplified by RLHF (Reinforcement Learning from Human Feedback): applying a state-of-the-art preference optimization procedure (SimPO) increases the number of sycophantic failures.
Finally, we show that sycophancy can be reduced with a popular mechanistic interpretability technique: steering vectors.
Our findings highlight an important weakness in LLM reasoning and offer a step toward mitigating it.
More broadly, our work questions the post-training lifecycle of modern reasoning LLMs.
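To make the steering-vector mitigation concrete, the sketch below shows one common activation-steering recipe: extract a direction as the mean difference of residual-stream activations between prompts with and without a stated user belief, then subtract that direction during generation via a forward hook. This is a minimal illustration under assumptions; the model name, layer index, coefficient, and prompt pairs are hypothetical placeholders, and the paper's actual procedure may differ.

```python
# Minimal sketch of activation steering against sycophancy (illustrative only).
# Model name, layer index, coefficient, and prompt pairs are hypothetical
# placeholders, not the setup used in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16    # hypothetical residual-stream layer to steer
ALPHA = -4.0  # negative coefficient pushes activations away from the sycophantic direction

# Contrastive prompt pairs: the same task stated with and without a user belief.
pairs = [
    ("User: I am sure the answer is 12. Solve: 3 + 4 * 2 = ?",
     "User: Solve: 3 + 4 * 2 = ?"),
]

def last_token_resid(text: str) -> torch.Tensor:
    """Residual-stream activation of the last token at the chosen layer."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1, :]

# Steering vector = mean(activations with user belief) - mean(neutral activations).
diffs = [last_token_resid(a) - last_token_resid(b) for a, b in pairs]
steer = torch.stack(diffs).mean(dim=0)
steer = steer / steer.norm()

def hook(_module, _inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(hook)
prompt = "User: I believe the answer is 12. Solve: 3 + 4 * 2 = ?\nAssistant:"
ids = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=64)[0], skip_special_tokens=True))
handle.remove()
```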
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 139