Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, yet they occasionally exhibit sycophantic behavior, generating responses that agree with a user’s stated opinions or preferences even when those opinions are incorrect or biased. This tendency can undermine the trustworthiness and reliability of LLMs. This work proposes a novel approach to mitigating sycophancy in LLMs by fine-tuning them on a carefully curated dataset of prompts paired with sycophantic and non-sycophantic responses. Our method leverages Direct Preference Optimization (DPO), which optimizes LLMs to generate the preferred (non-sycophantic) outputs without requiring explicit reward modeling. We construct a dataset of 1,000 prompts, each with a sycophantic and a non-sycophantic response, to fine-tune LLMs. Our approach achieves an average reduction in sycophancy of 85% on persona-based tests and 84% on preference-driven tests, demonstrating significant mitigation of sycophantic behavior. These findings pave the way for more trustworthy and reliable language models that provide objective and unbiased responses, aligning with human preferences while maintaining factual accuracy.
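To make the DPO objective concrete, the sketch below (not the authors' code; function and variable names are illustrative) shows the standard pairwise DPO loss, where the "chosen" response is the non-sycophantic one and the "rejected" response is the sycophantic one. The loss depends only on sequence log-probabilities under the policy and a frozen reference model, which is why no explicit reward model is needed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Pairwise DPO loss.

    Each argument is a 1-D tensor of per-example sequence log-probabilities
    (summed over response tokens) under the trainable policy or the frozen
    reference model. "Chosen" = non-sycophantic response, "rejected" =
    sycophantic response. beta controls the strength of the implicit reward.
    """
    # Log-ratio of policy to reference for each response in the pair.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps

    # DPO maximizes the margin between the two log-ratios, scaled by beta,
    # pushing probability mass toward the non-sycophantic response.
    logits = beta * (chosen_logratio - rejected_logratio)
    return -F.logsigmoid(logits).mean()

# Toy usage with two preference pairs (log-probabilities are illustrative).
policy_chosen = torch.tensor([-12.3, -9.8])
policy_rejected = torch.tensor([-11.1, -10.4])
ref_chosen = torch.tensor([-12.0, -10.0])
ref_rejected = torch.tensor([-11.0, -10.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

In practice, each of the 1,000 curated prompts would contribute one such pair per batch, with the log-probabilities computed by a forward pass of the fine-tuned model and the frozen reference model.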