EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models

ICLR 2026 Conference Submission 16426 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Sycophancy, Large Vision-Language Models, Medical VQA Benchmark
Abstract: Recent benchmarks for medical Large Vision-Language Models (LVLMs) focus primarily on task-specific performance metrics, such as accuracy in visual question answering. However, focusing exclusively on leaderboard accuracy risks neglecting critical issues of model reliability and safety in practical diagnostic scenarios. One significant yet underexplored issue is sycophancy: the propensity of models to uncritically align with user-provided information, creating an echo chamber that amplifies rather than mitigates user biases. While previous studies have investigated sycophantic behavior in text-only large language models (LLMs), its manifestation in LVLMs, particularly in high-stakes medical contexts, remains largely unexplored. To address this gap, we introduce EchoBench, to the best of our knowledge the first benchmark specifically designed to systematically evaluate sycophantic tendencies in medical LVLMs. EchoBench comprises 2,122 medical images spanning 18 clinical departments and 20 imaging modalities, paired with 90 carefully designed prompts that simulate biased inputs from patients, medical students, and physicians. Beyond overall sycophancy rates, we conduct fine-grained analyses across bias types, clinical departments, perceptual granularity, and imaging modalities. We evaluate a range of advanced LVLMs, including medical-specific, open-source, and proprietary models. Our results reveal substantial sycophantic tendencies across all evaluated models: the best-performing proprietary model, Claude 3.7 Sonnet, still exhibits a non-trivial sycophancy rate of 45.98%, and even the recently released GPT-4.1 shows a higher rate of 59.15%. Notably, most medical-specific models exhibit extremely high sycophancy rates (above 95%) while achieving only moderate accuracy. Our findings indicate that sycophancy is a widespread and persistent issue in current medical LVLMs and uncover several key factors that shape model susceptibility to sycophantic behavior. Detailed analyses of the experimental results further suggest that building high-quality medical training datasets spanning diverse dimensions and strengthening domain knowledge are essential for mitigating sycophantic tendencies in medical LVLMs.
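To make the headline metric concrete, below is a minimal sketch of how a sycophancy rate of this kind might be computed: answer each image/question pair once with a neutral prompt and once with a biased suggestion, then measure how often an initially correct answer flips to the user's incorrect suggestion. The `EvalRecord` structure, its field names, and the choice to condition on initially correct answers are illustrative assumptions, not the paper's published protocol.

```python
from dataclasses import dataclass

@dataclass
class EvalRecord:
    """One benchmark item: the model's answers with and without a biased prompt."""
    neutral_answer: str    # model's answer to the plain image/question pair
    biased_answer: str     # model's answer after the biased user suggestion is added
    ground_truth: str      # reference answer for the item
    suggested_answer: str  # the incorrect answer the biased prompt pushes toward

def sycophancy_rate(records: list[EvalRecord]) -> float:
    """Share of initially correct answers that flip to the user's incorrect
    suggestion once the biased prompt is introduced (hypothetical definition)."""
    eligible = [r for r in records if r.neutral_answer == r.ground_truth]
    if not eligible:
        return 0.0
    flipped = sum(r.biased_answer == r.suggested_answer for r in eligible)
    return flipped / len(eligible)

# Example: two items answered correctly at first; one flips under bias -> 50%.
records = [
    EvalRecord("pneumonia", "pneumonia", "pneumonia", "tuberculosis"),
    EvalRecord("pneumonia", "tuberculosis", "pneumonia", "tuberculosis"),
]
print(f"sycophancy rate: {sycophancy_rate(records):.2%}")
```

Under this reading, a model can score well on plain accuracy yet still have a high sycophancy rate, which matches the paper's observation that medical-specific models combine moderate accuracy with above-95% sycophancy.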
Primary Area: datasets and benchmarks
Submission Number: 16426