Abstract: Subjective evaluation tasks, including critical analysis and rating, sit at the top of Bloom’s Taxonomy. They have emerged as new pathways for evaluating Language Models (LMs) in settings where correctness is relative. While LMs present diverse and human-aligned opinions on such tasks, the confidence and reliability of those opinions remain unexplored. We take a deeper look at the reliability of LMs for subjective evaluation by focusing on one such task, focus group surveys. LMs act as participants by completing survey questionnaires about diverse physical products. Participants must verbalize their opinions and product details in order to aid business organizations in their commercial goals. While survey responses are diverse, detailed and aligned with human intent, we find that participants are overconfident in their responses. Models often confabulate product appearance, shape and haptic feedback with high self-reported confidence. We address overconfidence with a surgical approach. We find that (1) the choice of prompt prefix and (2) steering guidance at earlier layers are pivotal in mitigating overconfidence. Following our desiderata that participants possess long-term awareness and diversity in viewpoints, we propose a framework that minimizes overconfidence using prefix intensity and teacher-guided steering. Our collective recommendations, termed the Over-Confidence Checklist (OCC), aid in minimizing rating confidence and customizing it into pre-determined quantiles. We empirically validate that following the OCC leads to reliable confidence ratings while grounding responses in truthful product-specific details. Survey datasets and code will be released in the final version.
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Xuanjing_Huang1
Submission Number: 8941