Abstract: Large Language Models (LLMs) have shown promise in automating high-labor data tasks, but their adoption in high-stakes scenarios faces two key challenges: their tendency to answer despite uncertainty and their difficulty handling long input contexts robustly.
We investigate commonly used off-the-shelf LLMs' ability to identify low-confidence outputs for human review through "check set selection"--a process where LLMs prioritize information needing human judgment.
Using a case study on social media monitoring for disaster risk management,
we define the “check set” as the list of tweets for which the LLM is least confident, which are escalated to the disaster manager to enable human oversight within a budgeted effort.
We test two strategies for LLM check set selection: *individual confidence elicitation*, where the LLM assesses confidence for each tweet classification separately, requiring more prompts with shorter contexts, and *direct set confidence elicitation*, where the LLM evaluates confidence for a list of tweet classifications at once, requiring fewer prompts but longer contexts.
Our results reveal that set selection via individual probabilities is more reliable but that direct set confidence merits further investigation.
Challenges for direct set selection include inconsistent outputs, incorrect check set sizes, and low inter-annotator agreement.
Despite these challenges, our approach improves collaborative disaster tweet classification by outperforming random-sample check set selection, demonstrating the potential of human-LLM collaboration.
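To make the distinction between the two elicitation strategies concrete, below is a minimal sketch of how each could be implemented; it is not the authors' code, and the function names, prompt wording, and `ask_llm` helper are hypothetical placeholders.

```python
# Hypothetical sketch of the two check-set selection strategies.
# `ask_llm` stands in for any chat-completion call returning a string.

def select_check_set_individual(tweets, classifications, budget, ask_llm):
    """Individual confidence elicitation: one prompt per tweet,
    then escalate the `budget` lowest-confidence classifications."""
    scored = []
    for tweet, label in zip(tweets, classifications):
        prompt = (
            f"Tweet: {tweet}\nPredicted label: {label}\n"
            "On a scale from 0 to 1, how confident are you in this label? "
            "Answer with a number only."
        )
        confidence = float(ask_llm(prompt))  # short context, many prompts
        scored.append((confidence, tweet))
    scored.sort(key=lambda pair: pair[0])  # least confident first
    return [tweet for _, tweet in scored[:budget]]


def select_check_set_direct(tweets, classifications, budget, ask_llm):
    """Direct set confidence elicitation: a single long-context prompt
    asking the LLM for the `budget` items it is least sure about."""
    listing = "\n".join(
        f"{i}. {tweet} -> {label}"
        for i, (tweet, label) in enumerate(zip(tweets, classifications))
    )
    prompt = (
        f"{listing}\n\nList the indices of the {budget} classifications "
        "you are least confident about, separated by commas."
    )
    reply = ask_llm(prompt)  # one prompt, long context
    indices = [int(tok) for tok in reply.split(",") if tok.strip().isdigit()]
    return [tweets[i] for i in indices[:budget]]
```

The sketch also illustrates the failure modes noted above for direct selection: the model's reply may not parse cleanly or may name fewer or more indices than the requested budget.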
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: crisis NLP, prompting, LLM evaluation, human-in-the-loop
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 4960