Abstract: Large Language Models (LLMs) have shown promise in automating high-labor data tasks, but their adoption in high-stakes scenarios faces two key challenges: their tendency to answer despite uncertainty and their difficulty handling long input contexts robustly.
We investigate commonly used off-the-shelf LLMs' ability to identify low-confidence outputs for human review through "check set selection"--a process where LLMs prioritize information needing human judgment.
Using a case study on social media monitoring for disaster risk management,
we define the “check set” as the list of tweets for which the LLM is least confident, which are escalated to the disaster manager to enable human oversight within a budgeted effort.
We test two strategies for LLM check set selection: *individual confidence elicitation*, where the LLM assesses confidence for each tweet classification separately, requiring more prompts with shorter contexts, and *direct set confidence elicitation*, where the LLM evaluates confidence for a list of tweet classifications at once, requiring fewer prompts but longer contexts.
Our results reveal that set selection via individual probabilities is more reliable but that direct set confidence merits further investigation.
Challenges for direct set selection include inconsistent outputs, incorrect check set sizes, and low inter-annotator agreement.
Despite these challenges, our approach improves collaborative disaster tweet classification by outperforming random-sample check set selection, demonstrating the potential of human-LLM collaboration.
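To make the distinction between the two elicitation strategies concrete, below is a minimal sketch of how each could be implemented; it is not the authors' code, and the function names, prompt wording, and `ask_llm` helper are hypothetical placeholders.

```python
# Hypothetical sketch of the two check-set selection strategies.
# `ask_llm` stands in for any chat-completion call returning a string.

def select_check_set_individual(tweets, classifications, budget, ask_llm):
    """Individual confidence elicitation: one prompt per tweet,
    then escalate the `budget` lowest-confidence classifications."""
    scored = []
    for tweet, label in zip(tweets, classifications):
        prompt = (
            f"Tweet: {tweet}\nPredicted label: {label}\n"
            "On a scale from 0 to 1, how confident are you in this label? "
            "Answer with a number only."
        )
        confidence = float(ask_llm(prompt))  # short context, many prompts
        scored.append((confidence, tweet))
    scored.sort(key=lambda pair: pair[0])  # least confident first
    return [tweet for _, tweet in scored[:budget]]


def select_check_set_direct(tweets, classifications, budget, ask_llm):
    """Direct set confidence elicitation: a single long-context prompt
    asking the LLM for the `budget` items it is least sure about."""
    listing = "\n".join(
        f"{i}. {tweet} -> {label}"
        for i, (tweet, label) in enumerate(zip(tweets, classifications))
    )
    prompt = (
        f"{listing}\n\nList the indices of the {budget} classifications "
        "you are least confident about, separated by commas."
    )
    reply = ask_llm(prompt)  # one prompt, long context
    indices = [int(tok) for tok in reply.split(",") if tok.strip().isdigit()]
    return [tweets[i] for i in indices[:budget]]
```

The sketch also illustrates the failure modes noted above for direct selection: the model's reply may not parse cleanly or may name fewer or more indices than the requested budget.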
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: crisis NLP, prompting, LLM evaluation, human-in-the-loop
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 4960