Keywords: crowdsourcing, uncertainty quantification, Bayesian network, question answering, trustworthy LLM
Abstract: Concerns persist over the trustworthiness of large language models (LLMs) because they generate plausible but incorrect information, a phenomenon known as hallucination. Existing approaches focus on identifying false answers or improving correctness by sampling responses from a single LLM; querying multiple LLMs, which exhibit complementary strengths, remains largely unexplored. In this work, we propose a Bayesian crowdsourcing approach to aggregating answers from multiple LLMs and quantifying their uncertainty. Extending the Dawid-Skene model, we treat LLMs as annotators, use their answer probabilities as noisy observations of truthfulness, model semantic relations between answers through the covariance structure, and jointly learn each LLM's reliability and calibration as model parameters. Validated on three open-domain question answering datasets, our approach outperforms existing statistical and agentic methods at abstaining from false answers and identifying truthful ones, offering a robust, scalable solution for uncertainty quantification and truth discovery in LLM outputs.
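To make the Dawid-Skene-style aggregation concrete, here is a minimal sketch, not the authors' implementation: it binarizes each LLM's answer confidence into a vote and learns per-LLM sensitivity/specificity and posterior truthfulness via EM, omitting the paper's continuous observation model, semantic covariance structure, and calibration parameters. All names and the simulated data are illustrative assumptions.

```python
# Toy Dawid-Skene-style aggregation of multiple LLM "annotators" (sketch only).
import numpy as np

def dawid_skene_binary(votes, n_iter=50, tol=1e-6):
    """votes: (n_llms, n_answers) array of 0/1 judgments on whether each
    candidate answer is truthful. Returns (posterior truth probabilities,
    per-LLM sensitivity, per-LLM specificity)."""
    n_llms, n_items = votes.shape
    post = votes.mean(axis=0)          # initialize posteriors with majority vote
    sens = np.full(n_llms, 0.7)        # P(vote=1 | answer is true)
    spec = np.full(n_llms, 0.7)        # P(vote=0 | answer is false)
    prior = 0.5
    for _ in range(n_iter):
        # E-step: posterior that each answer is true, given all votes.
        log_p1 = np.log(prior) + (
            votes * np.log(sens[:, None]) + (1 - votes) * np.log(1 - sens[:, None])
        ).sum(axis=0)
        log_p0 = np.log(1 - prior) + (
            (1 - votes) * np.log(spec[:, None]) + votes * np.log(1 - spec[:, None])
        ).sum(axis=0)
        new_post = 1.0 / (1.0 + np.exp(log_p0 - log_p1))
        # M-step: re-estimate per-LLM reliability and the truth prior (smoothed).
        sens = (votes @ new_post + 1.0) / (new_post.sum() + 2.0)
        spec = ((1 - votes) @ (1 - new_post) + 1.0) / ((1 - new_post).sum() + 2.0)
        prior = new_post.mean()
        if np.abs(new_post - post).max() < tol:
            post = new_post
            break
        post = new_post
    return post, sens, spec

if __name__ == "__main__":
    # Simulated example: three hypothetical LLMs with different reliabilities.
    rng = np.random.default_rng(0)
    truth = rng.integers(0, 2, size=200)
    reliabilities = np.array([0.9, 0.8, 0.6])
    votes = np.array([np.where(rng.random(200) < r, truth, 1 - truth)
                      for r in reliabilities])
    post, sens, spec = dawid_skene_binary(votes)
    print("estimated sensitivities:", np.round(sens, 2))
    print("accuracy of aggregated labels:", ((post > 0.5) == truth).mean())
```

The posterior truth probabilities can be thresholded to abstain from low-confidence answers, which mirrors the abstaining behavior evaluated in the paper, though the full model operates on continuous answer probabilities rather than binary votes.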
Submission Number: 74