Keywords: large language models, uncertainty, calibration, risk scores, benchmark, census, tabular data
TL;DR: We evaluate risk score distributions generated by LLMs on real-world tasks, and draw insights into LLMs' inability to express aleatoric data uncertainty.
Abstract: Current question-answering benchmarks predominantly focus on accuracy in realizable prediction tasks.
Conditioned on a question and answer-key, does the most likely token match the ground truth?
Such benchmarks necessarily fail to evaluate LLMs' ability to quantify ground-truth outcome uncertainty.
In this work, we focus on the use of LLMs to produce risk scores for unrealizable prediction tasks.
We introduce folktexts, a software package to systematically generate risk scores using LLMs, and evaluate them against US Census data products.
A flexible API enables the use of different prompting schemes, local or web-hosted models, and diverse census columns that can be used to compose custom prediction tasks.
We evaluate 17 recent LLMs across five proposed benchmark tasks.
We find that zero-shot risk scores produced by multiple-choice question-answering have high predictive signal but are widely miscalibrated.
Base models consistently overestimate outcome uncertainty, while instruction-tuned models underestimate uncertainty and produce over-confident risk scores.
In fact, instruction-tuning polarizes the answer distribution regardless of the true underlying data uncertainty.
This reveals a general inability of instruction-tuned models to express data uncertainty using multiple-choice answers.
A separate experiment using verbalized chat-style risk queries yields substantially improved calibration across instruction-tuned models.
These differences in the ability to quantify data uncertainty cannot be revealed in realizable settings, and highlight a blind spot in the current evaluation ecosystem that folktexts covers.
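For intuition, here is a minimal, self-contained sketch (not the folktexts API) of the kind of evaluation described above: deriving a zero-shot risk score from a model's probabilities over multiple-choice answer tokens, and checking calibration with expected calibration error (ECE). All names (e.g., `risk_score_from_choice_logprobs`, `yes_logprob`) and the toy numbers are hypothetical, assumed for illustration only.

```python
import numpy as np


def risk_score_from_choice_logprobs(yes_logprob: float, no_logprob: float) -> float:
    """Renormalize log-probabilities of the 'Yes'/'No' answer tokens into a
    probability for the positive outcome (the risk score)."""
    logits = np.array([yes_logprob, no_logprob])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return float(probs[0])


def expected_calibration_error(scores: np.ndarray, labels: np.ndarray, n_bins: int = 10) -> float:
    """Bin predictions by score and average the gap between the mean score and
    the empirical outcome rate in each bin, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (scores >= lo) & (scores <= hi) if hi == 1.0 else (scores >= lo) & (scores < hi)
        if mask.any():
            ece += mask.mean() * abs(scores[mask].mean() - labels[mask].mean())
    return ece


# Toy usage with synthetic data, purely for illustration.
rng = np.random.default_rng(0)
scores = np.array([risk_score_from_choice_logprobs(y, n)
                   for y, n in rng.normal(size=(1000, 2))])
labels = rng.binomial(1, scores)  # outcomes drawn to match the scores
print(f"ECE: {expected_calibration_error(scores, labels):.3f}")
```

The same ECE-style comparison against ground-truth outcome rates applies whether the scores come from answer-token probabilities or from verbalized chat-style risk queries.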
Supplementary Material: pdf
Submission Number: 2442