Keywords: LLM, Public Service, Citizen Query, Misinformation, Trustworthiness
TL;DR: Introducing a dataset of 7.5k "citizen query" prompt-response pairs and applying a range of evaluation techniques to draw preliminary findings about LLM usefulness in UK-government public information tasks
Abstract: "Citizen queries" are questions about government policies, guidance, and services relevant to an individual's circumstances. LLM-powered chatbots have a number of strengths that make them the obvious future for citizen query-answering, but hallucinated or outdated answers can cause significant harm to askers in such a sensitive context. We introduce OpenGovCorpus-UK and OpenGovCorpus-eval: a 7.5k-Q\&A-pair benchmark synthesized from $gov.uk$, and its use in an evaluation framework for LLMs in citizen-query tasks. The protocol spans three evaluator classes ((1) open-weights models, (2) GPT-family models, and (3) human judgment) combining a persona-aware Metadata Grader, embedding- and token-level Semantic Similarity, and \LLM-as-a-Judge with pass-rate aggregation. Results show strong few-shot gains, context and persona mismatches not captured by similarity metrics alone, and variation across families of closed/open models. We provide a reproducible procedure and thresholds suitable for lifecycle monitoring as policies and models evolve, supporting evidence-based public sector deployment for the future of trustworthy LLMs in government services.
Submission Number: 231