Beyond Generic Benchmarks: Evaluating the Structural Misalignment of LLMs in Public-Sector Decision Contexts

ICLR 2026 Conference Submission 21930 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM Evaluation + High-stakes Use Cases + LLM Alignment
Abstract: This study moves beyond generic benchmarks to a specific use case, and beyond accuracy metrics to population-level outcomes, providing the first empirical evaluation of large language models (LLMs) in the domain of child maltreatment. To this end, we systematically measure LLM performance on child-maltreatment-related tasks. The evaluation is grounded in the Child Maltreatment 2022 Report, published annually by the U.S. Children's Bureau, which provides real-world national statistics covering the key dimensions of child maltreatment, such as victim demographics, forms of maltreatment, and contributing risk factors. We find that LLMs tend to over-represent certain demographics (e.g., female victims) and more severe maltreatment forms (e.g., physical and psychological abuse), whereas the most prevalent form in the benchmark data is neglect. LLM-generated narratives are also highly homogeneous, both within and across models, even under a variety of prompts. This indicates that LLMs exhibit a cross-model monoculture effect in high-stakes decision-support contexts, producing homogenized and systematically biased outputs that can distort population-level outcomes. The convergence of outputs across architectures and model families demonstrates that this monoculture is not an artifact of a single model but an emergent property of current LLM design and training regimes.
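As a rough illustration of the kind of comparison described in the abstract, the sketch below contrasts a hypothetical LLM-generated distribution of maltreatment forms with hypothetical benchmark proportions, and scores pairwise lexical overlap of short narratives as a crude homogeneity proxy. All numbers, category labels, and the Jaccard similarity measure here are illustrative assumptions, not values or methods taken from the paper or the Child Maltreatment 2022 Report.

```python
# Illustrative sketch only: all figures below are hypothetical, not statistics
# from the paper or the Child Maltreatment 2022 Report.
from itertools import combinations

# Hypothetical benchmark vs. LLM-generated proportions of maltreatment forms.
benchmark = {"neglect": 0.70, "physical": 0.15, "sexual": 0.09, "psychological": 0.06}
llm_output = {"neglect": 0.25, "physical": 0.40, "sexual": 0.10, "psychological": 0.25}

def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def jaccard(a, b):
    """Crude lexical-overlap proxy for similarity between two narratives."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

# Hypothetical generated narratives; high overlap suggests homogeneity.
narratives = [
    "The caseworker reported bruises and a fearful child at home.",
    "The caseworker reported bruises and a frightened child at home.",
    "A report described bruises and a fearful child in the home.",
]

print("TV distance (LLM vs. benchmark):", round(total_variation(benchmark, llm_output), 3))
pairs = list(combinations(narratives, 2))
print("Mean pairwise Jaccard:", round(sum(jaccard(a, b) for a, b in pairs) / len(pairs), 3))
```

A larger TV distance would signal stronger distributional skew relative to the benchmark, and a higher mean pairwise similarity would signal stronger within- or cross-model homogeneity; the paper's actual metrics may differ.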
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21930