Abstract: As the capabilities of large language model (LLM) applications grow, their use in social life is becoming increasingly widespread. However, when assessing whether a model is suitable for deployment in society, it is worth examining not only the model's accuracy but also whether its involvement in human life introduces societal biases. Judicial fairness is a prerequisite for social justice. When LLMs act as judges, their ability to resolve judicial issues fairly is a prerequisite for trustworthiness. Motivated by this, we draw on the theory of judicial fairness to construct a framework for measuring the fairness of LLMs. Based on this framework, we define 65 labels and 161 corresponding label values as measurement indicators and construct a dataset of 177,100 legal decisions. We evaluate 16 LLMs and conduct comparative experiments across temperature settings and case types. Through extensive experiments and statistical significance tests, we find that existing LLMs do not achieve judicial fairness, and that factors such as model size and temperature have no significant effect on model bias. Our findings have important implications for the future training and application of LLMs. We release a toolkit\footnote{https://anonymous.4open.science/r/LLM-Fairness-4673/README.md} containing all data and code to help future researchers measure the fairness of LLMs.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation, ethical considerations in NLP applications
Contribution Types: Data resources
Languages Studied: Chinese
Submission Number: 8236