Abstract: With the growing adoption of large language models (LLMs) in NLP tasks, concerns about their fairness have intensified. However, most existing fairness benchmarks rely on closed-ended settings, which diverge from real-world open-ended interactions and may introduce position bias and minimum score effects. Additionally, they tend to overlook factuality considerations rooted in historical, social, physiological, and cultural contexts, and rarely cover intersectional biases. To address these gaps, we propose F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations. F²Bench comprises 2,568 instances spanning 10 demographic groups and two open-ended tasks. By incorporating text generation, reasoning, and factuality considerations into fairness evaluation, F²Bench better reflects the complexities of real-world scenarios. We conduct a comprehensive evaluation of several LLMs across different series and parameter scales and find that all exhibit varying degrees of fairness issues. We further analyze the models' differing performance, compare closed-ended and open-ended evaluation, and offer recommendations for future model training. Our benchmark is publicly available at https://anonymous.4open.science/status/F2Bench-5883.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5428