Abstract: With the growing adoption of large language models (LLMs) in NLP tasks, concerns about their fairness have intensified. However, most existing fairness benchmarks rely on closed-ended settings, which diverge from real-world open-ended interactions and may introduce position bias and minimum score effects. Additionally, they tend to overlook factuality considerations rooted in historical, social, physiological, and cultural contexts, and rarely cover intersectional biases. To address these gaps, we propose F²Bench: An Open-ended Fairness Evaluation Benchmark for LLMs with Factuality Considerations. F²Bench comprises 2,568 instances spanning 10 demographic groups and two open-ended tasks. By incorporating text generation, reasoning, and factuality considerations into fairness evaluation, F²Bench better reflects the complexities of real-world scenarios. We conduct a comprehensive evaluation of several LLMs across different series and parameter scales and find that all exhibit varying degrees of fairness issues. We further analyze the models' differing performance, compare closed-ended and open-ended evaluation, and offer recommendations for future model training. Our benchmark is publicly available at https://anonymous.4open.science/status/F2Bench-5883.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/fairness evaluation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 5428