Keywords: bias, fairness, harms, evaluation, real-world applications
Abstract: Large language models (LLMs) are increasingly being deployed in high-stakes settings such as hiring, yet their potential for unfair decision-making remains understudied in generation and retrieval. In this work, we examine the allocational fairness of LLM-based hiring systems through two tasks that reflect actual HR usage (resume summarization and applicant ranking), using a synthetic resume dataset with demographic perturbations and curated job postings. Our findings reveal that generated summaries exhibit meaningful differences more frequently for race than for gender perturbations. Additionally, retrieval models show high ranking sensitivity to both gender and race perturbations, and can be comparably sensitive to demographic and non-demographic changes. Overall, our results indicate that LLM-based hiring systems, especially in the retrieval stage, can exhibit notable biases that lead to discriminatory outcomes.
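As a rough illustration of the perturbation-based ranking-sensitivity evaluation the abstract describes, the sketch below shows how one might measure whether a resume's rank changes after a demographic perturbation (here, a name swap). The `score` function, the example names, and the toy token-overlap scorer are hypothetical stand-ins, not the paper's actual dataset or retrieval pipeline; a real study would plug in the retrieval model under test.

```python
from typing import Callable, List

def perturb_name(resume: str, original: str, replacement: str) -> str:
    """Create a demographically perturbed copy by swapping the applicant name."""
    return resume.replace(original, replacement)

def rank(job: str, resumes: List[str],
         score: Callable[[str, str], float]) -> List[int]:
    """Return resume indices ordered by descending relevance score."""
    return sorted(range(len(resumes)),
                  key=lambda i: score(job, resumes[i]), reverse=True)

def rank_shift(job: str, resumes: List[str], idx: int, perturbed: str,
               score: Callable[[str, str], float]) -> int:
    """Number of positions resume `idx` moves after it is perturbed."""
    before = rank(job, resumes, score).index(idx)
    pool = resumes[:idx] + [perturbed] + resumes[idx + 1:]
    after = rank(job, pool, score).index(idx)
    return after - before

if __name__ == "__main__":
    # Toy scorer: token overlap with the job posting (stand-in for a retrieval model).
    def score(job: str, resume: str) -> float:
        return len(set(job.lower().split()) & set(resume.lower().split()))

    job = "software engineer with python experience"
    resumes = ["Emily Smith, software engineer, python",
               "Jamal Jones, data analyst"]
    perturbed = perturb_name(resumes[0], "Emily Smith", "Lakisha Washington")
    print(rank_shift(job, resumes, 0, perturbed, score))  # 0 => ranking unchanged
```

Averaging such rank shifts over many resumes, job postings, and perturbation types (demographic vs. non-demographic) would give a sensitivity measure in the spirit of the comparison reported in the abstract.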
Submission Number: 182