Evaluating the Promise and Pitfalls of Using LLMs in Hiring Decisions

ICLR 2026 Conference Submission21382 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: algorithmic fairness, LLMs, hiring, resume screening, bias audits, bias
TL;DR: On ~10k real resume–job pairs, our domain-specific Match Score beats SOTA LLMs and sustains near-parity fairness (min IR ≈0.95 vs ≈0.80), cautioning against using off-the-shelf LLMs in hiring.
Abstract: Large Language Models (LLMs) hold promise for automating candidate screening in hiring, but their deployment raises serious concerns about predictive accuracy and algorithmic bias. In this work, we benchmark several state-of-the-art foundation LLMs, including models from OpenAI, Anthropic, Google, Meta, and DeepSeek, and compare them with a domain-specific hiring model (Match Score) for job-candidate matching. We evaluate each model’s predictive accuracy (ROC AUC, Precision-Recall AUC, F1-score) and fairness (impact ratios from a cut-off analysis across declared gender, race, and intersectional subgroups). Our experiments on a dataset of roughly 10,000 recent real-world candidate-job pairs show that Match Score outperforms the general-purpose LLMs on accuracy (ROC AUC 0.85 vs 0.77) and achieves significantly more equitable outcomes across demographic groups. Notably, Match Score attains a minimum race-wise impact ratio of 0.957 (near parity), versus 0.809 or lower for the best LLMs, and 0.906 versus 0.773 for intersectional subgroups. We trace this gap to biases in LLM pretraining: even advanced LLMs can propagate societal biases from their training data if not adequately aligned. In contrast, the Match Score model’s task-specific training and bias-mitigation design help it avoid such pitfalls. Furthermore, we provide empirical evidence that accuracy and fairness need not be traded off in hiring: a well-designed algorithm can achieve both accurate screening and fair outcomes. These findings highlight the importance of domain-adapted models and rigorous bias auditing for responsible AI deployment in hiring.
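As a rough illustration of the evaluation described in the abstract (this is a minimal sketch, not the authors' code; the column names, threshold choice, and data loading are hypothetical), the accuracy metrics and the impact-ratio cut-off analysis could be computed along these lines in Python:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score

def accuracy_metrics(y_true, scores, threshold=0.5):
    """ROC AUC, Precision-Recall AUC, and F1 at a fixed decision threshold."""
    return {
        "roc_auc": roc_auc_score(y_true, scores),
        "pr_auc": average_precision_score(y_true, scores),
        "f1": f1_score(y_true, scores >= threshold),
    }

def impact_ratio(df, score_col, group_col, cutoff):
    """Cut-off analysis: each group's selection rate divided by the
    highest group's selection rate; the minimum ratio is the reported IR."""
    selected = df[score_col] >= cutoff
    rates = selected.groupby(df[group_col]).mean()
    ratios = rates / rates.max()
    return ratios.min(), ratios

# Hypothetical usage on a candidate-job pair dataframe with columns
# "label", "score", "race", "gender" (illustrative names only).
# df = pd.read_parquet("pairs.parquet")
# cutoff = np.quantile(df["score"], 0.75)   # e.g. a top-25% shortlist cut-off
# print(accuracy_metrics(df["label"].values, df["score"].values))
# print(impact_ratio(df, "score", "race", cutoff))
```

Under this reading, a minimum impact ratio near 1.0 (e.g. the reported 0.957) indicates near-parity selection rates across groups at the chosen cut-off, while values around 0.8 sit at the commonly cited four-fifths threshold.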
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21382