Masked and Fair? Identifying and Mitigating Gender Cues in Academic Recommendation Letters with Interpretable NLP
Abstract: Letters of recommendation (LoRs) can carry patterns of gendered language that may inadvertently influence downstream decisions, e.g., in hiring and admissions. In this work, we investigate the extent to which Transformer-based Large Language Models (LLMs) can infer the gender of applicants in academic LoRs after explicit identifiers such as names and pronouns are de-gendered. When fine-tuning three LLMs (DistilBERT, RoBERTa, and Llama 2) to classify the gender of anonymized and de-gendered LoRs, we find significant gender leakage, evidenced by up to 68% classification accuracy. Using the text interpretation methods TF-IDF and SHAP, we demonstrate that certain linguistic patterns are strong proxies for gender, e.g., “emotional” and “humanitarian” are commonly associated with LoRs for female applicants. As an experiment in creating truly gender-neutral LoRs, we remove these implicit gender cues and observe a drop of up to 7% in accuracy and 4% in macro F1 score when re-training the classifiers. However, applicant gender prediction still remains better than chance. Our findings highlight that LoRs contain gender-identifying cues that are hard to remove and may activate bias in decision-making. While technical solutions may be a concrete step toward fairer academic and professional evaluations, future work is needed to ensure gender-agnostic LoR review.
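For illustration, the following is a minimal sketch of how SHAP token attributions could be obtained from such a fine-tuned gender classifier, assuming a Hugging Face text-classification pipeline; the model path, label names, and example text are placeholders, not the paper's actual code.

import shap
from transformers import pipeline

# Hypothetical path to a classifier fine-tuned on de-gendered letters.
clf = pipeline(
    "text-classification",
    model="path/to/finetuned-gender-classifier",
    top_k=None,       # return scores for all labels (older transformers versions: return_all_scores=True)
    truncation=True,
)

# SHAP wraps the pipeline with a text masker and attributes each prediction
# to the individual (sub)tokens of the letter.
explainer = shap.Explainer(clf)
shap_values = explainer(["FIRST_NAME is an emotional and humanitarian mentor ..."])

# Per-token contributions toward the assumed "FEMALE" label; adjust the label
# string to the model's actual id2label mapping.
shap.plots.text(shap_values[:, :, "FEMALE"])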
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: data ethics, model bias/fairness evaluation, model bias/unfairness mitigation, ethical considerations in NLP applications, transparency, reflections and critiques
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: Yes
A2 Elaboration: Yes, we discuss several potential risks of our work in Section 6. We highlight that removing implicitly gendered tokens can compromise the semantic integrity and evaluative clarity of letters, making them difficult for humans to interpret. We also note that SHAP-based interpretations can be imprecise due to subtokenization artifacts. More broadly, we caution that technical de-gendering approaches may risk trivializing applicant identities and unintentionally bypassing necessary institutional changes around fairness and inclusion.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Yes, we cited the creators of all scientific artifacts we used, including pre-trained models, interpretability tools, datasets, and lexical resources. These citations appear throughout the Methodology (Section 3), Experiments (Section 4), and Related Works (Section 2), and are listed in full in the References section.
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Yes. We ensured that all external artifacts used in our work were consistent with their original terms of use and intended for non-commercial research. The pre-trained models we used—DistilBERT and RoBERTa—are publicly available under open licenses that permit research use. Llama 2 was accessed under the terms of Meta’s license agreement for academic research. We also used gendered word lists from Bias-BERT and GN-GloVe, which are distributed for academic purposes. All components in our pipeline were used in accordance with these intended uses. We specify this in Section E of the Appendix.
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Yes. We ensured that the Llama model was used in accordance with its intended research or academic use. We discuss this in Section E of the Appendix. The pre-trained models we used—DistilBERT and RoBERTa—are released for research purposes, and our use was consistent with those terms. Similarly, the gendered word lists from Bias-BERT and GN-GloVe were used only for academic research. For the artifacts we created, including de-gendered datasets and model outputs, we did not redistribute them and used them exclusively within the scope of this study. All usage remained within the bounds of the original access conditions.
B4 Data Contains Personally Identifying Info Or Offensive Content: Yes
B4 Elaboration: Yes. We discuss our data anonymization procedure in Section 4.1.2. All applicant names in the recommendation letters were replaced with special tokens (e.g., FIRST_NAME, LAST_NAME) to remove direct identifiers. These steps were taken to ensure that the data does not contain personally identifiable information. Offensive content was not a known concern in this dataset, as it consists of formal academic recommendation letters. The project also received IRB approval, confirming that appropriate steps were taken to protect individuals' privacy.
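For illustration, a minimal sketch of the kind of name substitution described above; the helper name and regular expressions are hypothetical, and the paper's actual anonymization procedure (Section 4.1.2) may handle additional cases such as initials or nicknames.

import re

def anonymize_names(letter: str, first_name: str, last_name: str) -> str:
    # Replace whole-word occurrences of the applicant's name with placeholder tokens.
    letter = re.sub(rf"\b{re.escape(first_name)}\b", "FIRST_NAME", letter, flags=re.IGNORECASE)
    letter = re.sub(rf"\b{re.escape(last_name)}\b", "LAST_NAME", letter, flags=re.IGNORECASE)
    return letter

# Example: anonymize_names("I strongly recommend Jane Doe.", "Jane", "Doe")
#          -> "I strongly recommend FIRST_NAME LAST_NAME."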
B5 Documentation Of Artifacts: Yes
B5 Elaboration: Yes, for the Llama 2 model, we refer to the comprehensive documentation provided by its creators, Meta AI, in their official publication and on their dedicated Llama webpage. We also provide details on the specific version (Llama2-7B-chat) and its access via the Hugging Face transformers library in Section E of the Appendix. We did not provide formal documentation of the datasets, such as coverage of domains, languages, or demographic groups. While we describe the dataset as consisting of 8,992 recommendation letters submitted to a U.S. anesthesiology residency program (Section 4.1.1), our analysis focused specifically on gendered language. As such, we did not document other attributes beyond what was directly relevant to our study objectives.
B6 Statistics For Data: Yes
B6 Elaboration: Yes. We report relevant dataset statistics in Section 4.1.1, including the total number of recommendation letters (8,992), the gender distribution of applicants (31% female, 69% male), and the context of data collection (U.S. anesthesiology residency program). In Section 3.2.3, we describe our data processing pipeline and specify the train/validation/test split as 80/10/10. These details provide sufficient context for understanding how the data was used in our experiments.
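For illustration, a minimal sketch of an 80/10/10 split such as the one described above, assuming stratification by the gender label (an assumption; the paper does not show its splitting code). The data below are placeholders.

from sklearn.model_selection import train_test_split

# Placeholder data standing in for the 8,992 de-gendered letters and their labels.
letters = [f"letter {i}" for i in range(20)]
labels = ["female" if i < 10 else "male" for i in range(20)]

# 80% train, then split the remaining 20% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    letters, labels, test_size=0.20, stratify=labels, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=42)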
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Yes. We report the number of parameters for the models used—DistilBERT (~66M), RoBERTa, and Llama 2 (7B)—in Section 3.2.1. We note that experiments were run on a single NVIDIA A100 (40 GB) GPU (Section 4.1.3), but we do not report the total computational budget (e.g., GPU hours) used for training and evaluation. Our focus was on methodological analysis rather than benchmarking computational efficiency.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Yes. We discuss our experimental setup in Sections 3.2.3 and 4.1.3, including details of our training procedure and hyperparameter settings. We describe batch sizes, learning rates, number of epochs, optimizer settings, and use of class weights to address label imbalance. In Appendix A, we also report the best-found hyperparameter values for each model after a limited grid search.
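For illustration, a minimal sketch of how class weights can be applied during fine-tuning, assuming a Hugging Face Trainer-based setup; the subclass name and the weights shown are placeholders and do not reflect the grid-search values reported in Appendix A.

import torch
from torch import nn
from transformers import Trainer

class WeightedLossTrainer(Trainer):
    # Trainer subclass that swaps in a class-weighted cross-entropy loss
    # to counter the 31%/69% female/male label imbalance.
    def __init__(self, class_weights, **kwargs):
        super().__init__(**kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        weights = self.class_weights.to(outputs.logits.device)
        loss = nn.CrossEntropyLoss(weight=weights)(
            outputs.logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# Hypothetical inverse-frequency weights for [female, male]:
# class_weights = torch.tensor([1.0 / 0.31, 1.0 / 0.69])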
C3 Descriptive Statistics: No
C3 Elaboration: No. We report results from a single run per model configuration and do not include error bars or summary statistics across multiple runs. The reported metrics—accuracy, precision, recall, macro F1, and weighted F1—reflect the outcome of one training run on a fixed train/validation/test split. While we conducted hyperparameter searches and selected the best-performing configurations across runs, the results presented are from individual final runs. Given the exploratory nature of our work—aimed at understanding gender leakage and model interpretability rather than establishing benchmarks—we prioritized clarity in comparing de-gendering conditions over quantifying variance.
C4 Parameters For Packages: Yes
C4 Elaboration: Yes. We detail the implementation frameworks used in our experiments in the Computational Frameworks section of the appendix. This includes HuggingFace Transformers for model loading and tokenization, PyTorch for fine-tuning, scikit-learn for evaluation metrics and TF-IDF, and SHAP for model interpretability. Default parameter settings were used unless otherwise specified, and all tools are cited in the references. Given the exploratory nature of our work—centered on understanding gender leakage and interpretability rather than benchmarking—we believe the description provided in Section D of the appendix offers sufficient clarity and reproducibility.
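For illustration, a minimal sketch of the TF-IDF side of this pipeline: ranking terms by how much their mean TF-IDF weight differs between letters written for female and male applicants. The corpus and variable names are placeholders, not the paper's analysis code.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus standing in for the de-gendered letters and their labels.
letters = ["an emotional and humanitarian mentor", "a strong analytical leader"] * 10
labels = np.array(["female", "male"] * 10)

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(letters)
terms = vectorizer.get_feature_names_out()

# Mean TF-IDF weight per term within each group of letters.
female_mean = np.asarray(X[labels == "female"].mean(axis=0)).ravel()
male_mean = np.asarray(X[labels == "male"].mean(axis=0)).ravel()

# Terms most skewed toward letters written for female applicants.
top_female_terms = terms[np.argsort(female_mean - male_mean)[::-1][:20]]
print(top_female_terms)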
D Human Subjects Including Annotators: Yes
D1 Instructions Given To Participants: No
D1 Elaboration: No. We did not collect new data or conduct studies involving human participants or annotators, so no instructions or disclaimers were provided or required. Our work relied solely on an existing, anonymized dataset approved for research use.
D2 Recruitment And Payment: No
D2 Elaboration: No. We did not recruit or compensate any participants. Our work relied solely on an existing, anonymized dataset approved for research use. All analysis was conducted on a pre-existing, anonymized dataset collected under prior IRB approval.
D3 Data Consent: Yes
D3 Elaboration: Yes. We note in the paper that the dataset of recommendation letters was collected under IRB approval, and all applicant names were anonymized before analysis (Section 6: Ethical Considerations). This implies that appropriate consent procedures were followed during the original data collection process.
D4 Ethics Review Board Approval: Yes
D4 Elaboration: Yes. As stated in the Ethical Considerations section (Section 6), the collection of recommendation letter data was approved by an Institutional Review Board (IRB) prior to analysis.
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: Yes
E1 Elaboration: We did not include this information in the main paper, as our use of AI tools falls under appropriate categories described in the ACL Policy on AI Writing Assistance. Specifically, we used AI tools (e.g., ChatGPT) to assist with language polishing and code development. No AI-generated content was included verbatim in the paper, and all writing and code were reviewed and authored by the research team. We disclose this use in the checklist in accordance with ARR guidelines.
Author Submission Checklist: Yes
Submission Number: 922