PhonATe: Impact of Type-Written Phonological Features of African American Language on Generative Language Modeling Tasks

Published: 10 Jul 2024, Last Modified: 26 Aug 2024, Venue: COLM, License: CC BY 4.0
Research Area: Data, Societal implications, LMs for everyone
Keywords: data augmentation, African American Language, bias and fairness, language generation
TL;DR: Fine-tuning and ablation experiments with synthetic type-written phonological features of AAL (e.g., "goin") reveal that these features are a vital consideration for understanding perceived model biases.
Abstract: Current Large Language Models perform poorly on African American Language (AAL) texts in tasks such as toxicity detection and sentiment analysis. AAL is underrepresented in both pre-training data and existing benchmarks for these tasks, which hinders thorough evaluation and understanding of these biases. We present a novel approach for synthetically introducing type-written phonological features of AAL into text, a class of AAL features that has been overlooked in prior work. Our goal is to better understand how these features affect generative language models' performance on three tasks: toxicity detection, sentiment analysis, and masked span prediction. We find that fine-tuning with synthetic type-written phonological features lowers perceived biases on downstream tasks, and our ablations reveal which features have particularly large negative impacts on model performance. Our results suggest that phonological features are vital to consider when designing bias mitigation techniques.
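To make the idea of a synthetic type-written phonological feature concrete, here is a minimal sketch of one such rewrite, based only on the "goin" example in the TL;DR. It is an illustration under stated assumptions, not the authors' augmentation pipeline: the function name `drop_final_g`, the `rate` parameter, and the regular expression are all hypothetical, and the abstract does not say which features the paper covers or how they are applied.

```python
import random
import re

# Hypothetical heuristic: a word of at least two letters followed by a
# word-final "ing" (e.g., "going"). Short words such as "king" or "sing"
# are skipped, but nouns like "thing" would still be rewritten, so a
# realistic system would need part-of-speech or lexicon checks.
_ING_WORD = re.compile(r"\b([A-Za-z]{2,})ing\b")


def drop_final_g(text, rate=1.0, rng=None):
    """Rewrite word-final "ing" as "in" with probability `rate`.

    Assumption: this mimics a single feature ("going" -> "goin");
    it is not the method described in the paper.
    """
    rng = rng or random.Random(0)  # seeded for reproducible augmentation

    def _rewrite(match):
        # Apply the rewrite to each matching word independently.
        if rng.random() < rate:
            return match.group(1) + "in"
        return match.group(0)

    return _ING_WORD.sub(_rewrite, text)


print(drop_final_g("She is going home"))
```

Setting `rate` below 1.0 would mix rewritten and unchanged forms within the same text, which is one plausible way to vary how densely a feature appears in fine-tuning or ablation data.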
Submission Number: 1359