Keywords: identity, race, gender, sexual orientation, survey methodology, SAEs, LLMs
TL;DR: We analyze ~1,000 free-text identity responses on race, gender, and sexual orientation using SAEs and LLMs, showing that these responses capture nuances beyond Census categories. Free-text themes differ from conventional categories and improve predictions of life outcomes.
Submission Type: Non-Archival
Abstract: Categories for race, gender, and sexual orientation facilitate computational analysis but may not fully capture people's identities. Here, we investigate the potential of using open-ended, self-described identity responses, combined with modern language modeling techniques, to capture more granular and contextually relevant insights into individuals’ identities. We collect, and will make publicly available, the "In Your Own Words" dataset, which includes free-text identity descriptions from 1,004 respondents. Alongside these open-ended responses, the dataset contains categorical identity measures and life outcome variables, such as health, life satisfaction, and experiences with discrimination. Our analysis reveals that free-text identity themes do not fully align with Census categories, demonstrating heterogeneity within and across identity groups. Furthermore, nested F-test analyses indicate that adding free-text identity themes to a category-only model improves predictions of key life outcomes, such as mental health, life satisfaction, and perceptions of discrimination. Overall, our results demonstrate that free-text data, when analyzed with sparse autoencoders (SAEs) and large language models (LLMs), enhance the quantitative study of identity by combining the nuance of qualitative responses with the scalability of computational methods. This work contributes to broader efforts in survey methodology and public opinion research, highlighting the potential of LLMs to capture the social meaning and lived experiences of identity, enabling richer measurement and potentially more accurate modeling of attitudes, well-being, and behavior.
Submission Number: 30