Applications of Natural Language Processing and Large Language Models for Social Determinants of Health: A Systematic Review (Preprint)

Swati Rajwal, Avinash Kumar Pandey, Ziyuan Zhang, Yankai Chen, Michael X. Liu, Sudeshna Das, Hannah Rogers, Abeed Sarker, Yunyu Xiao

Published: 08 Sept 2025, Last Modified: 26 Nov 2025, Crossref, CC BY-SA 4.0
Abstract:
Background: Social determinants of health (SDOH) are the social, economic, and environmental conditions that influence health outcomes. SDOH are often embedded in unstructured text, such as notes in electronic health records (EHRs) and social media posts. Advances in natural language processing (NLP), particularly the ubiquity of large language models (LLMs), offer emerging opportunities to extract, analyze, and interpret SDOH information from these sources and relate it to clinical outcomes. However, existing NLP studies are scattered across disciplines, use various methodologies, and vary in quality and scope, making it difficult to draw cohesive insights or benchmark progress.
Objective: This systematic review aims to characterize the current landscape of NLP and LLM applications in SDOH-related research. Specifically, it identifies common NLP task areas, models, data sources, evaluation practices, and key findings, while highlighting methodological gaps and opportunities for future work.
Methods: Following PRISMA guidelines, we searched PubMed, Web of Science, IEEE Xplore, Scopus, PsycINFO, Health Source: Academic Nursing, and ACL Anthology to find studies published in English between 2014 and 2024. Eligible studies used NLP methods, including deep learning, transformer models, and LLMs, to identify, classify, or predict SDOH from text. Screening and data extraction were conducted by independent reviewers, with conflicts resolved by consensus. The review protocol was registered in PROSPERO (registration number: CRD42024578082).
Results: A total of 129 studies met the inclusion criteria. We observed rapid growth in the field since 2021 (79% of studies were published in 2021-2024). EHRs were the most common data source, although institution-based access restrictions were frequent. Most studies focused on extraction or classification tasks, using transformer-based models such as BERT (n=28) and large language models (n=13). Housing instability (46.9%), financial context (46.2%), employment (39.2%), substance use (32.3%), and social connection or isolation (31.5%) were among the most studied social determinants of health. Although several studies reported strong model performance, reproducibility remains limited due to restricted data and code availability.
Conclusions: This systematic review highlights the expanding role of NLP and LLMs in SDOH research and their potential to support scalable, data-driven approaches to addressing health disparities. Future work should prioritize longitudinal analysis, structural determinants, public benchmarks, and real-world implementation. Enhancing transparency and inclusivity in datasets and model development is critical to realizing the promise of NLP for equitable health outcomes.
Clinical Trial: International Registered Report Identifier (IRRID): DERR1-10.2196/66094; Review registered on PROSPERO 2024 CRD42024578082; JMIR Research Protocol 2025;14:e66094; doi:10.2196/66094