Semi-Automatic Extraction and Analysis of Health Equity Covariates in Registered Research Projects

Navapat Nananukul, Mayank Kejriwal

Published: 07 Nov 2025, Last Modified: 14 Jan 2026, Applied Sciences, CC BY-SA 4.0
Abstract: Advancing health equity requires rigorous analysis of how research initiatives incorporate and address structural disparities across populations. In this study, we apply large language models (LLMs) to systematically analyze research projects registered on the All of Us platform, with a focus on identifying patterns and institutional dynamics associated with health equity research. We examine the relationship between projects that explicitly pursue health equity goals and their use of available demographic data, their institutional composition (e.g., single- vs. multi-institutional teams), and the research tier of participating institutions (R1 vs. R2). Using the capabilities of an established LLM, we automate key tasks including the extraction of relevant attributes from unstructured project descriptions, classification of institutional affiliations, and the summarization of project content into standardized keywords from the Unified Medical Language System vocabulary. This LLM-assisted pipeline enabled scalable, replicable analysis of hundreds of projects with minimal manual overhead. Our findings suggest a strong association between the use of demographic data and health equity aims, and indicate nuanced differences in equity-oriented research participation by institution type and collaborative structure. More broadly, our approach demonstrates how LLMs can support equity-focused computational social science by transforming free-text administrative data into analyzable structures, enabling novel insights in public health, team science, and science-of-science studies.