An Agentic System for Automated Data Curation and Analysis in Large-Scale Biobanks
Keywords: AI Agent, UK Biobank, Large Language Model, Phenotyping
Track: Proceedings
Abstract: Translating clinical and lifestyle concepts into computable phenotypes is a critical but manually intensive bottleneck in using large-scale biomedical datasets such as the UK Biobank. The process is slow, requires deep domain expertise, scales poorly, and is hard to reproduce, especially for clinicians unfamiliar with large-scale data analysis. We propose and develop an autonomous, dual-component agentic system that automates the research workflow from hypothesis to report. The first component, a large language model (LLM)-based data preprocessing framework, systematically searches the UK Biobank's public data dictionary and translates high-level clinical and lifestyle concepts into machine-readable rules. The second component, the Analysis Agent, autonomously executes the statistical analysis plan and synthesizes the findings. We validate the framework by phenotyping and analyzing several clinical and lifestyle screeners. This work demonstrates a viable end-to-end system that improves scalability and makes complex data analysis more accessible and transparent, a foundational step toward AI-driven scientific discovery.
General Area: Applications and Practice
Specific Subject Areas: Natural Language Processing
PDF: pdf
Supplementary Material: pdf
Data And Code Availability: Yes
Ethics Board Approval: No
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Code URL: https://github.com/ukjung21/ukb-agent
Submission Number: 175
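
Illustrative sketch (not the authors' implementation): the abstract describes a preprocessing step that searches a data dictionary and turns a high-level concept into machine-readable rules. The toy example below shows one way such a rule could be represented and applied. Everything in it is an assumption made for illustration: the field IDs (`F001` and so on), the `Rule` dataclass, and the helper names `search_dictionary` and `apply_rule` are hypothetical, and a simple keyword-overlap search stands in for the LLM-based search.

```python
from dataclasses import dataclass

# Toy stand-in for a data dictionary. IDs and titles are invented for this
# example and do not correspond to real UK Biobank fields.
TOY_DICTIONARY = [
    {"field_id": "F001", "title": "Smoking status"},
    {"field_id": "F002", "title": "Alcohol intake frequency"},
    {"field_id": "F003", "title": "Sleep duration"},
]


@dataclass(frozen=True)
class Rule:
    """A machine-readable phenotype rule: <field> <operator> <value>."""
    field_id: str
    operator: str
    value: object


# Comparison operators a rule is allowed to use.
OPERATORS = {
    "==": lambda a, b: a == b,
    ">=": lambda a, b: a >= b,
    "<": lambda a, b: a < b,
}


def search_dictionary(concept, dictionary):
    """Rank fields by word overlap with the concept.

    Keyword overlap is a deliberately simple placeholder for LLM-based search.
    """
    words = set(concept.lower().split())
    scored = []
    for entry in dictionary:
        overlap = len(words & set(entry["title"].lower().split()))
        if overlap:
            scored.append((overlap, entry))
    # Sort on the score only, so dict entries are never compared directly.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in scored]


def apply_rule(rule, participant):
    """Return True if the participant record satisfies the rule.

    Records with a missing value for the field are treated as not matching.
    """
    value = participant.get(rule.field_id)
    if value is None:
        return False
    return OPERATORS[rule.operator](value, rule.value)


# Usage: map the concept "current smoking" to a field, then flag participants.
matches = search_dictionary("current smoking", TOY_DICTIONARY)
rule = Rule(field_id=matches[0]["field_id"], operator="==", value="current")
participants = [{"F001": "current"}, {"F001": "never"}, {}]
flags = [apply_rule(rule, p) for p in participants]
```

Keeping rules as plain data rather than free-form code is one way to make the resulting phenotype definitions easy to inspect and rerun, in line with the transparency and reproducibility goals stated in the abstract.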