Keywords: large language models, sample efficiency, clinical prediction, genomics
TL;DR: We propose techniques for sample-efficient learning that leverage domain knowledge from large language models, and demonstrate significant improvements in clinical prediction models trained on small patient cohorts.
Abstract: We consider the problem of learning a classifier from a small set of high-dimensional datapoints, with access to domain knowledge from a language model or human expert. How should such domain knowledge be elicited and leveraged together with the data? We investigate four strategies and their combinations, for linear models, the de facto choice in the small-sample regime: (1) feature selection, (2) coefficient sign constraints, (3) inequality constraints on coefficient magnitudes, and (4) equality constraints across related features. Each restricts the hypothesis class, improving generalization in small-sample settings, and corresponds to a natural query for an expert or Large Language Model (LLM) that draws on prior knowledge. We evaluate these approaches on four clinical prediction tasks involving high-dimensional molecular measurements, where labeled datasets are typically limited to small patient cohorts. We find that models trained on as few as five patients outperform baselines requiring ten times more patients, with improvements upwards of 20\% AUC, exceeding human expert–designed models. This highlights new strategies for LLM-guided model design in data-limited settings.
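To make strategy (2) concrete, here is a minimal sketch (not code from the paper) of fitting a sign-constrained logistic regression: expert- or LLM-elicited signs restrict each coefficient to be nonnegative, nonpositive, or free, which can be enforced directly as box bounds in an off-the-shelf optimizer. All names and the toy data below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sign_constrained_logreg(X, y, signs, l2=1.0):
    """Hypothetical sketch of coefficient sign constraints (strategy 2).

    signs[j] = +1 forces w[j] >= 0, -1 forces w[j] <= 0, 0 leaves it free.
    y is expected in {-1, +1}.
    """
    n, d = X.shape

    def loss(w):
        z = X @ w
        # logistic loss plus a ridge penalty for the small-sample regime
        return np.mean(np.log1p(np.exp(-y * z))) + l2 * np.sum(w ** 2)

    # Sign constraints become simple box bounds on each coefficient.
    bounds = [(0, None) if s > 0 else (None, 0) if s < 0 else (None, None)
              for s in signs]
    res = minimize(loss, np.zeros(d), method="L-BFGS-B", bounds=bounds)
    return res.x

# Toy example: feature 0 is positively predictive, feature 1 negatively.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=40))
w = fit_sign_constrained_logreg(X, y, signs=[+1, -1])
```

Strategies (3) and (4) would analogously enter as linear inequality or equality constraints on `w`; the bounds mechanism here only covers the sign-constraint case.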
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23049