Keywords: large language models, sample efficiency, clinical prediction, genomics
TL;DR: We propose techniques for sample-efficient learning that leverage domain knowledge from large language models, and demonstrate significant improvements in clinical prediction models trained on small patient cohorts.
Abstract: We consider the problem of learning a classifier from a small set of high-dimensional datapoints, with access to domain knowledge from a language model or human expert. How should such domain knowledge be elicited and leveraged together with the data? We investigate four strategies and their combinations, for linear models, the de facto choice in the small-sample regime: (1) feature selection, (2) coefficient sign constraints, (3) inequality constraints on coefficient magnitudes, and (4) equality constraints across related features. Each restricts the hypothesis class, improving generalization in small-sample settings, and corresponds to a natural query for an expert or Large Language Model (LLM) that draws on prior knowledge. We evaluate these approaches on four clinical prediction tasks involving high-dimensional molecular measurements, where labeled datasets are typically limited to small patient cohorts. We find that models trained on as few as five patients outperform baselines requiring ten times more patients, with improvements upwards of 20\% AUC, exceeding human expert–designed models. This highlights new strategies for LLM-guided model design in data-limited settings.
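To make strategy (2) concrete, here is a minimal sketch (not code from the paper) of fitting a sign-constrained logistic regression: expert- or LLM-elicited signs restrict each coefficient to be nonnegative, nonpositive, or free, which can be enforced directly as box bounds in an off-the-shelf optimizer. All names and the toy data below are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def fit_sign_constrained_logreg(X, y, signs, l2=1.0):
    """Hypothetical sketch of coefficient sign constraints (strategy 2).

    signs[j] = +1 forces w[j] >= 0, -1 forces w[j] <= 0, 0 leaves it free.
    y is expected in {-1, +1}.
    """
    n, d = X.shape

    def loss(w):
        z = X @ w
        # logistic loss plus a ridge penalty for the small-sample regime
        return np.mean(np.log1p(np.exp(-y * z))) + l2 * np.sum(w ** 2)

    # Sign constraints become simple box bounds on each coefficient.
    bounds = [(0, None) if s > 0 else (None, 0) if s < 0 else (None, None)
              for s in signs]
    res = minimize(loss, np.zeros(d), method="L-BFGS-B", bounds=bounds)
    return res.x

# Toy example: feature 0 is positively predictive, feature 1 negatively.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
y = np.sign(X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=40))
w = fit_sign_constrained_logreg(X, y, signs=[+1, -1])
```

Strategies (3) and (4) would analogously enter as linear inequality or equality constraints on `w`; the bounds mechanism here only covers the sign-constraint case.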
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23049