Automated Prior Elicitation from Large Language Models for Bayesian Logistic Regression

Published: 12 Jul 2024, Last Modified: 12 Aug 2024, AutoML 2024 Workshop, CC BY 4.0
Keywords: Bayesian Statistics, Large Language Models
Abstract: We investigate how one can automatically retrieve prior knowledge and use it to improve the sample efficiency of training linear models. This is addressed using the Bayesian formulation of logistic regression, which relies on specifying a prior distribution that accurately captures the beliefs that the data analyst, or an associated domain expert, holds about the model parameters before seeing any data. We develop a broadly applicable strategy for crafting informative priors through the use of Large Language Models (LLMs). The method relies on generating synthetic data using the LLM, and then modelling the distribution over labels that the LLM associates with the generated data. In contrast to existing methods, the proposed approach does not require a substantial time investment from a domain expert and has the potential to leverage access to a much broader range of information. Moreover, our method is straightforward to implement, requiring only the ability to make black-box queries of a pre-trained LLM. The experimental evaluation demonstrates that the proposed approach can yield a substantial benefit in some situations, at times achieving an absolute improvement of more than 10% accuracy in the severely data-scarce regime. We show that such gains can be had even when only a small volume of information is elicited from the LLM.
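The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the "LLM-elicited" synthetic labelled data here is simulated with a fixed generator as a stand-in for real black-box LLM queries, and the elicited prior is a simple Gaussian centred on a MAP fit of that synthetic data, with the precision values chosen arbitrarily for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_map(X, y, prior_mean, prior_prec, lr=0.1, steps=2000):
    """MAP estimate for logistic regression weights under an isotropic
    Gaussian prior N(prior_mean, prior_prec^{-1} I), via gradient ascent
    on the log-posterior."""
    w = prior_mean.copy()
    for _ in range(steps):
        p = sigmoid(X @ w)
        # Gradient of log-likelihood plus gradient of Gaussian log-prior.
        grad = X.T @ (y - p) - prior_prec * (w - prior_mean)
        w += lr * grad / len(y)
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])

# Stand-in for LLM-elicited synthetic data: in the actual method these
# feature/label pairs would come from black-box queries to a pre-trained LLM.
X_syn = rng.normal(size=(200, 2))
y_syn = (rng.uniform(size=200) < sigmoid(X_syn @ true_w)).astype(float)

# Step 1: fit the synthetic data under a weak prior; the result becomes
# the mean of an informative prior for the real task.
w_prior = fit_map(X_syn, y_syn, prior_mean=np.zeros(2), prior_prec=1.0)

# Step 2: train on a severely data-scarce real dataset (10 points),
# regularised towards the elicited prior mean.
X_real = rng.normal(size=(10, 2))
y_real = (rng.uniform(size=10) < sigmoid(X_real @ true_w)).astype(float)
w_post = fit_map(X_real, y_real, prior_mean=w_prior, prior_prec=5.0)
```

With only ten real examples, the posterior estimate is pulled towards the elicited prior mean rather than towards zero, which is the mechanism by which an informative prior improves sample efficiency in the data-scarce regime.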
Submission Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Optional Meta-Data For Green-AutoML: All questions below on environmental impact are optional.
Community Implementations: https://www.github.com/henrygouk/llm-prior
Submission Number: 31