LLMSelect: Knowledge-based Feature Selection with Large Language Models

24 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Models, Feature Selection, Machine Learning with Prior Knowledge
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
TL;DR: We show that the prior knowledge of pretrained LLMs can be leveraged to select high-signal features for downstream supervised learning tasks.
Abstract: How can we leverage the implicit prior knowledge and reasoning capabilities of large language models (LLMs) for standard supervised learning tasks? In this work, we demonstrate that pretrained LLMs can augment traditional machine learning models by selecting high-signal features without looking at the training data. Given only the candidate feature names and a minimal description of the prediction task, we prompt the LLM to directly output a set of numerical feature importance scores in text and use them for feature selection. In a series of real-world prediction tasks, we show that LLM-based feature selection can lead to strong downstream predictive performance, competitive with standard selection methods such as the LASSO and sequential feature selection. We investigate the sensitivity of this approach to various prompt-design and sampling strategies and to the scale of the pretrained LLM, and find that the simple setting of zero-shot prompting with zero-temperature sampling can be sufficient for strong downstream performance, given a large enough LLM. We also demonstrate that the LLM-generated feature importance scores exhibit nontrivial rank correlation with commonly used feature importance measures such as Shapley values, illustrating that LLMs can effectively distill prior knowledge into meaningful numerical scores.
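The abstract describes a three-step pipeline: build a zero-shot prompt from feature names and a task description, have the LLM emit numerical importance scores as text, then keep the top-scoring features. The sketch below illustrates that pipeline under stated assumptions; the prompt wording, the `name: score` output format, and the stubbed LLM response are all hypothetical placeholders (a real run would query the model at temperature 0, as in the paper's zero-shot setting), not the authors' exact prompts.

```python
from typing import Dict, List

def build_prompt(task: str, features: List[str]) -> str:
    """Assemble a zero-shot prompt asking for numerical importance scores.

    The exact wording here is an illustrative assumption, not the
    prompt used in the paper.
    """
    lines = [
        f"Prediction task: {task}",
        "For each candidate feature below, output one line of the form",
        "'feature_name: score', where score is an importance in [0, 1].",
        "Features:",
    ]
    lines += [f"- {f}" for f in features]
    return "\n".join(lines)

def parse_scores(llm_output: str) -> Dict[str, float]:
    """Parse 'name: score' lines from the LLM's text response."""
    scores: Dict[str, float] = {}
    for line in llm_output.strip().splitlines():
        name, _, value = line.partition(":")
        try:
            scores[name.strip()] = float(value)
        except ValueError:
            continue  # skip malformed lines rather than failing
    return scores

def select_top_k(scores: Dict[str, float], k: int) -> List[str]:
    """Keep the k features with the highest LLM-assigned importance."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Hypothetical LLM response for a health-outcome prediction task:
response = "age: 0.9\nzip_code: 0.2\nblood_pressure: 0.8\nfavorite_color: 0.05"
selected = select_top_k(parse_scores(response), k=2)
print(selected)  # → ['age', 'blood_pressure']
```

The selected subset would then be passed to any standard downstream model, which is what makes the approach a drop-in replacement for data-driven selectors like the LASSO.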
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8695