Abstract: We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso $\ell_1$ regression. LLM-Lasso incorporates domain-specific knowledge extracted from natural language, optionally enhanced through a retrieval-augmented generation (RAG) pipeline, as penalty factors in the Lasso's $\ell_1$ regularization term, integrating data-driven modeling with metadata from natural language. To our knowledge, this is the first embedded LLM-driven feature selector. By design, LLM-Lasso addresses a key robustness challenge of LLM-driven feature selection: the risk of hallucinated or low-quality LLM responses. An internal cross-validation step is crucial to LLM-Lasso's performance, determining how heavily the prediction pipeline relies on the LLM's outputs; when LLM generation quality is poor, this procedure lets LLM-Lasso recover the behavior of the plain Lasso. In some biomedical case studies, LLM-Lasso outperforms the standard Lasso and existing feature-selection baselines.
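The abstract's core mechanism, per-feature penalty factors in the Lasso's $\ell_1$ term, can be sketched with the standard column-rescaling trick: solving $\min_\beta \tfrac{1}{2n}\|y - X\beta\|_2^2 + \alpha \sum_j w_j |\beta_j|$ is equivalent to running a plain Lasso on $X_j / w_j$ and dividing the fitted coefficients by $w_j$. The weights `w` below are hypothetical stand-ins for LLM-derived penalty factors (the paper's actual pipeline and weight values are not shown here); this is a minimal sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]  # only the first 3 features matter
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Hypothetical penalty factors (stand-in for LLM outputs):
# small w_j = feature deemed relevant, lightly penalized;
# large w_j = feature deemed irrelevant, heavily penalized.
w = np.full(p, 10.0)
w[:3] = 0.1

# Weighted Lasso via column rescaling: fit on X_j / w_j,
# then undo the rescaling on the coefficients.
model = Lasso(alpha=0.1).fit(X / w, y)
beta = model.coef_ / w
```

With this weighting, the heavily penalized features are driven exactly to zero while the lightly penalized ones retain coefficients near their true values, which is the intended effect of informative penalty factors.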
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: See responses to the reviewers.
Assigned Action Editor: ~Ankit_Singh_Rawat1
Submission Number: 6278