Abstract: We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso $\ell_1$ regression. LLM-Lasso incorporates domain-specific knowledge extracted from natural language, optionally enhanced through a retrieval-augmented generation (RAG) pipeline, as penalty factors in the Lasso's $\ell_1$ regularization term, integrating data-driven modeling with metadata from natural language. To our knowledge, this is the first embedded LLM-driven feature selector. By design, LLM-Lasso addresses a key robustness challenge of LLM-driven feature selection: the risk of hallucinated or low-quality LLM responses. An internal cross-validation step is crucial to LLM-Lasso's performance, determining how heavily the prediction pipeline relies on the LLM's outputs; when LLM generation quality is poor, this procedure lets LLM-Lasso recover the behavior of the plain Lasso. In some biomedical case studies, LLM-Lasso outperforms the standard Lasso and existing feature-selection baselines.
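The abstract's core mechanism, per-feature penalty factors in the Lasso's $\ell_1$ term, can be sketched with the standard column-rescaling trick: solving $\min_\beta \tfrac{1}{2n}\|y - X\beta\|_2^2 + \alpha \sum_j w_j |\beta_j|$ is equivalent to running a plain Lasso on $X_j / w_j$ and dividing the fitted coefficients by $w_j$. The weights `w` below are hypothetical stand-ins for LLM-derived penalty factors (the paper's actual pipeline and weight values are not shown here); this is a minimal sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.5, 1.0]  # only the first 3 features matter
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Hypothetical penalty factors (stand-in for LLM outputs):
# small w_j = feature deemed relevant, lightly penalized;
# large w_j = feature deemed irrelevant, heavily penalized.
w = np.full(p, 10.0)
w[:3] = 0.1

# Weighted Lasso via column rescaling: fit on X_j / w_j,
# then undo the rescaling on the coefficients.
model = Lasso(alpha=0.1).fit(X / w, y)
beta = model.coef_ / w
```

With this weighting, the heavily penalized features are driven exactly to zero while the lightly penalized ones retain coefficients near their true values, which is the intended effect of informative penalty factors.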
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: See responses to the reviewers.
Assigned Action Editor: ~Ankit_Singh_Rawat1
Submission Number: 6278