Latent Feature Mining for Predictive Model Enhancement with Large Language Models

27 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: data mining, large language models, feature extraction, criminal justice, AI for social good, AI for healthcare
TL;DR: We propose a novel framework that leverages large language models to mine latent features through text-to-text propositional reasoning, enhancing the performance of predictive models in data-constrained domains such as criminal justice and healthcare.
Abstract:

Predictive modeling often faces challenges due to limited data availability and quality, especially in domains where the collected features are weakly correlated with outcomes and where additional data collection is constrained by ethical or practical difficulties. Traditional machine learning (ML) models struggle to incorporate unobserved yet critical factors. In this work, we introduce a novel approach that formulates latent feature mining as text-to-text propositional logical reasoning, and we propose FLAME (Faithful Latent FeAture Mining for Predictive Model Enhancement), a framework that leverages large language models (LLMs) to augment observed features with latent features, enhancing the predictive power of ML models in downstream tasks. The framework is generalizable across domains with minimal domain-specific customization, ensuring easy transfer to other areas facing similar challenges in data availability. We validate our framework with two case studies: (1) the criminal justice system, a domain characterized by limited and ethically challenging data collection, and (2) the healthcare domain, where patient privacy concerns and the complexity of medical data often limit comprehensive feature collection. Our results show that the inferred latent features align well with ground-truth labels and significantly enhance the performance of downstream classifiers.
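
To make the pipeline described above concrete, here is a minimal Python sketch of the general idea (not the authors' released code): observed features are verbalized into a propositional reasoning prompt, an LLM judges whether each candidate latent proposition holds, and the inferred values are appended to the feature vector of a downstream classifier. The `query_llm` callable, the prompt wording, and the example latent factors are illustrative assumptions.

```python
# Hedged sketch of LLM-based latent feature mining as text-to-text propositional reasoning.
# `query_llm`, the prompt template, and LATENT_FEATURES are hypothetical placeholders.
from typing import Callable, Dict, List

LATENT_FEATURES = ["stable_housing", "social_support"]  # hypothetical latent factors


def build_prompt(observed: Dict[str, str], latent_name: str) -> str:
    """Frame latent feature inference as a propositional reasoning query over observed facts."""
    facts = "; ".join(f"{k} = {v}" for k, v in observed.items())
    return (
        f"Premises: {facts}.\n"
        f"Proposition: the individual has {latent_name.replace('_', ' ')}.\n"
        "Answer strictly 'true' or 'false' based on the premises."
    )


def mine_latent_features(observed: Dict[str, str], query_llm: Callable[[str], str]) -> List[int]:
    """Return binary latent features inferred by the LLM (1 = proposition judged true)."""
    values = []
    for name in LATENT_FEATURES:
        answer = query_llm(build_prompt(observed, name)).strip().lower()
        values.append(1 if answer.startswith("true") else 0)
    return values


def augment(observed_vector: List[float],
            observed_text: Dict[str, str],
            query_llm: Callable[[str], str]) -> List[float]:
    """Concatenate observed features with LLM-mined latent features for the downstream model."""
    return observed_vector + [float(v) for v in mine_latent_features(observed_text, query_llm)]
```

The augmented vectors would then be fed to any standard classifier; the sketch leaves the choice of LLM, prompt design, and downstream model open, since the abstract does not specify them.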

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11865