Keywords: Latent Feature Extraction, Open Information Extraction, Bias, Interpretability
TL;DR: DSAI is a framework that enables unbiased, interpretable feature extraction, addressing LLMs' data grounding issues.
Abstract: Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets because they rely on pre-trained knowledge rather than on actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications to real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification.\footnote{The title of our paper is chosen from multiple candidates based on DSAI-generated criteria.}
Submission Number: 10