DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI

Hyowon Cho; Soonwon Ka; Daechul Park; Jaewook Kang; Bokyung Son

Published: 09 Jun 2025, Last Modified: 08 Jul 2025. Venue: KDD 2025 Workshop SciSocLLM. License: CC BY 4.0

Keywords: Latent Feature Extraction, Open Information Extraction, Bias, Interpretability

TL;DR: DSAI is a framework that enables unbiased, interpretable feature extraction, addressing LLMs' data grounding issues.

Abstract: Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets because they rely on pre-trained knowledge rather than on actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications to real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. (Footnote: The title of our paper was chosen from multiple candidates based on DSAI-generated criteria.)

Submission Number: 10
