DSAI: Unbiased and Interpretable Latent Feature Extraction for Data-Centric AI

ACL ARR 2025 February Submission6692 Authors

16 Feb 2025 (modified: 09 May 2025), ACL ARR 2025 February Submission, CC BY 4.0

Abstract: Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets because they rely on pre-trained knowledge rather than on actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications to real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification. (Footnote: The title of our paper was chosen from multiple candidates based on DSAI-generated criteria.)

Paper Type: Long

Research Area: Information Extraction

Research Area Keywords: open information extraction, zero/few-shot extraction, event extraction

Contribution Types: NLP engineering experiment

Languages Studied: English

Submission Number: 6692
