Abstract: Large language models (LLMs) often struggle to objectively identify latent characteristics in large datasets due to their reliance on pre-trained knowledge rather than actual data patterns. To address this data grounding issue, we propose Data Scientist AI (DSAI), a framework that enables unbiased and interpretable feature extraction through a multi-stage pipeline with quantifiable prominence metrics for evaluating extracted features. On synthetic datasets with known ground-truth features, DSAI demonstrates high recall in identifying expert-defined features while faithfully reflecting the underlying data. Applications on real-world datasets illustrate the framework's practical utility in uncovering meaningful patterns with minimal expert oversight, supporting use cases such as interpretable classification.\footnote{The title of our paper is chosen from multiple candidates based on DSAI-generated criteria.}
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: open information extraction, zero/few-shot extraction, event extraction
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 6692