A State-Vector Framework for Dataset Effects

Esmat Sahak; Zining Zhu; Frank Rudzicz

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Main

Submission Type: Regular Long Paper

Submission Track: Interpretability, Interactivity, and Analysis of Models for NLP

Submission Track 2: Language Modeling and Analysis of Language Models

Keywords: data influence, probing, fine-tuning, multi-task learning, datasets

TL;DR: We propose a state-vector framework to analyze the effects of datasets on deep learning language models using probing tasks.

Abstract: The impressive success of recent deep neural network (DNN)-based systems is significantly influenced by the high-quality datasets used in training. However, the effects of the datasets, especially how they interact with each other, remain underexplored. We propose a state-vector framework to enable rigorous studies in this direction. The framework uses idealized probing test results as the bases of a vector space, which allows us to quantify the effects of both standalone and interacting datasets. We show that the significant effects of some commonly used language understanding datasets are characteristic and are concentrated on a few linguistic dimensions. Additionally, we observe some "spill-over" effects: the datasets could impact the models along dimensions that may seem unrelated to the intended tasks. Our state-vector framework paves the way for a systematic understanding of the dataset effects, a crucial component in responsible and robust model development.
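The idea in the abstract can be illustrated with a minimal sketch. It rests on assumptions not stated on this page: a model's state is taken to be a vector of probing scores, a dataset's effect is the change in that vector after training on the dataset, and an interaction effect is the deviation of the joint effect from the sum of the individual effects. The probe names and function names below are hypothetical, not the authors' code.

```python
from typing import Dict, List

# Hypothetical probing-task names (assumption: the page does not list them).
PROBES: List[str] = ["probe_a", "probe_b", "probe_c"]


def state_vector(scores: Dict[str, float]) -> List[float]:
    """Arrange probing scores in a fixed order, one dimension per probe."""
    return [scores[p] for p in PROBES]


def effect(before: List[float], after: List[float]) -> List[float]:
    """Assumed definition: effect = state after training minus state before."""
    return [a - b for a, b in zip(after, before)]


def interaction(
    base: List[float],
    with_x: List[float],
    with_y: List[float],
    with_both: List[float],
) -> List[float]:
    """Assumed definition: joint effect minus the sum of individual effects."""
    e_x = effect(base, with_x)
    e_y = effect(base, with_y)
    e_xy = effect(base, with_both)
    return [j - (x + y) for j, x, y in zip(e_xy, e_x, e_y)]
```

Under these assumptions, a dimension where the interaction is zero means the two datasets act additively on that probe, while a nonzero entry indicates that they reinforce or cancel each other there.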

Submission Number: 4422
