Keywords: Interpretability and Analysis of Models for NLP, Language Modeling, NLP Applications
TL;DR: We present the wrapper box pipeline to combine neural performance with faithful attribution of model decisions to training data.
Abstract: Can we preserve the accuracy of neural models while also providing faithful explanations of model decisions via training data? We propose a "wrapper box" pipeline: training a neural model as usual and then using its learned feature representation in classic, interpretable models to perform prediction. Across seven language models of varying sizes, including four large language models (LLMs), two datasets at different scales, three classic models, and four evaluation metrics, we first show that the predictive performance of the wrapper classic models is largely comparable to that of the original neural models.
Because classic models are transparent, each model decision is determined by a known set of training examples that can be shown directly to users. Our pipeline thus preserves the predictive performance of neural language models while faithfully attributing classic model decisions to training data. Among other use cases, such attribution enables model decisions to be contested on the basis of the training instances responsible for them. Compared to prior work, our approach achieves higher coverage and correctness in identifying which training data to remove to change a model decision. To reproduce our findings, our source code is available at: https://github.com/SamSoup/WrapperBox.
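For concreteness, here is a minimal sketch of the wrapper box idea, assuming a Hugging Face encoder (`bert-base-uncased` is purely illustrative) and scikit-learn's kNN as the classic wrapper model. This is an expository sketch, not our released implementation (see the repository above):

```python
# Illustrative sketch: wrap a classic kNN model around a neural encoder's
# learned representation. Model names and helper functions are assumptions
# for exposition only.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.neighbors import KNeighborsClassifier

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    """Encode texts with the (fine-tuned) neural model's representation."""
    with torch.no_grad():
        batch = tokenizer(texts, padding=True, truncation=True,
                          return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state
        return hidden[:, 0, :].numpy()  # one [CLS] embedding per example

# Fit the classic, transparent model on the learned features.
train_texts, train_labels = ["great movie", "terrible plot"], [1, 0]
knn = KNeighborsClassifier(n_neighbors=1).fit(embed(train_texts), train_labels)

# The prediction is now faithfully attributable: the decision is fully
# determined by the retrieved training neighbors, which can be shown to users.
X_test = embed(["what a fantastic film"])
pred = knn.predict(X_test)
_, neighbor_idx = knn.kneighbors(X_test)
print(pred, [train_texts[i] for i in neighbor_idx[0]])
```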
Comment: Dear BlackBoxNLP organizers,
We submitted this paper to the ARR June cycle, and our AC/meta-reviewer indicated that this work is a good fit for BlackBoxNLP, albeit with some revisions needed in response to concerns raised by reviewers. We briefly summarize the main concerns here, along with our plan to address them.
Reviewer nEDh noted a lack of novelty, but as the AC and other reviewers pointed out, this can be addressed with minor revisions. While the wrapper box is inspired by prior work (which we cite), we 1) clearly note optimizations in space and time complexity in the Related Work section (lines 227-238), 2) generalize to a suite of classic models (beyond just kNN), and 3) propose a novel application to algorithmic recourse using white-box, case-based models. As suggested by AC esWo and Reviewer nEDh, we plan to re-frame our main contribution “not as proposing the concept of white-box models operating on black-box model representations, but rather as an in-depth analysis of existing methods”, which both reviewers deemed important.
Reviewer DdGq noted a lack of clarity in how we define faithfulness and what counts as “interpretable by design”. Our discussion of faithfulness (Section 3) mainly refers to the concept of completeness grounded in prior work (Gu et al., 2023), which we plan to incorporate into the next revision for better theoretical grounding. As suggested, we will reframe our work as building “(training-data-)attributable-by-design” models, which the reviewer believes is an important novel contribution.
Reviewer teUQ raised more minor concerns about the reported results and baselines. We note that our wrapper boxes are not intended to outperform the original neural model, but rather aim to provide faithful sample attributions (for the classic wrapper models) while remaining competitive with neural performance (lines 88-94). For the comparisons with baselines from Yang (2023), we clearly note that their approach cannot extend to our classic case-based models (lines 549-554), and no prior work has reported using case-based models for the task of minimum subset selection (Section 6).
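To make minimum subset selection concrete for a case-based wrapper model: with kNN, one can directly compute a small set of training examples whose removal flips a prediction, because the decision is fully determined by the neighbor vote. The greedy sketch below is illustrative only; the function name and exact procedure are assumptions for exposition, not our paper's method:

```python
# Illustrative greedy sketch: remove the nearest remaining training examples
# that carry the originally predicted label until the kNN vote on x_test
# flips. Not the paper's exact procedure.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def minimal_flip_set(X_train, y_train, x_test, k=3):
    """Return indices of training examples whose removal flips the kNN
    prediction on x_test (hypothetical helper for exposition)."""
    X_train, y_train = np.asarray(X_train), np.asarray(y_train)
    alive = np.arange(len(y_train))  # training indices still in play
    original = KNeighborsClassifier(n_neighbors=k).fit(
        X_train, y_train).predict([x_test])[0]
    removed = []
    while (y_train[alive] == original).any():
        knn = KNeighborsClassifier(n_neighbors=min(k, len(alive))).fit(
            X_train[alive], y_train[alive])
        if knn.predict([x_test])[0] != original:
            return removed  # decision flipped; removed set is sufficient
        # rank remaining examples by distance to the test point, then
        # drop the closest one that carries the original label
        _, order = knn.kneighbors([x_test], n_neighbors=len(alive))
        for j in order[0]:
            if y_train[alive[j]] == original:
                removed.append(int(alive[j]))
                alive = np.delete(alive, j)
                break
    return removed  # degenerate case: all same-label examples removed
```

Because the kNN decision is determined entirely by its retrieved neighbors, such a removal set is verifiable by construction, which is why gradient-based baselines that approximate influence do not directly extend to this setting.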
Thank you for considering our paper.
References:
[Peijian Gu, Yaozong Shen, Lijie Wang, Quan Wang, Hua Wu, and Zhendong Mao. 2023. IAEval: A Comprehensive Evaluation of Instance Attribution on Natural Language Understanding. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 11966–11977, Singapore. Association for Computational Linguistics.](https://aclanthology.org/2023.findings-emnlp.801/)
Paper Link: https://openreview.net/forum?id=s8xL5e109H
Submission Number: 1