A Neural Sandbox Framework for Discovering Spurious Concepts in LLM Decisions

24 Sept 2023 (modified: 11 Feb 2024) Submitted to ICLR 2024
Supplementary Material: zip
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Large Language Model, Spurious Correlation, NLP, AI Alignment
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: We introduce a neural sandbox framework for text classification via self-referencing defined label concepts from a Large Language Model (LLM). The framework draws inspiration from the define-optimize alignment problem, in which the motivations of a model are described initially and then the model is optimized to align with these predefined objectives. In our case, we design the framework to perform text classification. We use a frozen LLM as a vector embedding generator for text and provide our framework with defined concept words based on the labels, along with the input text. We then optimize an operator to classify the input text based on its relevance scores to the concept operator words (cop-words). In experiments with multiple text classification datasets and LLMs, we find that incorporating our sandbox network generally improves accuracy by 0.12\% to 6.31\% and macro F1 by 0.3\% to 8.82\% compared to a baseline. The framework serves not only as a classification tool but also as a descriptive tool that explains the model's predictions in terms of the provided cop-words. Through further evaluations involving the injection of "foreign" cop-words, we show that the sandbox framework exhibits a coherent understanding of learned concepts, and we construct methodologies to discover potential spurious behaviors and biases within it. Although our results confirm the network's ability to capture domain knowledge, we present evidence that the model's secondary incentives do not match human decisions.
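To make the described pipeline concrete, below is a minimal PyTorch-style sketch of one possible instantiation: a frozen LLM supplies embeddings for both the input text and the label-defined cop-words, and a learned operator maps text embeddings into the cop-word space, where relevance scores are aggregated into per-label logits. The class name `SandboxClassifier`, the single linear operator, and the mean aggregation over cop-word scores are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SandboxClassifier(nn.Module):
    """Hypothetical sketch: scores input text against label-defined concept
    operator words (cop-words) using embeddings from a frozen LLM."""

    def __init__(self, embed_dim: int, num_labels: int, copwords_per_label: int):
        super().__init__()
        # Learned operator that projects text embeddings into the cop-word space.
        self.operator = nn.Linear(embed_dim, embed_dim)
        self.num_labels = num_labels
        self.copwords_per_label = copwords_per_label

    def forward(self, text_emb: torch.Tensor, copword_emb: torch.Tensor) -> torch.Tensor:
        # text_emb:    (batch, embed_dim), produced by the frozen LLM
        # copword_emb: (num_labels * copwords_per_label, embed_dim), also from the frozen LLM
        projected = self.operator(text_emb)            # (batch, embed_dim)
        scores = projected @ copword_emb.T             # relevance score to each cop-word
        scores = scores.view(-1, self.num_labels, self.copwords_per_label)
        # Aggregate cop-word relevance into one logit per label (assumed aggregation).
        return scores.mean(dim=-1)                     # (batch, num_labels)
```

Under this reading, only the operator is trained (e.g., with cross-entropy on the label logits), while the LLM embeddings stay frozen; the per-cop-word relevance scores are what make the prediction inspectable and allow injecting "foreign" cop-words to probe for spurious behavior.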
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9041