Neural Sandbox Framework for Classification: A Concept Based Method of Leveraging LLMs for Text Classification

Published: 01 Nov 2023, Last Modified: 12 Dec 2023 (R0-FoMo Poster)
Keywords: Large Language Model, Spurious Correlation, NLP, AI Alignment, Explainable AI
Abstract: We introduce a neural sandbox framework for text classification via self-referencing defined label concepts from a Large Language Model (LLM). The framework draws inspiration from the define-optimize alignment problem, in which a model's motivations are first specified and the model is then optimized to align with these predefined objectives. In our case, we focus on text classification: a pre-trained LLM converts the input text into vectors, and we provide specific concept words derived from the dataset labels. Keeping the LLM frozen, we then optimize an operator to classify the input text according to its relevance to these concept operator words (cop-words). Beyond exhibiting explainable features, experiments across multiple text classification datasets and LLMs show that incorporating our sandbox network generally improves accuracy and macro F1 over a baseline. The framework not only improves classification but also offers insight into the model's decision making via the relevance scores of the provided cop-words. We also demonstrate the framework's ability to generalize learned concepts and to identify potential biases through spurious correlations. However, we find that the model's incentives may not always align with human decisions.
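The abstract's mechanism (frozen LLM embeddings scored against cop-word embeddings through a trainable operator, with the relevance scores driving classification) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the function names, the linear operator, the cosine-similarity scoring, and the linear classification head are all assumptions, and random vectors stand in for the frozen LLM's embeddings.

```python
import numpy as np

def relevance_scores(text_vec, cop_word_vecs, operator):
    """Project the text embedding through the trainable operator, then
    score its cosine similarity against each frozen cop-word embedding.
    (Cosine scoring and the linear operator are assumptions.)"""
    projected = operator @ text_vec                      # (d,)
    norms = np.linalg.norm(cop_word_vecs, axis=1) * np.linalg.norm(projected)
    return (cop_word_vecs @ projected) / np.maximum(norms, 1e-9)  # (k,)

def classify(text_vec, cop_word_vecs, operator, head_w, head_b):
    """Map per-cop-word relevance scores to label logits with a linear
    head (hypothetical); the scores themselves are the explainable part."""
    rel = relevance_scores(text_vec, cop_word_vecs, operator)
    return head_w @ rel + head_b, rel

# Toy usage: random stand-ins for frozen-LLM embeddings.
rng = np.random.default_rng(0)
d, k, n_labels = 8, 4, 3                      # embed dim, cop-words, labels
text_vec = rng.normal(size=d)                 # would come from the frozen LLM
cop_word_vecs = rng.normal(size=(k, d))       # embeddings of the cop-words
operator = np.eye(d)                          # trainable; identity for the demo
head_w, head_b = rng.normal(size=(n_labels, k)), np.zeros(n_labels)
logits, rel = classify(text_vec, cop_word_vecs, operator, head_w, head_b)
```

In this sketch only `operator`, `head_w`, and `head_b` would be trained; the embeddings stay frozen, matching the abstract's define-optimize framing where the LLM is fixed and only the operator is optimized.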
Submission Number: 106