Abstract: Large generative language models (GLMs) provide a versatile tool for solving a wide variety of natural language processing tasks. GLM responses, though, are provided in the form of text, without an indication of the model's confidence in the answer. This limits the usability of these models in high-risk applications, where decisions based on an incorrect answer can have severe consequences. In this work, we focus on the problem of generating reliable class posterior distributions for text classification tasks such as sentiment, news category, and intent classification. These posteriors can be used for decision making and as interpretable scores for the user. We show that the naive approach of deriving class posteriors from the token posteriors produced by the GLM yields extremely poor posteriors. We then explore different adaptation approaches for improving the quality of posteriors, focusing on low-resource scenarios where only a small amount of data is available for adaptation. We show that parameter-efficient supervised fine-tuning (SFT), while providing large gains in terms of decision quality, produces suboptimal posteriors due to overfitting. To address this problem, we propose an approach that combines SFT and post-hoc calibration (PHC) using a three-stage training strategy, improving the quality of both posteriors and categorical decisions.
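As a rough illustration of the two ingredients mentioned in the abstract (not the authors' exact pipeline), the sketch below shows how class posteriors can be formed by normalizing the GLM's log-probabilities of the verbalized class labels, and how a single temperature can then be fitted on held-out data as a simple form of post-hoc calibration. The function names are illustrative, and the label log-probabilities are assumed to have been obtained by scoring each class label with the GLM.

```python
import math
from typing import Sequence

def class_posteriors(label_logprobs: Sequence[float], temperature: float = 1.0) -> list[float]:
    """Turn the GLM's log-probabilities of the verbalized class labels
    into a class posterior via a (temperature-scaled) softmax."""
    scaled = [lp / temperature for lp in label_logprobs]
    m = max(scaled)                       # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def fit_temperature(held_out: Sequence[tuple[Sequence[float], int]],
                    grid: Sequence[float] = tuple(0.25 * k for k in range(1, 41))) -> float:
    """Grid-search the temperature that minimizes cross-entropy on held-out
    (label_logprobs, true_class_index) pairs -- a simple post-hoc calibration step."""
    best_t, best_ce = 1.0, float("inf")
    for t in grid:
        ce = 0.0
        for logprobs, true_idx in held_out:
            post = class_posteriors(logprobs, temperature=t)
            ce -= math.log(max(post[true_idx], 1e-12))
        ce /= max(len(held_out), 1)
        if ce < best_ce:
            best_t, best_ce = t, ce
    return best_t
```

In this toy setting, the uncalibrated posteriors correspond to `temperature = 1.0`; richer PHC methods (or the SFT-plus-PHC strategy studied in the paper) replace the single fitted scalar with trained parameters.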
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: This revision addresses the comments from the reviewers. The main changes are highlighted in red in the PDF. Here we summarize the main changes:
- Added more references and further justification of our choice of baseline systems to compare against in Section 2.
- Extended the explanation in Section 4 to further motivate the choice of the cross-entropy metric for evaluation of posterior probabilities.
- Added more regularization and calibration methods to the experiments. Importantly, the main conclusions of the paper remain the same as before, as none of the additional experiments outperformed our prior best approach.
- Added results for some additional metrics (Brier score and calibration metrics) in the appendices.
A detailed description of the changes can be found in a comment below.
Assigned Action Editor: ~Hsuan-Tien_Lin1
Submission Number: 5498