What are the Desired Characteristics of Calibration Sets? Identifying Correlates on Long Form Scientific Summarization

21 Dec 2022 (modified: 21 Dec 2022) · OpenReview Anonymous Preprint Blind Submission · Readers: Everyone
Keywords: natural language processing, summarization, contrastive learning, biomedical informatics
TL;DR: We aim to uncover the underlying characteristics of effective candidate sets for both relevance and faithfulness calibration.
Abstract: Summarization models are typically trained to maximize the likelihood of a single reference (MLE). As a consequence, during inference, the probabilities assigned to model generations are often poorly calibrated to quality metrics. To address this, after an initial MLE step, recent work has added a calibration step, which exposes a model to its own ranked outputs to improve relevance or, in a separate line of work, contrasts positive and negative sets to improve faithfulness. While effective, much of this work has focused on how to generate and optimize these sets. Less is known about why one setup is more effective than another. In this work, we aim to uncover the underlying characteristics of effective candidate sets for both relevance and faithfulness calibration. For each training instance, we form a large, diverse pool of candidates and systematically vary the subsets used for calibration fine-tuning. Each selection strategy targets distinct aspects of the sets, such as lexical diversity or the size of the gap between positives and negatives. On three diverse scientific long-form summarization datasets (spanning biomedical, clinical, and chemical domains), we find, among other results, that faithfulness calibration is optimal when the negative sets are extractive and more likely to be generated, whereas for relevance calibration, the metric margin between ranked candidates should be maximized and surprise minimized.
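The abstract's finding that relevance calibration benefits from maximizing the metric margin between ranked candidates suggests a concrete selection rule. Below is a minimal sketch of one way such a subset might be chosen from a scored candidate pool; the function name, the example scores, and the max-min-gap criterion are all illustrative assumptions, not the paper's actual implementation.

```python
from itertools import combinations

def select_max_margin_subset(candidates, k=4):
    """Pick the k-candidate subset whose ranked candidates are best separated.

    `candidates` is a list of (summary_text, metric_score) pairs, e.g. model
    outputs scored with ROUGE against the reference. Returns the chosen
    candidates sorted by score, maximizing the smallest metric gap between
    consecutively ranked neighbors (a hypothetical margin criterion).
    """
    best_subset, best_margin = None, float("-inf")
    # Exhaustive search over subsets; fine for small pools, too slow for large ones.
    for subset in combinations(candidates, k):
        ranked = sorted(subset, key=lambda c: c[1], reverse=True)
        # Worst-case (smallest) gap between adjacent ranked candidates;
        # maximizing it keeps every pair of neighbors well separated.
        margin = min(ranked[i][1] - ranked[i + 1][1] for i in range(k - 1))
        if margin > best_margin:
            best_subset, best_margin = ranked, margin
    return best_subset

# Example: a candidate pool with hypothetical relevance scores.
pool = [("summary A", 0.42), ("summary B", 0.35), ("summary C", 0.31),
        ("summary D", 0.18), ("summary E", 0.05)]
print(select_max_margin_subset(pool, k=3))
```

A greedy variant (repeatedly adding the candidate that preserves the largest minimum gap) would scale better to the large pools the abstract describes; the exhaustive version above is only meant to make the margin criterion concrete.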