Abstract: Large language models (LLMs) often generate convincing, fluent explanations. However, unlike humans, they often generate $\textit{inconsistent}$ explanations on different inputs. For example, an LLM may generate the explanation "$\textit{all birds can fly}$" when answering the question "$\textit{Can sparrows fly?}$" while answering "$\textit{no}$" to the related question "$\textit{Can penguins fly?}$". Explanations should be consistent across related examples so that they allow a human to simulate the LLM's decision process on multiple examples. We propose $\textbf{explanation-consistency finetuning}$ (EC-finetuning), a method that adapts LLMs to generate more consistent natural-language explanations on related examples. EC-finetuning involves finetuning LLMs on synthetic data that is carefully constructed to contain consistent explanations. Across a variety of question-answering datasets in various domains, EC-finetuning yields a $\textbf{10.0\%}$ relative improvement in explanation consistency on four finetuning datasets and generalizes to seven out-of-distribution datasets not seen during finetuning ($\textbf{+4.5\%}$ relative). We will make our code available for reproducibility.
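The sketch below is a minimal illustration of the general idea behind EC-finetuning as described in the abstract, not the paper's actual pipeline: related questions are grouped with a single shared explanation, and supervised finetuning records are emitted so that the target outputs reuse that explanation consistently. All names, the toy data, and the JSONL record format are assumptions made for illustration only.

```python
import json

# Hypothetical toy data: groups of related questions that should share one
# consistent explanation. The paper constructs such data synthetically; here
# it is hard-coded purely for illustration.
related_groups = [
    {
        "explanation": "Most birds can fly, but flightless species such as "
                       "penguins and ostriches cannot.",
        "qa_pairs": [
            ("Can sparrows fly?", "Yes"),
            ("Can penguins fly?", "No"),
        ],
    },
]


def build_ec_finetuning_records(groups):
    """Turn each group into supervised finetuning examples whose target
    outputs reuse the same explanation across all related questions."""
    records = []
    for group in groups:
        for question, answer in group["qa_pairs"]:
            records.append({
                "prompt": f"Question: {question}\nAnswer with an explanation.",
                "completion": f"{answer}. Explanation: {group['explanation']}",
            })
    return records


if __name__ == "__main__":
    # Write JSONL that a standard supervised finetuning pipeline could consume.
    with open("ec_finetuning_data.jsonl", "w") as f:
        for record in build_ec_finetuning_records(related_groups):
            f.write(json.dumps(record) + "\n")
```

Because every record in a group carries the same explanation, a model finetuned on such data is encouraged to produce explanations that remain consistent across related inputs, which is the property the abstract measures.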
Paper Type: short
Research Area: Interpretability and Analysis of Models for NLP
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English