Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval

TMLR Paper 1456 Authors

09 Aug 2023 (modified: 17 Sept 2024) · Withdrawn by Authors · CC BY 4.0
Abstract: When provided with sufficient explanatory context, smaller Language Models have been shown to exhibit strong reasoning ability on challenging short-answer question-answering tasks where the questions are unseen in training. We evaluate two methods for further improvement in this setting. Both methods focus on combining rationales generated by a larger Language Model with longer contexts created by a multi-hop dense retrieval system. The first method ($\textit{RR}$) involves training a Rationale Ranking model to score both generated rationales and retrieved contexts with respect to relevance and truthfulness. We then use the scores to derive combined contexts from both knowledge sources using a number of combinatory strategies. For the second method ($\textit{RATD}$), we utilise retrieval-augmented training datasets developed by Hartill et al. (2023) to train a smaller Reasoning model such that it becomes proficient at utilising relevant information from longer text sequences that may be only partially evidential and frequently contain many irrelevant sentences. We find that both methods significantly improve results. Our single best Reasoning model materially improves upon strong comparable prior baselines for unseen evaluation datasets (StrategyQA 58.9 $\rightarrow$ 61.7 acc., CommonsenseQA 63.6 $\rightarrow$ 72.7 acc., ARC-DA 31.6 $\rightarrow$ 52.1 F1, IIRC 25.5 $\rightarrow$ 27.3 F1), and a version that utilises our prior knowledge of each question type in selecting a context combination strategy does even better. Our proposed models also generally outperform direct prompts against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot chain-of-thought and standard few-shot settings.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Updated to incorporate additional requests from reviewers. Added to the last paragraph of Section 2.1 for reviewer rm5n: "We note that robust examination of rationale quality is presently challenging to perform and believe research into automated methods in this area represents a promising future direction." Updated for reviewer XoeM: revised to use clearer placeholders in prompt templates per suggestion. Prior revision details are unchanged: notably, claims have been scoped more carefully, statistical significance test results are incorporated, new figures have been added, further description of Hartill et al. (2023) has been added, table captions have been clarified, and readability has generally been improved throughout. Text in the experimental results has been reworked to better highlight that the RR and RATD methods each significantly improve results in isolation from one another.
Assigned Action Editor: ~Ahmad_Beirami1
Submission Number: 1456