Getting MoRE out of Mixture of Language Model Reasoning Experts

Chenglei Si; Weijia Shi; Chen Zhao; Luke Zettlemoyer; Jordan Lee Boyd-Graber

Published: 07 Oct 2023, Last Modified: 01 Dec 2023, EMNLP 2023 Findings

Submission Type: Regular Long Paper

Submission Track: Question Answering

Keywords: Large Language Models, Reasoning, Prompting, Generalization, Calibration, Interpretability

TL;DR: We propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models for more generalizable and interpretable reasoning.

Abstract: While recent large language models (LLMs) improve on various question answering (QA) datasets, it remains difficult for a single model to generalize across question types that require distinct reasoning abilities. We provide empirical evidence that state-of-the-art LLMs generalize poorly to reasoning types beyond those seen in the prompt. To remedy this, we propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models. We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning. Our key insight is to leverage agreement among the specialized experts to select the best answer for each question, or to abstain from answering. This gives MoRE higher accuracy than any single specialized model on a collection of 12 QA datasets from four reasoning types. Beyond generalizability, the interpretable design of MoRE improves selective question answering results compared to baselines that do not incorporate inter-expert agreement. This framework is also more interpretable and useful to human consumers of QA outputs. Our human study confirms that presenting expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system’s output. We release all code and data to facilitate future work.
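
The abstract's key idea, using agreement among specialized experts to pick an answer or abstain, can be illustrated with a minimal sketch. The abstract does not specify the selection mechanism, so this example substitutes simple plurality voting with an agreement threshold; the function names, the `min_agreement` parameter, and the string normalization are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Optional


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def select_answer(predictions: dict[str, str], min_agreement: int = 2) -> Optional[str]:
    """Pick the answer most experts agree on, or return None to abstain.

    `predictions` maps a (hypothetical) expert name to that expert's answer.
    `min_agreement` is an assumed threshold: abstain unless at least this
    many experts give the same normalized answer.
    """
    if not predictions:
        return None
    counts = Counter(normalize(a) for a in predictions.values())
    # If several answers share the top count, the one counted first is returned.
    answer, votes = counts.most_common(1)[0]
    return answer if votes >= min_agreement else None


# Usage: one prediction per reasoning category named in the abstract.
preds = {
    "factual": "Paris",
    "multihop": "paris ",
    "math": "42",
    "commonsense": "Lyon",
}
print(select_answer(preds))
```

Here two experts agree after normalization, so their answer is selected; if every expert disagreed, the function would return `None`, which corresponds to abstaining in the selective QA setting.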

Submission Number: 2088
