Keywords: reasoning, question answering
TL;DR: Exploring alternatives to left-to-right factorization, we show that the order of factorization can affect LLM performance on tasks such as logical reasoning, commonsense understanding, and truthfulness, with insights into when each factorization is most beneficial.
Abstract: Language models usually use left-to-right (L2R) autoregressive factorization.
However, L2R factorization may not always be the best inductive bias.
Therefore, we investigate whether alternative factorizations of the text distribution are beneficial for some tasks.
We study right-to-left (R2L) training as a compelling alternative, focusing on multiple-choice questions (MCQs) as a test bed for knowledge extraction and reasoning. Through extensive experiments across model sizes (2B-8B parameters) and training datasets, we find that L2R models are not always preferred over R2L models on MCQ benchmarks, especially on logical reasoning, commonsense understanding, and truthfulness assessment tasks. Our analysis reveals that this performance difference may be fundamentally linked to multiple factors, including calibration, computability, and directional conditional entropy.
We ablate the impact of these factors through controlled simulation studies using arithmetic tasks, where the impacting factors can be better disentangled.
Our work demonstrates that exploring alternative factorizations of the text distribution can improve LLM capabilities, and it provides theoretical insights into which factorization best approximates the human language distribution and when each reasoning order is more advantageous.
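For concreteness, the two factorizations compared above can be written as follows (our notation, not taken from the submission): for a token sequence x_1, ..., x_T, the chain rule admits both a left-to-right and a right-to-left decomposition.

```latex
% L2R: each token is conditioned on its left context
p_{\mathrm{L2R}}(x_1, \dots, x_T) = \prod_{t=1}^{T} p\big(x_t \mid x_1, \dots, x_{t-1}\big)

% R2L: each token is conditioned on its right context
p_{\mathrm{R2L}}(x_1, \dots, x_T) = \prod_{t=1}^{T} p\big(x_t \mid x_{t+1}, \dots, x_T\big)
```

Both decompositions are exact by the chain rule; they assign the same joint probability in principle but require the model to learn different families of conditionals, which is why the choice of direction can act as an inductive bias.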
Supplementary Material: pdf
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4279