Keywords: reasoning, CoT, fact verification, entailment, NLI
TL;DR: This paper introduces a novel framework for evaluating LLMs' NLI reasoning in fact verification that traces the specific inference types involved, applied through both manual and automated analyses of model reasoning traces.
Abstract: The reasoning traces generated by Large Language Models (LLMs) are increasingly used to improve final predictions, enable reinforcement learning based on reasoning trace correctness, and justify model outputs to users. Their recognized utility has spurred a line of work on evaluating LLM reasoning quality. However, current reasoning evaluation methods are typically generic and do not shed light on the different reasoning types that may be required for various complex tasks. In this paper, we investigate reasoning quality for the prominent task of *Fact Verification*, where a model should determine whether a given claim is entailed by a reference source text, a fundamental process known as Natural Language Inference (NLI). Specifically, we propose a novel evaluation framework that considers the prominent types of inference steps involved in NLI reasoning: hypothesis *decomposition* into individual facts, followed by source *attribution* and an *entailment* decision for each fact, and finally *aggregation* of fact-level decisions into the final entailment classification. Our protocol introduces fine-grained metrics to assess both the existence (whether a step was performed) and the quality (how well it was performed) of each inference type. Following this framework, we first conduct a meticulous manual evaluation of six prominent LLMs, and then scale the evaluation using LLM-as-a-Judge. Our analysis reveals several insights, including: (1) a significant positive correlation exists between the quality of the reasoning trace and the correctness of the final prediction; (2) models often omit necessary reasoning steps, leading to incomplete justifications; and (3) guiding the LLM towards a systematic reasoning trace based on our framework often improves the quality of both the reasoning trace and the overall entailment classification, especially for "non-reasoning" models. Overall, our work provides a more diagnostic and nuanced approach to understanding and evaluating LLM reasoning traces, demonstrated specifically for NLI reasoning in fact verification, and offers insights for future improvements in reasoning quality and its downstream usage.
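As a rough illustration of how the abstract's existence and quality metrics over the four inference-step types could be operationalized, the minimal Python sketch below scores a single reasoning trace. All names, data structures, and score scales here are illustrative assumptions and are not taken from the paper's actual implementation.

```python
# Hypothetical sketch: scoring one reasoning trace over the four inference
# step types (decomposition, attribution, entailment, aggregation).
# Names and structure are assumptions for illustration only.
from dataclasses import dataclass
from typing import Dict

STEP_TYPES = ("decomposition", "attribution", "entailment", "aggregation")

@dataclass
class StepScore:
    existence: bool   # was this step performed at all in the trace?
    quality: float    # how well it was performed (assumed 0-1 scale)

def score_trace(step_scores: Dict[str, StepScore]) -> Dict[str, float]:
    """Aggregate per-step judgments (from manual annotation or an
    LLM-as-a-Judge) into two fine-grained metrics: coverage of the
    expected steps, and mean quality of the steps actually performed."""
    performed = [s for s in step_scores.values() if s.existence]
    coverage = len(performed) / len(STEP_TYPES)
    mean_quality = (sum(s.quality for s in performed) / len(performed)
                    if performed else 0.0)
    return {"step_coverage": coverage, "mean_step_quality": mean_quality}

# Example: a trace that decomposed the claim and judged entailment per fact,
# but skipped explicit attribution and aggregation.
example = {
    "decomposition": StepScore(existence=True, quality=0.9),
    "attribution":   StepScore(existence=False, quality=0.0),
    "entailment":    StepScore(existence=True, quality=0.7),
    "aggregation":   StepScore(existence=False, quality=0.0),
}
print(score_trace(example))  # {'step_coverage': 0.5, 'mean_step_quality': 0.8}
```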
Primary Area: generative models
Submission Number: 19038