Are Machines Better at Slow Thinking? Unveiling Human-Machine Inference Gaps in Entailment Verification

Published: 11 Mar 2024, Last Modified: 15 Mar 2024 · LLMAgents @ ICLR 2024 Poster · CC BY 4.0
Keywords: entailment, verification, CoT, rationales
TL;DR: We compare humans and LLMs on the entailment verification task and release a new open-source model with strong performance on it.
Abstract: Humans make numerous inferences during text comprehension to understand meaning. This paper aims to understand the similarities and differences between humans and state-of-the-art Large Language Models (LLMs) in their ability to judge valid inferences. To this end, we leverage a comprehensively curated entailment verification benchmark that includes datasets from three NLP domains (NLI, contextual QA, and rationales), containing multi-sentence premises and requiring different types of knowledge. Our findings reveal LLMs’ superiority in multi-hop reasoning across extended contexts that requires slow thinking, while humans excel at simple deductive reasoning. Using these insights, we introduce a fine-tuned Flan-T5 model that outperforms GPT-3.5 and rivals GPT-4, offering a superior open-source LLM for entailment verification. As a practical application, we showcase the efficacy of our fine-tuned model in enhancing the self-consistency of model-generated CoT rationales, resulting in an average 6% performance boost across three multiple-choice question-answering datasets.
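As a rough illustration of the self-consistency application described above, the sketch below scores whether each sampled chain-of-thought rationale entails its stated answer and votes only with the entailed chains. The checkpoint name, prompt template, threshold, and helper functions are assumptions for illustration, not the paper's released code or exact setup; the paper's fine-tuned Flan-T5 verifier could be dropped in the same way.

```python
# Sketch: entailment-filtered self-consistency for multiple-choice QA.
# Assumes a seq2seq verifier prompted for Yes/No entailment (placeholder model below).
from collections import Counter

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

MODEL_NAME = "google/flan-t5-large"  # placeholder, not the paper's released checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """P(entailed) from the relative logits of 'Yes' vs 'No' at the first decoding step."""
    prompt = (f"Premise: {premise}\nHypothesis: {hypothesis}\n"
              "Does the premise entail the hypothesis? Answer Yes or No.")
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    start = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits[[yes_id, no_id]], dim=0)[0].item()

def filtered_self_consistency(question, chains, threshold=0.5):
    """chains: (rationale, answer) pairs sampled from an LLM.
    Vote only with chains whose rationale entails the stated answer."""
    votes = Counter(
        ans for rat, ans in chains
        if entailment_prob(f"{question} {rat}", f"The answer is {ans}.") >= threshold
    )
    if not votes:  # fall back to plain majority voting if every chain is filtered out
        votes = Counter(ans for _, ans in chains)
    return votes.most_common(1)[0][0]
```

One design note: gating each chain on whether its rationale entails its answer, rather than voting over all chains, is one simple way a verifier can tighten self-consistency; weighting votes by the entailment score is an equally plausible variant.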
Submission Number: 102