Benchmarking the Ability of Large Language Models to Reason about Event Sequences

ACL ARR 2024 June Submission4719 Authors

16 Jun 2024 (modified: 02 Jul 2024) · CC BY 4.0
Abstract: The ability to reason about events and their temporal relations is a key aspect of Natural Language Understanding. In this paper, we investigate the ability of Large Language Models (LLMs) to resolve temporal references over longer event sequences. Given that events rarely occur in isolation, it is crucial to determine the extent to which LLMs can reason about longer sequences of events. Towards this goal, we introduce a novel synthetic benchmark dataset comprising 2,200 questions that tests the ability of LLMs to reason about events, using a Question Answering task as a proxy. We compare the performance of four state-of-the-art LLMs on the benchmark, analyzing their performance as a function of the length of the event sequence as well as the explicitness of the temporal reference. Our results show that, while the benchmarked LLMs can successfully answer questions over event sequences with a handful of events and explicit temporal references, performance clearly deteriorates as the event sequence grows longer and as the temporal references become less explicit.
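To make the task concrete, the following is a minimal, hypothetical sketch of how a synthetic event-sequence QA item with an explicit temporal reference could be generated; it is an illustration only, not the authors' actual dataset-construction procedure, and all event templates, names, and parameters are assumptions.

```python
# Hypothetical sketch (not the authors' generator): builds one synthetic
# (context, question, answer) item over an ordered event sequence with
# explicit dates, asking which event immediately followed a given one.
import random
from datetime import date, timedelta

EVENTS = ["signed the contract", "met the supplier", "shipped the order",
          "received the invoice", "closed the account"]

def make_item(num_events: int, seed: int = 0):
    """Return a (context, question, answer) triple for a random event sequence."""
    rng = random.Random(seed)
    start = date(2023, 1, 1)
    # Sample an ordered sequence of distinct events and attach explicit dates.
    chosen = rng.sample(EVENTS, k=num_events)
    dated = [(start + timedelta(days=7 * i), ev) for i, ev in enumerate(chosen)]
    context = " ".join(f"On {d.isoformat()}, Alice {ev}." for d, ev in dated)
    # Ask which event immediately followed a randomly chosen anchor event.
    idx = rng.randrange(num_events - 1)
    question = f"What did Alice do immediately after she {dated[idx][1]}?"
    answer = dated[idx + 1][1]
    return context, question, answer

if __name__ == "__main__":
    ctx, q, a = make_item(num_events=4, seed=42)
    print(ctx)
    print(q)
    print("->", a)
```

Varying `num_events` and replacing the explicit dates with relative expressions (e.g., "a week later") would correspond to the two difficulty axes the abstract mentions: sequence length and explicitness of the temporal reference.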
Paper Type: Long
Research Area: Question Answering
Research Area Keywords: logical reasoning, reasoning, question generation, interpretability
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 4719