IntentionQA: A Benchmark for Evaluating Purchase Intention Comprehension Abilities of Large Language Models in E-commerce

ACL ARR 2024 April Submission 703 Authors

16 Apr 2024 (modified: 08 Jun 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: Enhancing Large Language Models' (LLMs) ability to understand purchase intentions in E-commerce scenarios is crucial for their effective assistance in various downstream tasks. However, previous approaches that distill intentions from LLMs often fail to generate meaningful, human-centric intentions applicable in real-world E-commerce contexts. This raises concerns about whether LLMs truly comprehend and utilize purchase intentions. In this paper, we present IntentionQA, a double-task multiple-choice question answering benchmark that evaluates LLMs' comprehension of purchase intentions in E-commerce. Specifically, LLMs are tasked to infer intentions from purchased products and to utilize those intentions to predict additional purchases. IntentionQA consists of 4,375 carefully curated problems across three difficulty levels, constructed with an automated pipeline to ensure scalability on large E-commerce platforms. Human evaluations demonstrate the high quality and low false-negative rate of our benchmark. Extensive experiments across 19 language models show that they still struggle in certain scenarios, such as accurately understanding products and intentions and jointly reasoning over both, where they fall far behind human performance.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, automatic creation and evaluation of language resources, NLP datasets, automatic evaluation of datasets, evaluation
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 703