Recipe-MPR: A Test Collection for Evaluating Multi-aspect Preference-based Natural Language Retrieval
Abstract: The rise of interactive recommendation assistants has led to a novel domain of natural language (NL) recommendation that would benefit from improved multi-aspect reasoning to retrieve relevant items based on NL statements of preference. Such preference statements often involve multiple aspects, e.g., "I would like meat lasagna but I'm watching my weight". Unfortunately, progress in this domain is slowed by the lack of annotated data. To address this gap, we curate a novel dataset that captures logical reasoning over multi-aspect, NL preference-based queries and a set of multiple-choice, multi-aspect item descriptions. We focus on the recipe domain, in which multi-aspect preferences are often encountered due to the complexity of the human diet. The goal of publishing our dataset is to provide a benchmark for joint progress in three key areas: 1) structured, multi-aspect NL reasoning with a variety of properties (e.g., level of specificity, presence of negation, and the need for commonsense, analogical, and/or temporal inference), 2) the ability of recommender systems to respond to NL preference utterances, and 3) explainable NL recommendation facilitated by aspect extraction and reasoning. We perform experiments using a variety of methods (sparse and dense retrieval, zero- and few-shot reasoning with large language models) in two settings: a monolithic setting that uses the full query and an aspect-based setting that isolates individual query aspects and aggregates the results. GPT-3 achieves much stronger performance than the other methods, with 73% zero-shot accuracy and 83% few-shot accuracy in the monolithic setting. Aspect-based GPT-3, which facilitates structured explanations, also shows promise, with 68% zero-shot accuracy. These results establish baselines for future research into explainable recommendations via multi-aspect preference-based NL reasoning.
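To make the aspect-based setting described above concrete, the following is a minimal, illustrative sketch (not the authors' implementation) of scoring each query aspect against every candidate item independently and aggregating the per-aspect scores to rank the candidates. The `token_overlap_score` function is a hypothetical stand-in for a real scorer such as a dense retriever or an LLM relevance judgment.

```python
# Illustrative sketch of aspect-based ranking over multiple-choice items.
# Each query aspect is scored against every candidate independently, then
# per-aspect scores are aggregated (here, by mean) to rank the candidates.

from typing import Callable, Dict, List


def token_overlap_score(aspect: str, item: str) -> float:
    """Hypothetical stand-in scorer: fraction of aspect tokens found in the item."""
    aspect_tokens = set(aspect.lower().split())
    item_tokens = set(item.lower().split())
    return len(aspect_tokens & item_tokens) / max(len(aspect_tokens), 1)


def rank_items(
    aspects: List[str],
    items: Dict[str, str],
    scorer: Callable[[str, str], float] = token_overlap_score,
) -> List[str]:
    """Aggregate per-aspect scores (mean over aspects) and return item ids, best first."""
    aggregated = {
        item_id: sum(scorer(a, text) for a in aspects) / len(aspects)
        for item_id, text in items.items()
    }
    return sorted(aggregated, key=aggregated.get, reverse=True)


if __name__ == "__main__":
    # Toy example in the spirit of the multi-aspect preference from the abstract.
    aspects = ["meat lasagna", "low calorie"]
    items = {
        "A": "classic beef lasagna with low calorie ricotta",
        "B": "vegetarian lasagna with extra cheese",
        "C": "rich meat lasagna with bechamel and sausage",
    }
    print(rank_items(aspects, items))  # -> ['A', 'C', 'B']
```

A monolithic variant would instead pass the full query string to the scorer once per item; the aggregation step is what enables per-aspect, structured explanations of the kind the dataset is meant to support.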