KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

Marah I Abdin; Suriya Gunasekar; Varun Chandrasekaran; Jerry Li; Mert Yuksekgonul; Rahee Ghosh Peshawaria; Ranjita Naik; Besmira Nushi

KITAB: Evaluating LLMs on Constraint Satisfaction for Information Retrieval

Marah I Abdin, Suriya Gunasekar, Varun Chandrasekaran, Jerry Li, Mert Yuksekgonul, Rahee Ghosh Peshawaria, Ranjita Naik, Besmira Nushi

Published: 16 Jan 2024, Last Modified: 14 Mar 2024ICLR 2024 posterEveryoneRevisionsBibTeX

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.

Keywords: language model evaluation, benchmark, constraint satisfaction, information retrieval, retrieval-augmented architecture

Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.

TL;DR: State-of-the-art LLMs struggle with answering constraint satisfaction queries for finding information. We contribute a dataset and an evaluation study to support progress in this emerging ability.

Abstract: We study the ability of state-of-the art models to answer constraint satisfaction queries for information retrieval (e.g., “a list of ice cream shops in San Diego”). In the past, such queries were considered as tasks that could only be solved via web-search or knowledge bases. More recently, large language models (LLMs) have demonstrated initial emergent abilities in this task. However, many current retrieval benchmarks are either saturated or do not measure constraint satisfaction. Motivated by rising concerns around factual incorrectness and hallucinations of LLMs, we present KITAB, a new dataset for measuring constraint satisfaction abilities of language models. KITAB consists of book-related data across more than 600 authors and 13,000 queries, and also offers an associated dynamic data collection and constraint verification approach for acquiring similar test data for other authors. Our extended experiments on GPT4 and GPT3.5 characterize and decouple common failure modes across dimensions such as information popularity, constraint types, and context availability. Results show that in the absence of context, models exhibit severe limitations as measured by irrelevant information, factual errors, and incompleteness, many of which exacerbate as information popularity decreases. While context availability mitigates irrelevant information, it is not helpful for satisfying constraints, identifying fundamental barriers to constraint satisfaction. We open source our contributions to foster further research on improving constraint satisfaction abilities of future models.

Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.

No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.

Primary Area: datasets and benchmarks

Submission Number: 1731

Loading