In Search of the Long-Tail: Systematic Generation of Long-Tail Knowledge via Logical Rule Induced Search

23 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: long-tail, evaluation, generation, large language model, symbolic rule, reasoning
TL;DR: We propose a framework that systematically generates long-tail knowledge statements for evaluating large language models' performance on the long-tail distribution.
Abstract: Since large language models (LLMs) have approached human-level performance on many tasks, it has become increasingly hard for researchers to find tasks that still challenge the models. Failure cases usually come from the long-tail distribution -- data to which an oracle language model would assign a probability on the lower end of its distribution. Systematically finding evaluation data in the long-tail distribution is important, but current methodologies such as prompt engineering or crowdsourcing are insufficient, because coming up with long-tail examples is also hard for humans due to our cognitive biases. In this paper, we propose a Logic-Induced-Knowledge-Search (LINK) framework for systematically generating long-tail knowledge statements. Grounded by a symbolic logic rule, we search for long-tail values for each variable of the rule by first prompting a large language model, then verifying the correctness of the values with a critic, and lastly pushing toward the long-tail distribution with a reranker. Using this framework we construct a dataset, Logic-Induced-Long-Tail (LINT [https://doi.org/10.5281/zenodo.8384878]), consisting of 200 symbolic rules and 40K knowledge statements spanning four different domains. Human annotation finds that 89% of the statements in LINT are factually correct. In contrast, ChatGPT and GPT4 struggle with directly generating long-tail statements under the guidance of logic rules, with only 61% and 79% of their statements being correct, respectively. Moreover, their "long-tail" generations in fact fall into the higher-likelihood range, and thus are not truly long-tail. Our findings suggest that LINK is effective for generating data in the long-tail distribution while enforcing quality. To demonstrate how the community can use LINT to systematically evaluate LLMs' capabilities on the long-tail distribution, we challenge the models with a simple entailment classification task using samples from LINT. We find that ChatGPT and GPT4 performance drops by 2% and 4%, respectively, when reasoning over long-tail knowledge statements compared to statements from the head of the distribution. We hope our work can inspire future research on generating evaluation data in the long-tail distribution.
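To make the described search loop concrete, below is a minimal sketch of the LINK-style pipeline (LLM proposes variable groundings, a critic filters them, a reranker keeps the lowest-likelihood ones). All helper names (propose_values, critic_accepts, likelihood) and the bracketed rule format are illustrative assumptions, not the authors' released code; the stubs should be replaced with real model calls.

```python
# Sketch of a LINK-style search loop: propose -> verify -> rerank toward the long tail.
# All helpers below are hypothetical stubs, not the paper's actual implementation.

def propose_values(statement: str, variable: str) -> list[str]:
    """Prompt an LLM for candidate groundings of one rule variable (stub)."""
    return ["value_1", "value_2", "value_3"]  # replace with an actual LLM call

def critic_accepts(statement: str, variable: str, value: str) -> bool:
    """Ask a critic model whether this grounding keeps the statement factually correct (stub)."""
    return True  # replace with an actual critic-model call

def likelihood(statement: str) -> float:
    """Score the statement under a language model; lower means more long-tail (stub)."""
    return float(len(statement))  # replace with e.g. mean token log-probability

def link_search(rule: str, variables: list[str], keep: int = 5) -> list[str]:
    """Ground each variable of a symbolic rule, keeping verified, low-likelihood values."""
    statements = [rule]
    for var in variables:
        grounded = []
        for stmt in statements:
            candidates = propose_values(stmt, var)                         # step 1: LLM search
            verified = [v for v in candidates if critic_accepts(stmt, var, v)]  # step 2: critic filter
            verified.sort(key=lambda v: likelihood(stmt.replace(var, v)))  # step 3: rerank (long tail first)
            grounded.extend(stmt.replace(var, v) for v in verified[:keep])
        statements = grounded
    return statements

if __name__ == "__main__":
    # Toy usage with a made-up rule template; variables are grounded left to right.
    rule = "If a person lives in [PLACE], they rarely see [WEATHER]."
    print(link_search(rule, ["[PLACE]", "[WEATHER]"], keep=2))
```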
Primary Area: datasets and benchmarks
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6959