Why does in-context learning fail sometimes? Evaluating in-context learning on open and closed questions.

ACL ARR 2024 December Submission 1401 Authors

16 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract:

This study investigates the impact of context relevance on the performance of in-context learning. To quantify that impact, we created a novel dataset of open-form questions, each paired with contexts of varying relevance. We then graded the generated responses manually along several quality dimensions, using six-fold grader redundancy to minimize the influence of individual graders. We show that, counterintuitively, less relevant contexts can in many cases perform as well as, or even better than, more relevant ones. By controlling for task novelty and question difficulty, we demonstrate that this phenomenon is particularly pronounced for open-form questions and for questions with high perceived novelty or difficulty. This result reveals a fundamental difference in how large language models process closed-form and open-form questions. Furthermore, our findings raise critical questions about optimal context selection for large language models, particularly in open-response scenarios, which is a central concern when building Retrieval-Augmented Generation (RAG) systems.
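As a rough illustration of the experimental setup the abstract describes, the sketch below shows how questions paired with contexts of different relevance tiers might be evaluated and how six redundant manual grades could be pooled per response. The class name, relevance tier labels, and the `generate_response` placeholder are hypothetical stand-ins, not code or terminology from the paper or its released resources.

```python
import statistics
from dataclasses import dataclass

@dataclass
class QuestionContextPair:
    question: str    # an open-form question from the dataset
    context: str     # a passage paired with the question
    relevance: str   # assumed relevance tier, e.g. "high", "medium", "low"

def generate_response(pair: QuestionContextPair) -> str:
    """Hypothetical stand-in for querying the LLM under evaluation.

    In practice, generation and manual grading happen in separate phases;
    this function is shown only to mark where the model is queried.
    """
    raise NotImplementedError("replace with a call to the model being studied")

def pooled_grade(grades: list[float]) -> float:
    """Average six independent manual grades (six-fold redundancy)."""
    assert len(grades) == 6, "six graders per response, as in the paper"
    return statistics.mean(grades)

def mean_score_by_relevance(
    pairs: list[QuestionContextPair],
    grades_per_pair: list[list[float]],
) -> dict[str, float]:
    """Compare pooled human grades across relevance tiers for analysis."""
    by_tier: dict[str, list[float]] = {}
    for pair, grades in zip(pairs, grades_per_pair):
        by_tier.setdefault(pair.relevance, []).append(pooled_grade(grades))
    return {tier: statistics.mean(scores) for tier, scores in by_tier.items()}
```

Comparing the per-tier averages returned by `mean_score_by_relevance` is one simple way to surface the effect reported in the abstract, namely that lower-relevance tiers sometimes score on par with or above higher-relevance ones.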

Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: human evaluation, few-shot generation, analysis, text-to-text generation, retrieval-augmented generation
Contribution Types: NLP engineering experiment, Reproduction study, Data resources
Languages Studied: English
Submission Number: 1401