Abstract: The high cost of obtaining high-quality annotated data for in-context learning (ICL) has motivated the development of methods that use self-generated annotations in place of ground-truth labels. While these approaches have shown promising results in few-shot settings, they generally do not scale to many-shot scenarios. In this work, we investigate ICL with self-generated examples through a framework analogous to traditional semi-supervised learning, consisting of annotation generation, demonstration selection, and in-context inference. Within this framework, we propose a simple baseline that outperforms ground-truth ICL in zero-shot, few-shot, and many-shot settings. Remarkably, we observe a \textit{scaling law} for this simple baseline, with optimal performance achieved at over 1,000 demonstrations. To fully exploit this many-shot capability of semi-supervised ICL, we introduce IterPSD, an iterative annotation approach that incorporates iterative refinement and curriculum pseudo-labeling concepts from semi-supervised learning, achieving up to 6.8% additional gains on classification tasks.
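The sketch below illustrates the three-stage pipeline named in the abstract (annotation generation, demonstration selection, in-context inference) plus an IterPSD-style iterative loop. It is a minimal illustration under stated assumptions, not the paper's exact method: `llm_complete` is a hypothetical text-completion call, and the prompt format, first-k selection rule, and chunk/round sizes are placeholder choices.

```python
# Minimal sketch of semi-supervised ICL, assuming a hypothetical
# `llm_complete(prompt) -> str` LLM call. Prompt formats and the
# selection heuristic are illustrative, not the paper's method.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, pseudo-label)

def generate_annotations(inputs: List[str],
                         llm_complete: Callable[[str], str]) -> List[Example]:
    """Stage 1: pseudo-label unlabeled inputs with zero-shot prompts."""
    return [(x, llm_complete(f"Input: {x}\nLabel:").strip()) for x in inputs]

def select_demonstrations(pool: List[Example], k: int) -> List[Example]:
    """Stage 2: pick k demonstrations (here simply the first k; a real
    selector might use confidence or diversity criteria)."""
    return pool[:k]

def in_context_inference(query: str, demos: List[Example],
                         llm_complete: Callable[[str], str]) -> str:
    """Stage 3: many-shot inference with self-generated demonstrations."""
    prompt = "".join(f"Input: {x}\nLabel: {y}\n\n" for x, y in demos)
    prompt += f"Input: {query}\nLabel:"
    return llm_complete(prompt).strip()

def iterative_pseudo_labeling(unlabeled: List[str],
                              llm_complete: Callable[[str], str],
                              rounds: int = 3, chunk: int = 100) -> List[Example]:
    """IterPSD-style loop (illustrative): annotate the pool chunk by chunk,
    feeding previously pseudo-labeled examples back as demonstrations, so
    later chunks are labeled with progressively more context."""
    labeled: List[Example] = []
    for _ in range(rounds):
        batch, unlabeled = unlabeled[:chunk], unlabeled[chunk:]
        demos = select_demonstrations(labeled, k=min(len(labeled), 32))
        labeled += [(x, in_context_inference(x, demos, llm_complete))
                    for x in batch]
    return labeled
```

The chunked loop mirrors the curriculum idea from the abstract: early, easier annotations become demonstrations for later ones, which is what lets the pipeline grow toward the many-shot regime (1,000+ demonstrations) without any ground-truth labels.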
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Test-Time Computing, Large Language Models, Unlabeled Data, NLP in resource-constrained settings, data-efficient training, prompting, scaling, few-shot generation
Contribution Types: NLP engineering experiment, Approaches to low-resource settings
Languages Studied: English, Bemba, Tuvan, Venetian, Sardinian, Faroese
Submission Number: 4983