Scaling Laws for Many-Shot In-Context Learning with Self-Generated Annotations

Published: 10 Jun 2025, Last Modified: 10 Jun 2025 · LCFM 2025 · CC BY 4.0
Keywords: in-context learning, long-context LM
Abstract: The high cost of obtaining high-quality annotated data for in-context learning (ICL) has motivated the use of self-generated annotations as a substitute for ground-truth labels. While such methods have shown promise in few-shot settings, their effectiveness in many-shot scenarios remains underexplored. To address this gap, we propose a simple baseline, Naive-SemiICL, which follows a three-step framework—annotation generation, demonstration selection, and in-context inference—and demonstrates clear scaling trends across both discriminative and generative tasks. Naive-SemiICL outperforms few-shot ICL at various ground-truth data budgets, notably surpassing 16-shot baselines by 9.94% across 16 tasks on GPT-4o-mini. We further introduce IterPSD, an annotation method that iteratively improves pseudo-annotation quality by augmenting its prompt with self-annotated examples. IterPSD yields an additional 6.8% gain on 5 classification tasks compared to Naive-SemiICL. Code is available at: https://anonymous.4open.science/r/semi-supervised-icl-FA07
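The three-step framework named in the abstract (annotation generation, demonstration selection, in-context inference) can be sketched as below. This is a minimal illustration, not the paper's implementation: the `llm` object, its `annotate`/`predict` methods, and confidence-based selection with a `top_k` cutoff are all hypothetical stand-ins for whatever model interface and selection criterion the authors actually use.

```python
def naive_semi_icl(seed_examples, unlabeled_inputs, query, llm, top_k=256):
    """Hedged sketch of a Naive-SemiICL-style pipeline.

    seed_examples:    small set of ground-truth (input, label) pairs
    unlabeled_inputs: inputs to pseudo-annotate
    llm:              hypothetical model wrapper with .annotate / .predict
    """
    # Step 1: annotation generation -- pseudo-label each unlabeled input,
    # prompting with the ground-truth examples as few-shot demonstrations.
    pseudo = []
    for x in unlabeled_inputs:
        label, conf = llm.annotate(x, demos=seed_examples)
        pseudo.append((x, label, conf))

    # Step 2: demonstration selection -- keep the most confident
    # pseudo-labeled examples (confidence filtering is one plausible
    # criterion, used here purely for illustration).
    pseudo.sort(key=lambda t: t[2], reverse=True)
    demos = list(seed_examples) + [(x, y) for x, y, _ in pseudo[:top_k]]

    # Step 3: in-context inference -- answer the query with a many-shot
    # prompt built from ground-truth plus selected pseudo-annotations.
    return llm.predict(query, demos=demos)
```

IterPSD, as described, differs in step 1: annotation proceeds iteratively, folding already self-annotated examples back into the annotation prompt to improve later pseudo-labels.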
Submission Number: 20