Keywords: in-context learning, autoregressive processes, generalization, non-i.i.d. learning theory
Abstract: In this paper, we derive generalization results for next-token risk minimization in autoregressive processes of unbounded order. Our starting point is to relate the empirical loss to the denoising loss, a step that requires no assumptions beyond those needed for fixed-order Markovian models. We then show that, under a mixing or rephrasability condition on the data-generating process and a stability assumption on the hypothesis class, the out-of-sample generalization error concentrates around the denoising error. These results characterize sample complexity in terms of the number of tokens rather than the number of i.i.d. sequences. As a primary application, we interpret in-context learning as a special case of autoregressive prediction and derive sample complexity bounds under similar conditions. Importantly, the generalization rates are determined by the properties of the individual in-context tasks, without requiring assumptions on the mixture process. This perspective suggests that in-context learning can exploit the task decomposition to learn efficiently.
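A minimal formalization of the objects mentioned in the abstract, under assumed notation not fixed by the paper: writing $h$ for a hypothesis mapping a prefix $x_{1:t-1}$ to a predictive distribution over the next token, and $\ell$ for a next-token loss (e.g., the log-loss), the empirical next-token risk over a single length-$T$ sequence and its out-of-sample counterpart could be taken as

\[
\widehat{R}_T(h) \;=\; \frac{1}{T}\sum_{t=1}^{T} \ell\bigl(h(x_{1:t-1}),\, x_t\bigr),
\qquad
R_T(h) \;=\; \mathbb{E}\bigl[\widehat{R}_T(h)\bigr],
\]

so that "sample complexity in terms of the number of tokens" refers to controlling $\lvert R_T(h) - \widehat{R}_T(h)\rvert$ as a function of $T$ rather than of the number of independent sequences.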
Submission Number: 40