Transformers are Provably Optimal In-context Estimators for Wireless Communications
TL;DR: Transformers can provably learn to estimate in-context.
Abstract: Pre-trained transformers can adapt to new tasks through in-context learning (ICL), using only a small number of prompt examples and no explicit model optimization.
The canonical communication problem of estimating transmitted symbols from received observations can be modeled as an in-context learning problem: received observations are a noisy function of the transmitted symbols, and this function is characterized by an unknown parameter whose statistics depend on an unknown latent context. This problem, which we term in-context estimation (ICE), has significantly greater complexity than the extensively studied linear regression problem.
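As a rough schematic of this setup (the notation here is assumed for illustration and is not taken from the paper), the ICE problem can be written as

$$
y_k = f_{\theta}(x_k) + n_k, \qquad \theta \sim p(\theta \mid c),
$$

where $c$ is the unknown latent context, $\theta$ the unknown channel parameter, $x_k$ the transmitted symbols, and $n_k$ noise. Given a prompt of pairs $(y_1, x_1), \dots, (y_N, x_N)$ generated under the same $\theta$, the estimator must recover $x_{N+1}$ from a new observation $y_{N+1}$, ideally approaching the conditional-mean estimate $\mathbb{E}\!\left[x_{N+1} \mid y_{N+1}, (y_k, x_k)_{k \le N}\right]$, which is a non-linear function of the prompt because the context is never revealed.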
The optimal solution to the ICE problem is a non-linear function of the underlying context. In this paper, we prove that, for a subclass of such problems, a single-layer softmax attention transformer (SAT) computes the optimal solution of the above estimation problem in the limit of large prompt length. We also prove that the optimal configuration of such a transformer is indeed the minimizer of the corresponding training loss. Further, we empirically demonstrate that multi-layer transformers efficiently solve broader in-context estimation problems. Through extensive simulations, we show that transformers solving ICE problems significantly outperform standard approaches. Moreover, with just a few context examples, the transformer matches the performance of an estimator with perfect knowledge of the latent context.
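For intuition only, the sketch below shows how a single softmax-attention readout can act as an in-context symbol estimator on a toy scalar fading channel. This is not the authors' code or model; the channel, the two-context prior, and all function names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sample_task(n_context, snr_db=10.0):
    """Toy ICE task: y = h * x + n, with the channel gain h drawn from one of
    two latent contexts (unknown to the estimator)."""
    context = rng.integers(2)
    h = rng.normal(loc=[1.0, -1.0][context], scale=0.3)
    sigma = 10 ** (-snr_db / 20)
    x = rng.choice([-1.0, 1.0], size=n_context + 1)          # BPSK symbols
    y = h * x + sigma * rng.normal(size=n_context + 1)       # noisy observations
    return x, y

def attention_estimate(y_ctx, x_ctx, y_query, temp=0.1):
    """Single softmax-attention readout: attend over context observations
    (keys) with the new observation as the query, then output a weighted
    combination of the context symbols (values)."""
    scores = -(y_ctx - y_query) ** 2 / temp
    w = softmax(scores)
    return np.sign(w @ x_ctx)                                # hard symbol decision

n_context, trials, errors = 32, 2000, 0
for _ in range(trials):
    x, y = sample_task(n_context)
    errors += attention_estimate(y[:-1], x[:-1], y[-1]) != x[-1]
print(f"symbol error rate over {trials} toy tasks: {errors / trials:.3f}")
```

The point of the sketch is that the attention weights adapt to whichever channel realization generated the prompt, so the same fixed estimator works across latent contexts without ever being told the channel explicitly, mirroring the in-context adaptation studied in the paper.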
Submission Number: 529