Track: long paper (up to 8 pages)
Keywords: Foundation Model, Prompt Tuning, Transformer, Universal Approximation, Memory Capacity, Computational Efficiency, Fine-Grained Complexity
TL;DR: We study the statistical and computational limits of prompt tuning in single-layer, single-head transformers, showing that it is a universal seq2seq approximator and supports nearly-linear-time efficient inference.
Abstract: We investigate the statistical and computational limits of prompt tuning for transformer-based foundation models.
Our key contributions show that prompt tuning on *single-head* transformers with only a *single* self-attention layer:
(i) is universal, and (ii) supports efficient (even nearly-linear time) algorithms under the Strong Exponential Time Hypothesis (SETH).
Statistically,
we prove that prompt tuning on these simplest possible transformers yields universal approximators for sequence-to-sequence Lipschitz functions.
In addition, we provide a lower bound, exponential in both $dL$ and $1/\epsilon$, on the number of soft-prompt tokens required for prompt tuning to memorize any dataset with 1-layer, 1-head transformers.
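To make the setting concrete, the following is a minimal sketch (in PyTorch) of the object these statistical results concern: a single self-attention layer with a single head whose pretrained weights are frozen, while only the prepended soft-prompt tokens are trained. The class names, dimensions, and initialization below are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch (illustrative, not the paper's code) of prompt tuning on a
# single-layer, single-head transformer: pretrained weights are frozen and only
# the prepended soft-prompt tokens are trained.
import torch
import torch.nn as nn


class SingleHeadAttention(nn.Module):
    """One self-attention layer with a single head; weights are kept frozen."""

    def __init__(self, d_model: int):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        for p in self.parameters():
            p.requires_grad_(False)  # the pretrained model is not updated

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        Q, K, V = self.W_Q(x), self.W_K(x), self.W_V(x)
        attn = torch.softmax(Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V


class PromptTuned(nn.Module):
    """Prepends trainable soft-prompt tokens to every input sequence."""

    def __init__(self, d_model: int, num_prompt_tokens: int):
        super().__init__()
        self.layer = SingleHeadAttention(d_model)
        # The soft prompt is the only trainable parameter.
        self.prompt = nn.Parameter(0.02 * torch.randn(num_prompt_tokens, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); broadcast the prompt across the batch.
        prompt = self.prompt.unsqueeze(0).expand(x.shape[0], -1, -1)
        return self.layer(torch.cat([prompt, x], dim=1))


model = PromptTuned(d_model=16, num_prompt_tokens=4)
out = model(torch.randn(2, 10, 16))   # output covers the 4 prompt + 10 input positions
print(out.shape)                      # torch.Size([2, 14, 16])
```

During tuning, only `model.prompt` receives gradients; the universality and memorization statements above concern which sequence-to-sequence functions such prompt-only updates can realize.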
Computationally, we identify a phase transition in the efficiency of prompt tuning, determined by the norm of the *soft-prompt-induced* keys and queries, and provide an upper bound criterion.
Beyond this criterion, no sub-quadratic (efficient) algorithm for prompt tuning exists under SETH.
Within this criterion,
we showcase our theory by proving the existence of almost-linear-time prompt tuning inference algorithms.
These fundamental limits provide practitioners with important necessary conditions for designing expressive and efficient prompt tuning methods.
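As a hedged illustration of the computational criterion, the sketch below checks the quantity that governs the phase transition: the largest norm among the soft-prompt-induced keys and queries. The threshold `B` and all tensor names are placeholders for illustration; the paper supplies the precise criterion.

```python
# A hedged sketch of checking the efficiency criterion: the phase transition is
# governed by how large the soft-prompt-induced keys and queries are. The
# threshold `B` below is a placeholder, not the paper's exact bound.
import torch

d_model, num_prompt_tokens = 16, 4
W_Q = torch.randn(d_model, d_model) / d_model ** 0.5   # frozen pretrained weights
W_K = torch.randn(d_model, d_model) / d_model ** 0.5
prompt = 0.02 * torch.randn(num_prompt_tokens, d_model)  # tuned soft-prompt tokens

# Soft-prompt-induced queries and keys, and the largest of their norms.
induced_q = prompt @ W_Q
induced_k = prompt @ W_K
norm = torch.cat([induced_q, induced_k]).norm(dim=-1).max().item()

B = 1.0  # placeholder threshold; the paper gives the precise criterion
if norm <= B:
    print(f"norm {norm:.3f} <= {B}: almost-linear-time inference is attainable")
else:
    print(f"norm {norm:.3f} > {B}: no sub-quadratic algorithm exists under SETH")
```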
Submission Number: 78