\section{Discussion}\label{sec:discussion}

\paragraph{Attention Formulation.} 
In this paper, the attention formulation in Definition~\ref{def:forward_comp} matches the standard softmax attention of~\cite{vsp+17} up to notation. Specifically, recalling Definition~\ref{def:forward_comp}, we compute the query-key attention matrix as
$D^{-1} \underbrace{\exp(X_{\ell} Q K^\top X_{\ell}^\top)}_{:=A}$,
where $D^{-1} A$ recovers the computation $\mathrm{Softmax}(\frac{\wt{Q} \wt{K}^\top}{\sqrt{d}})$ (with $\wt{Q} := X_{\ell} W_Q$, $\wt{K} := X_{\ell} W_K$) in~\cite{vsp+17}, once the constant factor $1/\sqrt{d}$ is absorbed into $Q$ (or $K$).

The differences are purely notational: we write $Q$ and $K$ for $W_Q$ and $W_K$, and we write $A$ for the unnormalized (numerator) part of the softmax computation, while $D^{-1}$ normalizes each row. Consequently, our theoretical results apply directly to the attention computation used in Transformer-based LLMs.
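
To make the correspondence concrete, the following entrywise computation may help. It is a sketch under two assumptions that are not restated in this section: that $D := \mathrm{diag}(A \mathbf{1}_n)$ is the diagonal matrix of row sums of $A$ (we defer to Definition~\ref{def:forward_comp} for the precise definition), and that the factor $1/\sqrt{d}$ is absorbed into the weights via $Q K^\top = W_Q W_K^\top / \sqrt{d}$. Writing $x_i^\top$ for the $i$-th row of $X_\ell$ and $n$ for its number of rows (both shorthand introduced here), we get
\begin{align*}
    (D^{-1} A)_{i,j}
    = \frac{A_{i,j}}{\sum_{k=1}^{n} A_{i,k}}
    = \frac{\exp(x_i^\top Q K^\top x_j)}{\sum_{k=1}^{n} \exp(x_i^\top Q K^\top x_k)}
    = \frac{\exp(\langle \wt{q}_i, \wt{k}_j \rangle / \sqrt{d})}{\sum_{k=1}^{n} \exp(\langle \wt{q}_i, \wt{k}_k \rangle / \sqrt{d})},
\end{align*}
where $\wt{q}_i := W_Q^\top x_i$ and $\wt{k}_j := W_K^\top x_j$ are the $i$-th row of $\wt{Q}$ and the $j$-th row of $\wt{K}$, written as column vectors. The right-hand side is the $(i,j)$ entry of the row-wise softmax of $\wt{Q} \wt{K}^\top / \sqrt{d}$.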

\paragraph{Generalization to Multi-Layer Attention.} 
Our main result in Theorem~\ref{thm:main_informal} extends to the multi-layer case. To see this, recall the recursive attention computation in Definition~\ref{def:forward_comp}:
\begin{align*}
%$
     X_{\ell+1} \gets D^{-1} \exp(X_{\ell} Q K^\top X_{\ell}^\top) X_{\ell} V,
%$
\end{align*}
where each layer computes its output $X_{\ell+1}$ from its input $X_{\ell}$ and the weight matrices.

In this paper, our result states that, given an arbitrary $X_\ell$ and treating the weights $QK^\top$ and $V$ as variables, we can output a good approximation of $X_{\ell+1}$ (see Definition 1.2). In contrast, \cite{dsxy23} treat the input $X_\ell$ as a variable and study the training. Since our formulation and algorithm treat $X_\ell$ as an input and assume nothing specific about its origin, and since our result can be combined directly with the attention training in~\cite{dsxy23}, our results apply to any layer in the network. Therefore, our method extends naturally to multi-layer attention by applying it iteratively at each layer.
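
As an illustration, the layer-wise application can be written out as follows. The layer-indexed weights $Q_\ell, K_\ell, V_\ell$, the normalizer $D_\ell$, and the depth $L$ are notation introduced here for the sketch and are not part of Definition~\ref{def:forward_comp}; we also assume that $D_\ell$ is the diagonal matrix of row sums of $A_\ell$. Starting from the network input $X_0$, for $\ell = 0, 1, \dots, L-1$:
\begin{align*}
    A_\ell := \exp(X_\ell Q_\ell K_\ell^\top X_\ell^\top), \qquad
    D_\ell := \mathrm{diag}(A_\ell \mathbf{1}_n), \qquad
    X_{\ell+1} \gets D_\ell^{-1} A_\ell X_\ell V_\ell.
\end{align*}
At step $\ell$, the single-layer result is invoked with $X_\ell$ as the given input and $(Q_\ell K_\ell^\top, V_\ell)$ as the variables. This sketch only records how the layers compose; it does not track how per-layer approximation errors accumulate across layers.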
 
\paragraph{Justification of Assumptions.} 
In this work, our goal is to design an efficient algorithm that applies to a broad range of modern Transformer architectures. Consequently, our method avoids restrictive assumptions and requires only a good initialization point $(x_0, y_0)$ (see Definition~\ref{def:assumptions}) and the regularization term
$
    \| (W \otimes I) (A_1 \otimes A_2)x \|_2^2 + \| W A_3 y \|_F^2
$
in Definition~\ref{def:reg_term}.

Both assumptions can be easily satisfied in practice. Specifically, the first can be met by spending additional effort on selecting a suitable initialization point, while the second is standard practice in attention optimization~\cite{gsy23_hyper,llr23} and is widely accepted in the broader field of optimization. These assumptions are also weaker than those in previous works: we do not rely on conditions such as $d = O(\log n)$, $d = o(\log^2 n)$, or bounded entries as in~\cite{as23,zhdk23}, nor do we oversimplify the problem as in~\cite{syz23,gsy23_hyper,dls23}.