\section{Related Work}\label{sec:related_work}


\paragraph{Attention}

Transformer models, proposed by \cite{vsp+17}, revolutionized attention computation with their self-attention mechanism. This allowed for parallel processing of input sequences and captured long-range dependencies more effectively than previous recurrent architectures.
Since then, a substantial body of work has studied attention computation \citep{dms23, as23, zhdk23, clp+21, lsz23, bsz23, kkl20}. Notably, \cite{zhdk23, clp+21, kkl20} employ Locality Sensitive Hashing (LSH) techniques to approximate attention mechanisms. In particular, \cite{zhdk23} introduces $\mathsf{KDEformer}$, an efficient algorithm for approximating dot-product attention that provides provable spectral norm bounds and outperforms other attention approximations on various pre-trained models. Current research also explores both static and dynamic approaches to calculating attention, as evidenced by the works of \cite{bsz23} and \cite{as23}. Furthermore, \cite{lsz23} studies regularized hyperbolic regression problems, which involve functions such as $\exp$, $\sinh$, and $\cosh$. \cite{dms23} proposes randomized and deterministic algorithms for reducing the dimensionality of attention matrices in LLMs, achieving high accuracy while significantly reducing feature dimensions.

Additionally, numerous studies have analyzed attention theoretically from the perspectives of optimization and convergence \citep{llr23, gms23, szks21, zkv+20}. \cite{llr23} investigated how transformers acquire knowledge about word co-occurrence patterns. \cite{gms23} studied regression problems inspired by neural networks that employ exponential activation functions. \cite{szks21} analyzed why models occasionally prioritize significant words and explained how the attention mechanism evolves during training. \cite{zkv+20} demonstrated that a heavy-tailed noise distribution contributes to the poor performance of stochastic gradient descent (SGD) compared to adaptive methods.

\paragraph{Theoretical LLMs}

Numerous works focus on the theoretical aspects of LLMs. In \cite{ryw+19}, the syntactic representations of the attention matrix and the individual word embeddings are presented, together with a mathematical justification of the geometrical properties of these representations. \cite{hm19} introduces a structural probe that tests whether syntax trees are embedded in a linear transformation of a neural network's word representation space.

\cite{clmy21,llh+23,rsm+23,kmh+20} study the optimization of LLMs. \cite{clmy21} proposes a new algorithm called ZO-BCD, which has favorable overall query complexity and a smaller computational complexity in each iteration. \cite{llh+23} creates a simple and scalable second-order optimizer called Sophia. Sophia adapts to the curvature in different parts of the parameter space, which may be strongly heterogeneous for language modeling tasks, and its running-time bound does not depend on the condition number of the loss.

Other theoretical LLM papers study the knowledge and skills of LLMs. \cite{wwz+22} analyzes distinct ``skill'' neurons, which are regarded as robust indicators of downstream tasks when soft prompt-tuning \citep{ll21} is applied to language models. By analyzing BERT, \cite{ddh+21} finds a positive relationship between the activation of these neurons and the expression of their corresponding facts. Meanwhile, \cite{byks22} employs a fully unsupervised approach to extract latent knowledge from a language model's internal activations. In addition, \cite{hbkg23} and \cite{mbab22} show that language models localize knowledge in the feed-forward layers of pre-trained models. \cite{xqp+22} explores the feasibility of selecting a specific subset of layers for modification and determining the optimal location for integrating a classifier. \cite{lyb+22} demonstrates that large trained transformers exhibit sparsity in their feed-forward activations. Zeroth-order algorithms for training LLMs have also been analyzed \citep{mgn+23,dlms23,zhl+23}.

A notable line of research analyzes the theoretical limits of LLMs and discusses how to overcome them. Recent works have shown that a wide range of LLM architectures fall into a weaker class of logical circuits~\cite{llzm24,lls+25_fourier,cll+24_rope,cll+25_mamba}, which resonates with similar results for other neural architectures~\cite{lls+25_graph,kll+25_tc}, and that such limitations may be alleviated by chain-of-thought~\cite{llzm24} or positional encoding~\cite{ysw+25}. Another line of research shows that, without the help of chain-of-thought, Transformers may not be able to learn the support set of some simple Boolean functions under gradient descent~\cite{cll+25_looped_transformer,cssz25,hzs+25,ks25}. There are also works discussing the conditions that determine whether Transformer computation can be approximated efficiently, such as bounded entries~\cite{as23,as24,as24_tensor,as25,as25_rank}, statistical rates~\cite{hwl+24,hwsl24}, and model pruning~\cite{fa23,lls+25_pruning,gao2023fast}. These theoretical results extend to universal approximation~\cite{kzld22,cll+25_var,lhsl25,hlcw+25}, model tuning~\cite{hsk+24,hwg+24}, and in-context learning~\cite{wsh+24,whhz+25}.

\paragraph{LLMs Application and Evaluation}

Recently, there has been much interest in developing LLM-based systems for conversational AI and task-oriented dialogue, such as Google's Meena chatbot \cite{r20}, Microsoft 365 Copilot \cite{s23}, Adobe Firefly, Adobe Photoshop, the GPT series \cite{rns+18,rwc+19,bmr+20,cha22,o23}, and BERT \cite{dclt18}. Moreover, numerous fine-tuning methods, such as \cite{hsw+22,mwz24,cs25}, have been proposed to better adapt models to different conversational tasks.

LLM evaluation is also a popular research area. Within the field of NLP, LLMs are evaluated on natural language understanding \cite{bcl+23,lbl+22,lbr+23,cpk+23}, reasoning \cite{bhs+23,wqr+23,xlh+23}, natural language generation \cite{wlj+23,qzz+23,pd23,chbp23,cwj+23}, and multilingual tasks \cite{amc+23,aho+23,lnv+23,zag+23}. Robustness \cite{llgb23,whh+23,zpd+23}, ethics \cite{czl+23}, biases \cite{f23}, and trustworthiness \cite{hf23} are also important aspects. More specifically, the abilities of LLMs in social science \cite{dgg23,fr23,nkl+23}, mathematics \cite{as+23,dl23,wll+23,bce+23}, science \cite{cp23,ggl+23}, engineering \cite{bce+23,lxwz23,pmm+23,sm+23}, and medical applications \cite{clbj23,jgp+23} are evaluated. LLMs are also central to other modalities, including speech~\cite{clz+24,jws+24}, image~\cite{hja20,rbl+22,ccl+25_form}, and video~\cite{openai_sora,ytz+24,csy25_vlfm}. Evaluation of these multi-modal aspects of language models covers image generation~\cite{lpl+24,cgh+25}, video generation~\cite{ghh+25,ghs+25_physical,ghs+25_text}, and multi-modal reasoning~\cite{xww+25,tzg+25}.


\paragraph{Sketching}

Sketching is a powerful tool for accelerating machine learning algorithms and optimization processes. The fundamental idea of sketching is to compress a large input matrix into a significantly smaller sketched matrix that still preserves the main characteristics of the original. Algorithms can then work with the smaller matrix instead of the huge original, which leads to a substantial reduction in processing time. Many previous works have studied sketching, proposed sketching algorithms, and supported these algorithms with robust theoretical guarantees. For example, the Johnson-Lindenstrauss lemma of \cite{jl84} shows that points in a high-dimensional space can be projected to a lower-dimensional subspace while approximately preserving the pairwise distances between them. This mathematical property is the foundation of faster algorithms for tasks such as nearest neighbor search. In addition, as explained in \cite{ac06}, the Fast Johnson-Lindenstrauss Transform (FJLT) introduces a specific family of structured random projections that can be applied much faster than a dense random projection.
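
For concreteness, one standard formulation of the Johnson-Lindenstrauss lemma can be written as follows. The symbols $\epsilon$, $n$, $d$, $k$, $X$, and $f$ are notation introduced here for illustration only. For any $\epsilon \in (0,1)$ and any set $X \subset \mathbb{R}^d$ of $n$ points, there exists a linear map $f : \mathbb{R}^d \to \mathbb{R}^k$ with $k = O(\epsilon^{-2} \log n)$ such that, for all $x, y \in X$,
\[
(1-\epsilon) \, \| x - y \|_2^2 \;\leq\; \| f(x) - f(y) \|_2^2 \;\leq\; (1+\epsilon) \, \| x - y \|_2^2 .
\]
Note that the target dimension $k$ depends only on $\epsilon$ and the number of points $n$, not on the original dimension $d$.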

More recently, sketching has been applied to many numerical linear algebra and machine learning tasks, such as linear regression \citep{cw13,nn13}, online optimization problems \citep{rrs+21}, training neural networks \citep{szz21,xzz18,syz21,gqsw22,bpsw20}, reinforcement learning \citep{wzd+20,ssx23}, tensor decomposition \citep{s19,swz19_tensor,dsy23}, relational databases \citep{qjs+22}, low-rank approximation \citep{bw14,mrs20,mm13,als+18,swz17}, distributed problems \citep{bwz16,wz16}, weighted low-rank approximation \citep{rsw16,gsyz23,syyz23_weight}, CP decomposition \citep{ms21}, regression inspired by softmax \citep{lsz23,gsy23_hyper,ssz23,dls23}, and Kronecker product regression \citep{rsz22}.
