\section{Related Work}
{\bf Flow Matching.}
Flow Matching (FM) \citep{lhh+24} has recently gained prominence in generative modeling, particularly within the framework of Continuous Normalizing Flows (CNFs). FM offers a simulation-free approach to training CNFs by regressing vector fields along fixed conditional probability paths, thereby enhancing scalability and performance in generative tasks~\cite{flow_matching}. Building upon this foundation, \cite{tfm+23} developed Conditional Flow Matching (CFM), a family of simulation-free training objectives for CNFs that supports conditional generative modeling and accelerates both training and inference. A notable variant within CFM, Optimal Transport Conditional Flow Matching (OT-CFM), approximates dynamic optimal transport in a simulation-free manner, leading to more efficient and stable training. A closely related development is Rectified Flow, which learns an ODE that transports samples along paths that are as straight as possible and can iteratively straighten the learned trajectories, so that sampling requires fewer integration steps. Rectified Flow is compatible with other flow-matching methods, further extending the utility of FM in diverse applications. Several recent works extend flow matching in other directions. \cite{hppa24} proposed Wasserstein Flow Matching, which extends flow matching to families of distributions and broadens its applicability to fields such as computer graphics and genomics. \cite{ccl+25} incorporate special-relativity constraints into flow matching. Our work is also inspired by a number of other recent works \citep{xzc+22,dwb+23,pbd+23,wsd+23,wcz+23,wxz+24,cl24,kkn24,brsr24}.
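
For concreteness, the regression objective described above can be sketched as follows. The notation here ($v_\theta$ for the learned vector field, $u_t$ for the conditional target field, $p_0$ for the source distribution, and $q$ for the data distribution) is an illustrative assumption and is not meant to fix the notation used in later sections. A generic conditional flow matching loss takes the form
\[
\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_1 \sim q,\; x \sim p_t(\cdot \mid x_1)} \big\| v_\theta(t, x) - u_t(x \mid x_1) \big\|_2^2 .
\]
As one simple instance, assuming a straight-line path $x_t = (1-t)\,x_0 + t\,x_1$ with $x_0 \sim p_0$ drawn independently of $x_1 \sim q$, the path velocity is $\frac{\mathrm{d}}{\mathrm{d}t} x_t = x_1 - x_0$, and the loss becomes
\[
\mathbb{E}_{t,\, x_0,\, x_1} \big\| v_\theta(t, x_t) - (x_1 - x_0) \big\|_2^2 ,
\]
which can be estimated from samples without simulating the ODE during training.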

{\bf Diffusion Models.} 
Generative models have long been a central topic in deep learning~\citep{kw14,gpm+20,lwz+22,ccp+23}. Empowered by recent advances in Vision Transformers~\citep{dbk+20,zlz21,px23,bnx+23}, diffusion models have achieved unprecedented success in generative modeling, producing high-fidelity visual content with applications in a wide range of real-world scenarios, such as image generation~\citep{sme20,rbl+22,cgh+25}, video generation~\citep{ytz+24,csy25,ghh+25,ghs+25_physical}, text editing~\citep{kzl+23,gpv+24,ghs+25_text}, and e-commerce~\citep{wxf+23,zwx+24,lzw+24}.
These approaches typically involve a forward process that systematically adds noise to an initial clean image and a corresponding reverse process that learns to remove noise step by step, thereby recovering the underlying data distribution in a probabilistic manner. Early works~\citep{se19,sme20,dvk22} established the theoretical foundations of this denoising strategy, introducing score-matching and continuous-time diffusion frameworks that significantly improved sample quality and diversity. Subsequent research has focused on more efficient training and sampling procedures~\citep{lzb+22,wzkz23,ssz+24_dit,ssz+24_pruning}, aiming to reduce computational overhead and converge faster without sacrificing image fidelity. Other lines of work leverage latent spaces to learn compressed representations, thereby streamlining both training and inference~\citep{rbl+22,hwsl24}. This latent learning approach integrates naturally with modern neural architectures and can be extended to modalities beyond images, showcasing the versatility of diffusion processes in modeling complex data distributions. In parallel, recent works~\citep{lkw+24,fmzz24,rckc24,jzx+25,lyhz24} have explored multi-scale noise scheduling and adaptive step-size strategies to enhance convergence stability and preserve high-resolution detail in generated content. On the theoretical side, \cite{rpcs23,hwsl24,wxhl24,hwl+24} analyze diffusion models and point out directions for future research.
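
As a brief illustration of the forward and reverse processes described above, one common discrete-time instance can be written as follows. The variance schedule $\beta_t \in (0,1)$ and the parameterized mean $\mu_\theta$ and covariance $\Sigma_\theta$ are notation assumed here for illustration only; the cited works use a variety of formulations. The forward process adds Gaussian noise step by step,
\[
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t I\big), \qquad t = 1, \dots, T,
\]
while the learned reverse process removes it,
\[
p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\big).
\]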

{\bf Large Language Models.}
Neural networks built upon the Transformer architecture~\citep{vsp+17} have swiftly risen to dominate modern machine learning approaches in natural language processing. Large Transformer models, trained on broad and voluminous datasets and containing billions of parameters, are often termed large language models (LLMs) or foundation models~\citep{bha+21}. Representative examples include BERT~\citep{dclt19}, PaLM~\citep{cnd+22}, Llama~\citep{tli+23}, ChatGPT~\citep{chatgpt}, and GPT-4~\citep{o23}. These LLMs have demonstrated striking general-purpose abilities~\citep{bce+23} across various downstream tasks. Numerous adaptation methods have been developed to tailor LLMs to specific applications, such as adapters~\citep{hsw+22,zhz+23,ghz+23,zjk+23,hsk+24,cs25}, calibration schemes~\citep{zwf+21,cpp+23}, multitask fine-tuning~\citep{gfc+21,vnr+23,zzj+23}, prompt optimization~\citep{gfc+21,lac+21,hwg+24}, scratchpad approaches~\citep{naa+21}, instruction tuning~\citep{ll21,chl+22,mkd+22}, symbol tuning~\citep{whl+23}, black-box tuning~\citep{ssy+22}, in-context learning~\citep{wws+22,wsh+24,whhz+25}, and reinforcement learning from human feedback (RLHF)~\citep{owj+22}. Other lines of research aim to improve model efficiency without sacrificing performance across diverse domains~\citep{lls+24_prune,llsz24_nn_tw,cll+24_ssm,chl+24_rope_grad,kll+24}. An emerging line of work studies the theoretical properties and limitations of these models, including the infeasibility of efficient computation when model weights have sufficiently large magnitude~\citep{as23,as24,as24_tensor,as25,as25_rank}, circuit complexity~\citep{llzm24,lls+24_grok,cll+24_rope,lls+25}, the infeasibility of learning certain Boolean functions under gradient descent~\citep{cssz25,hzs+25,ks25}, universal approximation~\citep{kzld22,cll+25_var,lhsl25,hlcw+25}, and in-context learning~\citep{wsh+24,whhz+25,cll+24_icl}.

{\bf Second-Order Methods.}
Second-order methods have been applied to neural network optimization across a range of problems. \cite{m10} introduced Hessian-free optimization, using conjugate gradients to approximately solve for the Newton update. \cite{vp12} used a Krylov subspace descent method to directly approximate the Newton update. For natural gradient methods, which perform steepest descent in the space of network outputs rather than parameters, \cite{a98} showed a connection to second-order optimization via the Fisher information matrix. \cite{gm16} later extended the Kronecker-factored approximate curvature (K-FAC) approximation of the Fisher information matrix to convolutional neural networks. \cite{bgm22} combined natural gradient with trust region methods to further improve stability and performance. Despite these advances, second-order neural network optimization remains an active area of research~\citep{s19,dswy22_coreset,swy23,gswy23,gsy23_hyper,gsy23_coin,bsy23,dms23_spar,gms23,ssx23_nns,qss23_gnn,cls+24,cll+24_icl,lss+24_multi_layer,cll+24_rope,lsss24_dp_ntk,lss+24_relu,kll+25}. Open problems include improving the scalability of Hessian approximations, handling non-convex optimization landscapes, and automating hyper-parameter selection.
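
To make the updates mentioned above concrete, a minimal sketch is given below. The symbols $L$ (training loss), $\theta_k$ (parameters at iteration $k$), $\eta$ (step size), and $p_\theta$ (model distribution) are notation assumed here for illustration and are not tied to the cited works. With gradient $g_k = \nabla L(\theta_k)$ and Hessian $H_k = \nabla^2 L(\theta_k)$, the Newton update is
\[
\theta_{k+1} = \theta_k + d_k, \qquad \mbox{where } d_k \mbox{ solves } H_k\, d_k = -g_k .
\]
Hessian-free and Krylov subspace approaches solve this linear system approximately using only Hessian-vector products, without forming $H_k$ explicitly. The natural gradient update replaces $H_k$ with the Fisher information matrix,
\[
\theta_{k+1} = \theta_k - \eta\, F_k^{-1} g_k, \qquad F_k = \mathbb{E}_{x \sim p_{\theta_k}} \big[ \nabla_\theta \log p_{\theta_k}(x)\, \nabla_\theta \log p_{\theta_k}(x)^\top \big].
\]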


{\bf Roadmap.} In Section~\ref{sec:pre}, we introduce essential computational techniques and key definitions of flow matching and our NRFlow. In Section~\ref{sec:result}, we provide a detailed regularity analysis and derive an upper bound on the excess risk. We also provide an inequality that quantifies the growth of the estimation error under bounded noise. In Section~\ref{sec:experiment}, we present preliminary experiments that support our theoretical results. We conclude in Section~\ref{sec:conclusion}.
