\section{Methodology}
The short-term FTP task can be formulated as a Multivariate Time Series (MTS) forecasting problem. We are given a sequence of historical observations $\mathbf{X} = \left\{\mathbf{x}_1,\dots,\mathbf{x}_L\right\} \in \mathbb{R}^{C \times L}$,
where $C$ is the state dimension, $L$ is the look-back window size, and $\mathbf{x}_t \in \mathbb{R}^{C \times 1}$ is the flight state at time step $t$. The task is to predict the next $T$ time steps $\hat{\mathbf{Y}} = \left\{\hat{\mathbf{x}}_{L+1},\dots,\hat{\mathbf{x}}_{L+T}\right\} \in \mathbb{R}^{C' \times T}$, where $C'$ is the predicted state dimension. In this work, the flight state $\mathbf{x}_t$ consists of longitude, latitude, altitude, and the velocities along these three dimensions, i.e., $\mathbf{x}_t =(Lon_t, Lat_t, Alt_t, Vx_t, Vy_t, Vz_t)^\top$.

The overall architecture of \textbf{FlightPatchNet} is shown in Figure~\ref{fig:model}, which consists of Global Temporal Embedding, Multi-Scale Patch Network, and Predictors. \textbf{Global Temporal Embedding} first utilizes differential coding to transform the original values of longitude and latitude into first-order differences and embeds all variables of the same time step into temporal tokens. Global temporal attention is then introduced to capture the inherent dependencies between different tokens. \textbf{Multi-Scale Patch Network} is proposed to serve as the backbone which is composed of stacked patch mixer blocks and a multi-scale aggregator. Stacked patch mixer blocks divide trajectory series into patches of different sizes from large scale to small scale. Based on divided patches, each patch mixer block exploits a patch encoder and decoder to capture inter- and intra-patch dependencies, endowing our model with powerful temporal modeling capability. To further integrate multi-scale temporal patterns, a multi-scale aggregator is incorporated into the network to capture scale-wise correlations and inter-variable relationships. \textbf{Predictors} provide direct multi-step trajectory forecasting and each predictor is a fully connected network. All the predictor results are aggregated to reconstruct the final prediction trajectory. 
\begin{figure*}[h]
    \centering
    \includegraphics[width=0.85\linewidth]{figure/main_model.pdf}
    \captionsetup{font=small}
    \caption{FlightPatchNet architecture. (a) Global Temporal Embedding to explore the correlations between different time steps. (b) Multi-Scale Patch Network to capture inter- and intra-patch dependencies under different time scales and integrate temporal features across scales and variables. (c) Predictors to exploit complementary temporal features and make direct multi-step prediction. }
    \label{fig:model}
\end{figure*}
 
\subsection{Global Temporal Embedding}
\paragraph{Differential Coding} In the WGS84 coordinate system, longitude and latitude are limited to the intervals $[-180^\circ, 180^\circ]$ and $[-90^\circ, 90^\circ]$, respectively, while altitude can span from 0 up to tens of thousands of meters. These large differences in data range, which stem from the physical units, may impair trajectory prediction performance. Normalization algorithms are generally applied to address this issue. However, normalized prediction errors must be transformed back to the raw data range to evaluate the actual performance. For example, if the absolute prediction error of longitude is $10^{-4}$ after Min-Max normalization, the actual prediction error is $0.036^\circ$ (approximately 4000 meters). Moreover, as shown in Figure 1, the original series of longitude and latitude are overly smooth and reflect only the overall flight trend over a period. If temporal patterns are learned from the original values of longitude and latitude, the model fails to explore the implicit semantic information and cannot focus on short-term temporal variations in flight trajectories.
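
The arithmetic behind the Min-Max example can be checked with a few lines of Python. This is only an illustrative sketch: the constant \texttt{METERS\_PER\_DEG} is an assumed approximation of the arc length of one degree (valid for longitude only near the equator), and the helper name is hypothetical.

```python
# Worked arithmetic for the Min-Max example in the text.
LON_RANGE_DEG = 360.0        # longitude spans [-180, 180] degrees
METERS_PER_DEG = 111_320.0   # ASSUMPTION: approx. meters per degree of arc at the equator


def denormalized_error(normalized_error, value_range):
    """Map an absolute error in Min-Max normalized units back to raw units."""
    return normalized_error * value_range


err_deg = denormalized_error(1e-4, LON_RANGE_DEG)  # error in degrees
err_m = err_deg * METERS_PER_DEG                   # rough error in meters
print(f"{err_deg:.3f} deg ~ {err_m:.0f} m")
```

A seemingly negligible normalized error thus corresponds to a position error of several kilometers, which motivates working with differences expressed in meters.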

To address the above issues, we use first-order differences for longitude and latitude and keep the original values of the other variables. The differential values are then converted into meters. This process can be formulated as:
\begin{equation}\label{equ:differential}
\left\{
    \scalebox{0.9}{$\begin{aligned}
      \Delta_{\mathit{Lon}} &= 2R\times \arcsin\left(\sqrt{\cos^2(\varphi_{t-1})\sin^2\left(\frac{\phi_{t}-\phi_{t-1}}{2}\right)}\right)  \\
      \Delta_{\mathit{Lat}} &= 2R \times \arcsin\left(\sqrt{\sin^2\left(\frac{\varphi_{t}-\varphi_{t-1}}{2}\right)}\right) \\
    \end{aligned}$}
    \right.
\end{equation}
where $\phi$ denotes the longitude, $\varphi$ denotes the latitude, and $R$ is the radius of the earth. Differential coding of longitude and latitude effectively reduces the differences in data range. For example, in our dataset, the range of latitude is about $[-46^\circ, 70^\circ]$ in the original data and about $[-3860\,\mathrm{m}, 3860\,\mathrm{m}]$ in the differential data, which is comparable to the range of the altitude. Compared to the original sequences, the differential series explicitly reflect the underlying temporal variations, which is essential for short-term temporal modeling. Besides, adopting first-order differences instead of second- or higher-order differences enables the model to reconstruct the predicted trajectory based on the last observation alone. Note that
we use the original values of altitude as inputs rather than differential values. One important reason is that altitude is more susceptible to noise, so its differences fail to reflect the actual temporal variations. As a result, the flight state at time step $t$ becomes $ \mathbf{x}_t=(\Delta_{\mathit{Lon}},\Delta_{\mathit{Lat}}, Alt_t, Vx_t, Vy_t, Vz_t)^\top$.
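
A minimal NumPy sketch of the differential coding in Equation (\ref{equ:differential}) is given below. It rests on several assumptions that are not stated in the text: the Earth radius value, the function name \texttt{differential\_coding}, and the re-attachment of the sign of each raw difference (the square-root form yields non-negative magnitudes, whereas the reported differential range is signed). Longitude wrap-around at $\pm 180^\circ$ is not handled.

```python
import numpy as np

EARTH_RADIUS_M = 6_371_000.0  # ASSUMPTION: mean Earth radius in meters


def differential_coding(lon_deg, lat_deg, radius=EARTH_RADIUS_M):
    """First-order differences of longitude/latitude, expressed in meters.

    Returns two arrays of length len(lon_deg) - 1. The sign of each raw
    difference is re-attached (an assumption), because the haversine-style
    magnitudes are non-negative by construction.
    """
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    d_lon = np.diff(lon)
    d_lat = np.diff(lat)
    prev_lat = lat[:-1]  # latitude at time step t-1
    mag_lon = 2 * radius * np.arcsin(
        np.sqrt(np.cos(prev_lat) ** 2 * np.sin(d_lon / 2) ** 2)
    )
    mag_lat = 2 * radius * np.arcsin(np.sqrt(np.sin(d_lat / 2) ** 2))
    return np.sign(d_lon) * mag_lon, np.sign(d_lat) * mag_lat


# Three consecutive positions (degrees); values chosen only for illustration.
delta_lon, delta_lat = differential_coding(
    [10.00, 10.01, 10.03], [50.00, 50.00, 49.99]
)
```

At a latitude of $50^\circ$, a longitude step of $0.01^\circ$ maps to roughly 715 meters, while the same step in latitude maps to roughly 1112 meters, so both series end up on a scale comparable to altitude.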

\paragraph{Global Temporal Attention} 
Given the trajectory series $\mathbf{X} \in \mathbb{R}^{C \times L} $,  we first project flight state at each time step into $d$ dimension to generate temporal embeddings $ \mathbf{T}^0  \in \mathbb{R}^{L\times d}$. Then, we apply multi-head self-attention (MSA) \citep{Vaswani2017AttentionIA} on the dimension $L$ to capture the dependencies across all time steps. After attention, the embedding at each time step is enriched with temporal information from other time steps.  This process is formulated as:
\begin{equation}
    \scalebox{0.9}{$
      \begin{aligned}
      \mathbf{T}^0 &= \mathit{TimeEmbedding}(\mathbf{X}^\top) \\
       \mathbf{T}^{i}&=\mathit{LayerNorm}(\mathbf{T}^{i-1}+\mathit{MSA}(\mathbf{T}^{i-1})), \quad i=1,\dots,l \\
       \mathbf{T}^{i}&=\mathit{LayerNorm}(\mathbf{T}^{i}+FC(\mathbf{T}^{i})), \quad i=1,\dots,l \\
       \mathbf{Z} &=(\mathit{Linear}(\mathbf{T}^{l}))^\top
      \end{aligned}$}
  \end{equation}
 where $l$ is the number of attention layers, $LayerNorm$ denotes the layer normalization \citep{ba2016layer} which has been widely adopted to address non-stationary issues, $MSA$ is the multi-head self-attention layer, $FC$ denotes a fully-connected layer and  $Linear$ projects the embedding of each time step to dimension $C$, i.e., $\mathbb{R}^{d} \rightarrow \mathbb{R}^C$. 
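
The data flow of the global temporal attention can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: it uses a single attention head instead of multi-head attention, random matrices in place of learned weights, and a layer normalization without learnable scale and shift. All function names and the default sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the token axis (rows)."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v


def random_weight(n_in, n_out):
    # Random stand-in for a learned projection matrix (assumption).
    return rng.normal(scale=n_in ** -0.5, size=(n_in, n_out))


def global_temporal_embedding(x, d=16, layers=2):
    """x: (C, L) trajectory -> Z: (C, L) temporally mixed representation."""
    C, L = x.shape
    t = x.T @ random_weight(C, d)                    # TimeEmbedding: (L, d)
    for _ in range(layers):
        attn = self_attention(
            t, random_weight(d, d), random_weight(d, d), random_weight(d, d)
        )
        t = layer_norm(t + attn)                     # attention sub-layer
        t = layer_norm(t + t @ random_weight(d, d))  # FC sub-layer
    return (t @ random_weight(d, C)).T               # Linear: d -> C, transpose


Z = global_temporal_embedding(rng.normal(size=(6, 12)))  # C=6, L=12
```

Each of the $L$ rows of \texttt{t} is one temporal token, so the attention weights form an $L \times L$ matrix relating every time step to every other.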

\subsection{Multi-Scale Patch Network}
Since different temporal patterns are best captured at different time scales, the multi-scale patch network first uses a stack of $K$ patch mixer blocks to capture underlying temporal patterns from large scale to small scale. A large time scale reflects slowly varying flight trends, while a smaller scale retains fine-grained local details. To further promote the collaboration of diverse temporal features, a multi-scale aggregator is introduced to weigh the contributing scales and the dominant variables. Such a multi-scale network equips our model with a more complete temporal modeling capability and helps preserve characteristics at every scale.

\subsubsection{Patch Mixer Block}
\paragraph{Patching} Only considering one single time step is insufficient for the FTP task, since it contains limited semantic information and cannot accurately reflect the flight trajectory variations. Inspired by PatchTST \citep{Yuqietal-2023-PatchTST}, the trajectory representation $\mathbf{Z} \in\mathbb{R}^{C\times L}$ is segmented into several non-overlapping patches along the temporal dimension, generating a sequence of patches $\mathbf{Z}_p\in \mathbb{R}^{C\times P\times N}$, where $P$ is the length of each patch, $N$ represents the number of patches, and $N=\left\lceil\frac{L}{P}\right\rceil$. The patching process is formulated as:
\begin{equation}
 {\mathbf{Z}}_p =  {Reshape}({ZeroPadding}(\mathbf{Z})) 
\end{equation}
where $ZeroPadding(\cdot)$ pads the series with zeros at the beginning so that its length is divisible by $P$, and $Reshape(\cdot)$ rearranges the padded series into $N$ patches of length $P$.
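
The patching step can be illustrated with the short NumPy sketch below. The function name \texttt{make\_patches} and the exact memory layout of the reshape are assumptions; the text only fixes the output shape $C \times P \times N$ and the front zero padding.

```python
import numpy as np


def make_patches(z, patch_len):
    """Split z of shape (C, L) into non-overlapping patches of shape (C, P, N)."""
    C, L = z.shape
    n_patches = -(-L // patch_len)            # ceil(L / P)
    pad = n_patches * patch_len - L
    z_padded = np.pad(z, ((0, 0), (pad, 0)))  # zeros at the start of the time axis
    # (C, N*P) -> (C, N, P) -> (C, P, N): slice [:, :, n] holds the n-th patch
    return z_padded.reshape(C, n_patches, patch_len).transpose(0, 2, 1)


z = np.arange(1, 11, dtype=float).reshape(1, 10)  # C=1, L=10
z_p = make_patches(z, patch_len=4)                # P=4, N=3, two zeros padded
```

With $L=10$ and $P=4$, two zeros are prepended, giving $N=\lceil 10/4 \rceil = 3$ patches; the most recent observations always fill the last patch completely.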

\paragraph{Patch Encoder-Decoder}
Based on the divided patches $\mathbf{Z}_p $, we utilize a patch encoder and decoder to capture temporal features in flight trajectories. Specifically, the patch encoder aims to capture the inter-patch features (i.e., the global correlations across patches) and intra-patch features (i.e., the local details within patches). After that, these features are reconstructed to the original dimension by the patch decoder. Due to the superiority of linear models for MTS \citep{chen2023tsmixer, zeng2023transformers}, the patch encoders and decoders are based on pure multi-layer perceptron (MLP) for temporal modeling.

\begin{figure}[htbp]
    \centering    
    \includegraphics[width=0.98\linewidth]{figure/patchmixerv3.pdf}
    \captionsetup{font=small}
    \caption{The structure of Patch Mixer Block}   
    \label{fig:patchmixer}   
\end{figure}
As illustrated in Figure~\ref{fig:patchmixer}, a patch encoder consists of an inter-patch MLP, an intra-patch MLP, and a linear projection. Each MLP has two fully connected layers, a GELU non-linearity layer and a dropout layer with a residual connection.

Given the patch-divided series $\mathbf{Z}_p$, the inter-patch MLP operates along the dimension $N$ to capture the dependencies between different patches. It maps $ \mathbb{R}^{N} \rightarrow \mathbb{R}^{N}$ to obtain the inter-patch mixed representation $ \mathbf{N}_{inter} \in \mathbb{R}^{C \times P\times N}$:
\begin{equation}
     \mathbf{N}_{inter} = \mathbf{Z}_{p} + Dropout(FC(\sigma(FC( \mathbf{Z}_p)))) 
\end{equation}
where $\sigma$ denotes a GELU non-linearity layer, $Dropout$ denotes a dropout layer, and $\mathbf{N}_{inter}$ reflects the global correlations across patches. After that, the intra-patch MLP operates along the dimension $P$ to capture the dependencies across different time steps within patches. It maps $ \mathbb{R}^{P} \rightarrow \mathbb{R}^{P}$ to obtain the intra-patch mixed representation $ \mathbf{N}_{intra} \in \mathbb{R}^{C \times N\times P}$:
\begin{equation}
    \mathbf{N}_{intra} = \mathbf{N}_{inter}^\top + Dropout(FC(\sigma(FC( \mathbf{N}_{inter}^\top))))
\end{equation} where $ \mathbf{N}_{intra} $ reflects the local details between different time steps within patches. Then, we perform a linear projection on  $\mathbf{N}_{intra}^\top$ to obtain the final inter- and intra-patch mixed  representation $\mathbf{E}$ $\in \mathbb{R}^{C \times P\times 1}$:
\begin{equation}
     \mathbf{E} = \mathit{Linear}( \mathbf{N}_{intra}^\top)
\end{equation}
After such a patch encoding process, the correlations between nearby time steps within patches and distant time steps across patches are finely explored. Then, we utilize a patch decoder to reconstruct the original sequence. A patch decoder comprises the same components as the encoder in a reverse order, which is formulated as follows:
\begin{equation}
    \begin{aligned}
 \mathbf{D} &= Linear( \mathbf{E}) \\
 \mathbf{P}_{intra} &= \mathbf{D}^\top + Dropout(FC(\sigma(FC(\mathbf{D}^\top))))\\
  \mathbf{P} &= \mathbf{P}_{intra}^\top + Dropout(FC(\sigma(FC(\mathbf{P}_{intra}^\top))))
    \end{aligned}
\end{equation}
where $Linear$ makes a dimensional projection to obtain $ \mathbf{D} \in \mathbb{R}^{C \times P\times N}$ for reconstructing the original sequence, $\mathbf{P}_{intra} \in \mathbb{R}^{C \times N\times P}$ is the reconstructed intra-patch mixed representation, and $\mathbf{P} \in \mathbb{R}^{C \times P\times N} $ is the final reconstructed intra- and inter-patch mixed representation.
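
The shape bookkeeping of the patch encoder and decoder can be traced with the NumPy sketch below. It is a hedged illustration under several assumptions: weights are random stand-ins for learned parameters, dropout is treated as the identity (inference mode), GELU uses its tanh approximation, and the hidden width and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))


def linear(n_in, n_out):
    """Affine map on the last axis with random stand-in weights (assumption)."""
    w = rng.normal(scale=n_in ** -0.5, size=(n_in, n_out))
    b = np.zeros(n_out)
    return lambda x: x @ w + b


def residual_mlp(dim, hidden):
    """x + FC(GELU(FC(x))) on the last axis; dropout is the identity here."""
    fc1, fc2 = linear(dim, hidden), linear(hidden, dim)
    return lambda x: x + fc2(gelu(fc1(x)))


def swap(x):
    """Transpose the last two axes, i.e. (C, P, N) <-> (C, N, P)."""
    return np.swapaxes(x, -1, -2)


def patch_mixer_block(P, N, hidden=32):
    enc_inter, enc_intra = residual_mlp(N, hidden), residual_mlp(P, hidden)
    enc_proj, dec_proj = linear(N, 1), linear(1, N)
    dec_intra, dec_inter = residual_mlp(P, hidden), residual_mlp(N, hidden)

    def forward(z_p):                       # z_p: (C, P, N)
        n_inter = enc_inter(z_p)            # mix across patches: (C, P, N)
        n_intra = enc_intra(swap(n_inter))  # mix within patches: (C, N, P)
        e = enc_proj(swap(n_intra))         # encoded: (C, P, 1)
        d = dec_proj(e)                     # expanded: (C, P, N)
        p_intra = dec_intra(swap(d))        # (C, N, P)
        p = dec_inter(swap(p_intra))        # reconstructed: (C, P, N)
        return e, p

    return forward


E, P_out = patch_mixer_block(P=4, N=3)(rng.normal(size=(6, 4, 3)))  # C=6
```

The transposes mirror the $^\top$ operations in the equations: each MLP always acts on the last axis, and swapping the last two axes selects whether that axis is the patch index $N$ or the within-patch position $P$.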

\subsubsection{Multi-Scale Aggregator} 
To enable more complete multi-scale modeling, we introduce a multi-scale aggregator to integrate different temporal patterns. It contains two components: scale fusion and channel fusion. Scale fusion identifies critical time scales and captures the scale-wise correlations, while channel fusion discovers dominant variables affecting temporal variations and explores the inter-variable relationships. These two components work together to help the model learn a robust multi-scale representation and improve generalization ability.
Given the $K$ scale-specific temporal representations $\{\mathbf{P}_1,\mathbf{P}_2,\dots,\mathbf{P}_K\}$, we first stack them and rearrange the data to combine the three dimensions of channel size $C$, patch size $P$ and patch quantity $N$, resulting in $\mathbf{S}^0 \in \mathbb{R}^{K\times (C\times L)}$, where $L = P \times N$. Then we apply MSA on the scale dimension $K$ to learn the importance of the contributing time scales. This process is formulated as:
\begin{equation}
      \begin{aligned}
      \mathbf{S}^0&= Reshape(Stack(\mathbf{P}_1,\mathbf{P}_2,\dots,\mathbf{P}_K) ) \\
       \mathbf{S}^{i}&=LayerNorm(\mathbf{S}^{i-1}+MSA(\mathbf{S}^{i-1})), \quad i=1,\dots,l \\
       \mathbf{S}^{i}&=LayerNorm(\mathbf{S}^{i}+FC(\mathbf{S}^{i})), \quad i=1,\dots,l
      \end{aligned}
  \end{equation}
where $\mathbf{S}^l$ is the final multi-scale fusion representation within variables.
Inspired by iTransformer \citep{liu2023itransformer}, we consider each variable as a token and apply MSA to explore dependencies between different variables. We first reshape $\mathbf{S}^l$ to obtain $\mathbf{C}^0 \in \mathbb{R}^{C \times (K\times L)}$ and then perform multi-head self-attention on the channel dimension $C$ to identify dominant variables. This process is formulated as follows:
\begin{equation}
    \begin{aligned}
        \mathbf{C}^0 &= Reshape(\mathbf{S}^l) \\
        \mathbf{C}^{i} &=LayerNorm(\mathbf{C}^{i-1}+ \mathit{MSA}(\mathbf{C}^{i-1})), \quad i=1,\dots,l \\
         \mathbf{C}^{i}&=\mathit{LayerNorm}(\mathbf{C}^{i}+FC(\mathbf{C}^{i})), \quad i=1,\dots,l \\
         \mathbf{H} &= Reshape(\mathbf{C}^l) 
    \end{aligned}
\end{equation} where $\mathbf{H} \in \mathbb{R}^{ C  \times L \times K}$ is the final multi-scale representation which involves cross-scale correlations and inter-variable relationships.
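
The reshaping logic of scale fusion and channel fusion can be sketched in NumPy as follows. This is an illustrative simplification rather than the authors' code: a single attention head replaces MSA, weights are random stand-ins, layer normalization has no learnable parameters, and no padding is assumed so that $P_k \times N_k = L$ for every scale. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)


def attention_layer(tokens):
    """One post-norm layer: single-head self-attention, then an FC sub-layer.

    tokens: (num_tokens, dim); weights are random stand-ins (assumption).
    """
    dim = tokens.shape[-1]
    w_q, w_k, w_v, w_fc = (
        rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(4)
    )
    scores = (tokens @ w_q) @ (tokens @ w_k).T / np.sqrt(dim)
    scores = np.exp(scores - scores.max(-1, keepdims=True))
    attn = scores / scores.sum(-1, keepdims=True)
    tokens = layer_norm(tokens + attn @ (tokens @ w_v))
    return layer_norm(tokens + tokens @ w_fc)


def multi_scale_aggregator(patch_reprs, layers=1):
    """patch_reprs: K arrays of shape (C, P_k, N_k), P_k * N_k == L -> (C, L, K)."""
    # (C, P, N) -> (C, N, P) -> (C, L) restores the time order before stacking
    flat = [p.transpose(0, 2, 1).reshape(p.shape[0], -1) for p in patch_reprs]
    stacked = np.stack(flat)                       # (K, C, L)
    K, C, L = stacked.shape
    s = stacked.reshape(K, C * L)                  # one token per scale
    for _ in range(layers):
        s = attention_layer(s)                     # scale fusion over K
    c = s.reshape(K, C, L).transpose(1, 0, 2).reshape(C, K * L)  # one token per variable
    for _ in range(layers):
        c = attention_layer(c)                     # channel fusion over C
    return c.reshape(C, K, L).transpose(0, 2, 1)   # (C, L, K)


scales = [rng.normal(size=(6, p, 12 // p)) for p in (6, 4, 2)]  # K=3, L=12
H = multi_scale_aggregator(scales)
```

The only difference between the two fusion stages is which axis plays the role of the token axis: $K$ tokens of size $C \times L$ for scale fusion, and $C$ tokens of size $K \times L$ for channel fusion.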
\subsection{Direct Multi-Step Prediction}
We ensemble $K$ predictors to directly obtain the future flight trajectory series, which can exploit complementary information from different temporal patterns. The objective of our model is to predict the differential values of longitude and latitude relative to the last observation, and the raw absolute values of altitude, i.e., $\hat{\mathbf{Y}} = \left\{\hat{\mathbf{x}}_{L+1},...,\hat{\mathbf{x}}_{L+T}\right\}$, where 
$ \hat{\mathbf{x}}_{L+i} =( \hat{\Delta}^{\mathit{Lon}}(L+i,L) ,  \hat{\Delta}^{\mathit{Lat}}(L+i,L),\hat{Alt}_{L+i})^\top$ for $i=1,\dots,T$. 
We split the final multi-scale representation $\mathbf{H} \in \mathbb{R}^{ C  \times L \times K}$ into a sequence $\left\{ \mathbf{H}_{*, 1},\mathbf{H}_{*, 2},\dots,\mathbf{H}_{*, K}\right\}$, where $\mathbf{H}_{*, i}\in \mathbb{R}^{ C  \times L}$ for $i=1,\dots,K$, and feed each $\mathbf{H}_{*, i}$ to its own predictor. Each predictor has two MLPs. The first, $MLP_{C_i}$, transforms the $C$ input channels into $C'$ output channels, and the second, $MLP_{T_i}$, projects the look-back length $L$ to the prediction horizon $T$:
\begin{equation}
    \begin{aligned}
        \hat{\mathbf{Y}}_i &= MLP_{T_i}(MLP_{C_i}(\mathbf{H}_{*, i})) \\
        \hat{\mathbf{Y}} &= \sum\limits_{i=1}^K \hat{\mathbf{Y}}_i
    \end{aligned}
\end{equation}
Finally, the outputs of all predictors are summed, which can enhance the stability and generalization of our model, and the final predicted trajectory is reconstructed from the predicted differences according to Equation (\ref{equ:differential}).
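
The prediction and reconstruction steps can be sketched in NumPy as follows. Several points are assumptions made for illustration only: each MLP is replaced by a single random linear map, the predicted longitude and latitude offsets are taken to be signed distances in meters relative to the last observation, the reconstruction is obtained by algebraically inverting Equation (\ref{equ:differential}) with $\varphi_{t-1}$ set to the last observed latitude, and the Earth radius value and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
EARTH_RADIUS_M = 6_371_000.0  # ASSUMPTION: mean Earth radius in meters


def make_predictor(C, C_out, L, T):
    """Two linear maps standing in for MLP_C (C -> C') and MLP_T (L -> T)."""
    w_c = rng.normal(scale=C ** -0.5, size=(C_out, C))
    w_t = rng.normal(scale=L ** -0.5, size=(L, T))
    return lambda h: (w_c @ h) @ w_t  # (C, L) -> (C', T)


def predict(H, predictors):
    """Sum the outputs of K predictors, one per scale slice H[:, :, i]."""
    return sum(f(H[:, :, i]) for i, f in enumerate(predictors))


def reconstruct(y_hat, last_lon_deg, last_lat_deg, radius=EARTH_RADIUS_M):
    """Invert the differential coding relative to the last observation.

    y_hat rows: signed lon offset (m), signed lat offset (m), altitude.
    """
    d_lon_m, d_lat_m, alt = y_hat
    lat0 = np.radians(last_lat_deg)
    lat = last_lat_deg + np.degrees(d_lat_m / radius)
    ratio = np.clip(np.sin(d_lon_m / (2 * radius)) / np.cos(lat0), -1.0, 1.0)
    lon = last_lon_deg + np.degrees(2 * np.arcsin(ratio))
    return np.stack([lon, lat, alt])


C, C_out, L, T, K = 6, 3, 12, 5, 3
H = rng.normal(size=(C, L, K))
Y_hat = predict(H, [make_predictor(C, C_out, L, T) for _ in range(K)])

# A northward offset of R * radians(0.01) meters should move latitude by 0.01 deg.
offsets = np.array([[0.0], [EARTH_RADIUS_M * np.radians(0.01)], [9000.0]])
track = reconstruct(offsets, last_lon_deg=10.0, last_lat_deg=50.0)
```

Because every predicted offset is expressed relative to the same last observation, errors do not accumulate across the $T$ steps as they would with step-by-step differences.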
