\section{Method}
\input{img/fig1}
In this section, we introduce the proposed multilevel stacked recurrent (MSR) interaction and spurious correlation elimination (SCE) modules, the CTR predictor, the loss function, and the theory behind the sample reweighting method in the SCE module.

\subsection{Multilevel stacked recurrent (MSR) interaction}
\label{module2}
The adoption of a two-stream model structure for CTR prediction stems from insights gleaned from prior research and practical applications~\citep{wang2023towards,mao2023finalmlp,wang2021masknet,wang2021dcn}; this structure helps enhance the expressiveness of the constructed model.
Following previous works, the proposed MSR method contains two streams: a deeply stacked recurrent (D-SR) stream and a shallow stacked recurrent (S-SR) stream.
As shown in Figure~\ref{fig1}, each stream includes multiple stacked recurrent blocks with a self-attention structure and varying attenuation coefficients, thus enhancing the depth-based pattern and dependency learning process. 
The input feature of both D-SR and S-SR is denoted as $x_0$, and the outputs of D-SR and S-SR are denoted as $F_d$ and $F_s$, respectively. 
The calculation process is defined as: $F_d =\text{D-SR}(x_0)$, $F_s = \text{S-SR}(x_0)$.
We take the $i$-th block of the D-SR stream as an example. 
Given an input $x_{i-1}$, we first project it with a one-dimensional projection weight $w_v$, $V_i = x_{i-1}\cdot w_v$, and then map $V_i$ to $x_i$ through the state $S_i$. 
We recurrently calculate the output as follows: 
\begin{equation}
    S_i = r_i S_{i-1}+K_i^T V_i, 
\end{equation}
\begin{equation}
    x_{i}=\text{SR}_{i}(x_{i-1})=Q_i S_i, 
\end{equation}
where $\text{SR}_{i}$ denotes the $i$-th block, $Q_i$, $K_i$, and $V_i$ are projections, $r_i$ is the attenuation coefficient, and $i \in \{1,2,\ldots,M\}$. 
We further add a swish gate~\citep{ramachandran2017swish} in each stream to increase the non-linearity of MSR layers. 
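As an illustration of how the attenuation coefficients act across depth, the recurrence for $S_i$ above can be unrolled. The sketch below assumes a zero initial state, $S_0 = 0$, which is an assumption made here for illustration and is not stated above:
\begin{equation*}
S_i = \sum_{j=1}^{i}\Big(\prod_{l=j+1}^{i} r_l\Big) K_j^T V_j, \qquad
x_i = Q_i S_i = \sum_{j=1}^{i}\Big(\prod_{l=j+1}^{i} r_l\Big) Q_i K_j^T V_j,
\end{equation*}
where an empty product equals $1$. For example, $S_2 = r_2 K_1^T V_1 + K_2^T V_2$. Under this assumption, when $0 < r_l < 1$ for all $l$, the contribution of block $j$ to block $i$ shrinks as the depth gap $i-j$ grows.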

\noindent\textbf{Enhancing computational efficiency in MSR.}
\label{improve_speed}
In addition, we find that many weights in the softmax function are applied to abnormal or null values, which results in unnecessary computational costs. Therefore, we replace the softmax function in the stacked recurrent structure with learnable matrices based on a relative position encoding~\citep{sun2022length}, which reduces the incurred time and memory costs.

\input{algori/a1}

Formally, we construct the layer as shown in Algorithm~\ref{algo_interaction_block}, where $P_Z$ and $P_U$ are learnable matrices used to replace the softmax function, and ``GroupNorm''~\citep{shoeybi2019megatron} normalizes the output of each block.
In summary, the different depths of the D-SR and S-SR streams lead the two streams to learn markedly different feature interactions of arbitrary order, enabling the model to grasp the global and local dependencies at various abstraction levels.















\subsection{Spurious correlation elimination (SCE)}
\label{module4}
Having established distinct feature spaces with the MSR, our next objective is to eliminate the spurious correlations concealed within them.
We concatenate the two stream outputs $F_d$ and $F_s$ to form a local feature map $\text{FM}$.



\noindent \textbf{Statistical independence assessment.}
To eliminate spurious correlations, we use sample reweighting to make the spurious correlations independent of the user click behavior prediction task $y_i$. The underlying theory is elaborated upon in Section~\ref{sample reweighting} and the appendix. 
Specifically, inspired by the kernel function in a support vector machine (SVM), we map the features into a high-dimensional space with a Laplacian kernel. In the high-dimensional space, we eliminate the spurious correlations between features with nonlinear operations corresponding to the original feature space. We use $X$ and $Y$ to denote the features, and the SCE module measures the dependence between them. 
The theory regarding why SCE measures dependence is also given in the appendix. 

\begin{equation}
\text{SCE}(X,Y)=||K_X-K_Y||_{\text{FN}}, \label{eq10}
\end{equation}
where $K_X=k_X(X,X)$ and $K_Y=k_Y(Y,Y)$ are Laplacian kernel matrices, $||\cdot||_{\text{FN}}$ is the Frobenius norm (FN) which corresponds to the Hilbert-Schmidt norm in Euclidean space~\citep{strobl2019approximate}, and $k_X$ and $k_Y$ are Laplacian kernels that can capture local patterns and are robust to outliers. 
However, applying the SCE approach with large-scale kernel matrices can be computationally expensive. Therefore, we use random Fourier features (RFFs) to approximate the Laplacian kernel. The reconstructed features can then be obtained in the new representation space, which reduces the time complexity of the network from $O(n^2)$ to $O(n)$.
\begin{equation}
    \begin{split}
    \mathcal{H}_{\text{RFF}} = \{h:x \rightarrow \sqrt{2}\cos{(w x+\phi)}| \\
     w \sim N(0,1),\phi \sim U(0,2\pi)\}, \label{eq12}
    \end{split}
\end{equation}
where $\mathcal{H}_{\text{RFF}}$ denotes the function space of RFF, $w$ is sampled from the standard normal distribution, and $\phi$ is sampled from the uniform distribution on $[0,2\pi]$.
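To see why features of this form can stand in for a kernel, consider two inputs $x$ and $x'$ and a single $h \in \mathcal{H}_{\text{RFF}}$. Using $2\cos A\cos B=\cos(A-B)+\cos(A+B)$, a short derivation (a sketch, with scalar inputs for simplicity) gives
\begin{align*}
\mathbb{E}_{w,\phi}\left[h(x)h(x')\right]
&= \mathbb{E}_{w,\phi}\left[\cos\big(w(x-x')\big)+\cos\big(w(x+x')+2\phi\big)\right]\\
&= \mathbb{E}_{w}\left[\cos\big(w(x-x')\big)\right],
\end{align*}
since the second cosine averages to zero over $\phi \sim U(0,2\pi)$. The expectation depends only on $x-x'$, i.e., it is a shift-invariant kernel whose form is determined by the distribution of $w$. It can be estimated by the Monte Carlo average $\frac{1}{n}\sum_{i=1}^{n} h_i(x)h_i(x')$ over $n$ sampled functions $h_i$, where $n$ is an illustrative symbol for the number of sampled functions.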

\input{algori/a2}


We define the partial cross-covariance matrix as the measure of covariance between the two sets of features:
\begin{align}
\text{SCE}_{\text{RFF}}(X,Y) &=||C_{p(X),q(Y)}||_{\text{FN}}^2\\
&= \sum_{i=1}^{n_X} \sum_{j=1}^{n_Y} |\text{Cov}(p_i(X),q_j(Y))|^2, \label{eq14}
\end{align}
where $X$ and $Y$ represent features, $p$ and $q$ are random Fourier feature mapping functions, ${n_X}$ and ${n_Y}$ are the numbers of functions sampled from $\mathcal{H}_{\text{RFF}}$, $||\cdot||_{\text{FN}}$ is the Frobenius norm (FN), and $C_{p(X),q(Y)}\in \mathbb{R}^{n_X\times n_Y}$ is the cross-covariance matrix of the random Fourier features $p(X)$ and $q(Y)$, which are defined as
\begin{align}
p(X) &= (p_1(X),...,p_{n_X}(X)), p_i(X) \in \mathcal{H}_{\text{RFF}}, \forall i,\\ 
q(Y) &= (q_1(Y),...,q_{n_Y}(Y)), q_j(Y) \in \mathcal{H}_{\text{RFF}}, \forall j. 
\end{align}
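In practice, each covariance entry must be estimated from data. One common choice, stated here as an assumption for illustration rather than as the estimator prescribed by the method, is the unbiased sample covariance over $N$ samples $(X^{(k)},Y^{(k)})$, where the superscript notation is introduced only for this example:
\begin{equation*}
\widehat{\text{Cov}}(p_i(X),q_j(Y)) = \frac{1}{N-1}\sum_{k=1}^{N}\Big(p_i(X^{(k)})-\bar{p}_i\Big)\Big(q_j(Y^{(k)})-\bar{q}_j\Big),
\end{equation*}
where $\bar{p}_i=\frac{1}{N}\sum_{k=1}^{N} p_i(X^{(k)})$ and $\bar{q}_j=\frac{1}{N}\sum_{k=1}^{N} q_j(Y^{(k)})$. Forming all $n_X n_Y$ entries in this way costs $O(N n_X n_Y)$, which is linear in the number of samples.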
\noindent \textbf{Global sample weight optimization.} We next describe how the global features and global sample weights are optimized to enhance the performance of the model. The weights are updated iteratively, and each update combines the local information of the current iteration with the global information accumulated over previous iterations, which improves the feature representations and their integration into the model.
By minimizing $\text{SCE}_{\text{RFF}}^w$, we encourage feature independence, resulting in a more causal and consistent covariate matrix. For any two features $X_{:,a},X_{:,b}\in X$ and weights $w$, the weighted spurious correlation elimination measure is defined as

\begin{multline}
\text{SCE}_{\text{RFF}}^w (X_{:,a},X_{:,b},w)\\
= \sum_{i=1}^{n_X} \sum_{j=1}^{n_Y} |\text{Cov}(p_i(w^T X_{:,a}),q_j(w^T X_{:,b}))|^2. \label{eq15}
\end{multline}

In each iteration, our objective is to minimize the sum of Eq.~(\ref{eq15}) over all feature pairs. 
Therefore, the resulting weight is

\begin{align}
w_\text{result} &=\mathop{\arg\min}_{w} \sum_{1\leq a \leq b \leq m}\text{SCE}_{\text{RFF}}^w (X_{:,a},X_{:,b},w) \label{eq16} \\ 
&= \mathop{\arg\min}_w \sum_{1\leq a \leq b \leq m} \sum_{i=1}^{n_X} \sum_{j=1}^{n_Y} |\text{Cov}(u_i,v_j)|^2, \label{eq17}
\end{align}
where $u_i=p_i(w^T X_{:,a})$, $v_j=q_j(w^T X_{:,b})$, and $m$ is the number of feature columns in $X$.
To avoid local optima, we incorporate global information. We employ a memory module consisting of $\text{GFI}$ and $\text{GWI}$. $\text{GFI}$ captures global feature information, while $\text{GWI}$ stores global weight information. 
The features and weights are used to optimize the new sample weights.
\begin{equation}
\text{GFI}_i = \text{Concat}(\text{GFI}_{i-1},\text{FM}_i), \label{eq18}
\end{equation}
\begin{equation}
\text{GWI}_i = \text{Concat}(\text{GWI}_{i-1},w_i), \label{eq19}
\end{equation}
where $\text{GFI}_1=\text{FM}_1$, $\text{GWI}_1=w_1$, $\text{FM}_i$ denotes the local feature information, $w_i$ is the local weight information, and $i$ denotes the iteration index.
We globally update the features and weights as follows:
\begin{equation}
\text{GFI}_{i}' = \frac{1}{2}(\text{GFI}_{i-1}'+ \text{FM}_i)\label{updatef},
\end{equation}
\begin{equation}
\text{GWI}_{i}' =\frac{1}{2}( \text{GWI}_{i-1}'+w_i)\label{updatew},
\end{equation}
where $\text{GFI}_{1}'=\frac{1}{2}\text{FM}_1$ and $\text{GWI}_{1}'=\frac{1}{2}w_1$.
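Unrolling Eqs.~(\ref{updatef}) and (\ref{updatew}) makes the effect of this update explicit. Starting from the stated initial values, one obtains
\begin{equation*}
\text{GFI}_{i}' = \sum_{t=1}^{i} 2^{-(i-t+1)}\,\text{FM}_t, \qquad
\text{GWI}_{i}' = \sum_{t=1}^{i} 2^{-(i-t+1)}\,w_t,
\end{equation*}
so the global memory is an exponentially decaying weighted sum in which the most recent iteration receives weight $\frac{1}{2}$, the previous one $\frac{1}{4}$, and so on. For instance, $\text{GFI}_{2}'=\frac{1}{4}\text{FM}_1+\frac{1}{2}\text{FM}_2$.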

\subsection{CTR predictor}
\label{module5}
We divide $F_d$ and $F_s$ into $k$ chunks each, denoted as $F_d = [F_{d_1},...,F_{d_k}]$ and $F_s = [F_{s_1},...,F_{s_k}]$, where $k$ is a hyperparameter. $F_{d_j}$ and $F_{s_j}$ denote the $j$-th chunk features of the D-SR and S-SR outputs, respectively. 
We apply the CTR predictor (CP) to each paired chunk group consisting of $F_{d_j}$ and $F_{s_j}$. The chunk computations are then aggregated using sum pooling to derive the final predicted probability:
\begin{equation}
\text{CP}(F_{d_j},F_{s_j})=b+w_{d_j}^TF_{d_j}+w_{s_j}^TF_{s_j}+F_{d_j}^TW_jF_{s_j}, \label{22}
\end{equation}
\begin{equation}
\hat y = \sigma\left(\sum_{j=1}^{k}\text{CP}(F_{d_j},F_{s_j})\right), \label{23}
\end{equation}
where $b$ is a bias term, $w_{d_j}$, $w_{s_j}$, and $W_j$ are learnable weights, and $\sigma$ is the sigmoid activation function.
Because $F_{d_j}$ and $F_{s_j}$ are hierarchical features produced by the stacked blocks, modeling their second-order interactions through the bilinear term $F_{d_j}^TW_jF_{s_j}$ effectively models arbitrary-order feature interactions.
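As a rough illustration of the effect of chunking on the bilinear term, suppose that $F_d$ and $F_s$ both have dimension $D$ and that $k$ divides $D$ (both are assumptions for this example, and $D$ is not a symbol defined above). Each $W_j$ is then a $\frac{D}{k}\times\frac{D}{k}$ matrix, so the bilinear terms contain
\begin{equation*}
k\cdot\Big(\frac{D}{k}\Big)^2=\frac{D^2}{k}
\end{equation*}
parameters in total, compared with $D^2$ for a single unchunked bilinear form. With the hypothetical values $D=64$ and $k=4$, this is $4\cdot 16^2=1024$ parameters instead of $4096$.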

\subsection{Loss function}
\label{module6}
We introduce a reweighting scheme for each sample within the commonly utilized binary cross-entropy loss. This involves applying a distinct weight to the loss of each sample during training.
\begin{equation}
L = -\sum_{i=1}^{N} w_{\text{result}_i}(y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i)), \label{eq25}
\end{equation}
where $N$ is the number of examples, $y_i$ is the true label of instance $i$, and $\hat{y}_i$ is the predicted probability of a click. During training, the weights $w_\text{result}=[w_{\text{result}_1},w_{\text{result}_2},\ldots,w_{\text{result}_N}]$ obtained from the sample reweighting mechanism scale the $N$ per-sample LogLoss values, and gradient descent is performed on the resulting weighted loss.
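To see how the weights enter the optimization, write $\hat{y}_i=\sigma(z_i)$, where $z_i$ denotes the pre-sigmoid output in Eq.~(\ref{23}) for sample $i$ (the symbol $z_i$ is introduced here only for illustration). Assuming that the weights are held fixed while the predictor is updated, differentiating Eq.~(\ref{eq25}) gives
\begin{equation*}
\frac{\partial L}{\partial z_i}= w_{\text{result}_i}\,(\hat{y}_i-y_i),
\end{equation*}
so the contribution of each sample to the gradient is the usual cross-entropy residual scaled by its weight.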

\input{img/theory}