\section{Introduction}
\input{img/fig0}
Click-through rate (CTR) prediction is central to recommendation systems~\citep{fan2023personalized,ye2022future,rendle2012bpr} and online advertising. The objective of this task is to estimate the likelihood that a user clicks on a recommended item or an advertisement displayed on a web page. Developing more effective methods for modeling user and item features has become a crucial research challenge in the CTR prediction field.


To learn comprehensive feature interactions, various models have been developed. On the one hand, several methods~\citep{xiao2017attentional,rendle2010factorization,pan2018field,sun2021fm2} learn low-order feature interactions to keep time and space complexity low. However, relying solely on first-order and second-order feature interactions limits performance~\citep{juan2016field}. On the other hand, models such as AutoInt~\citep{song2019autoint} and SAM~\citep{cheng2021looking} employ self-attention mechanisms to capture high-order feature interactions, thereby enhancing their representation capabilities.
%
To incorporate richer high-order features, two-stream CTR models have been developed; these models use two parallel networks to capture feature interactions from distinct perspectives (e.g., explicit vs.\ implicit). For instance, DCN~\citep{wang2017deep} automates the feature-crossing process through linear and nonlinear feature learning streams. Building upon DCN, GDCN~\citep{wang2023towards} adds a soft gate to the linear feature learning stream to filter important features. In contrast to DCN and GDCN, FinalMLP~\citep{mao2023finalmlp} uses two nonlinear feature learning streams, implemented as multilayer perceptrons (MLPs) with different parameters, to implicitly encode the features at distinct levels.




Although existing two-stream methods achieve strong performance by diversifying the features through multiple feature interaction streams, there is still substantial room for improvement in implicitly constructing hierarchical features within each stream (see Section~\ref{ablation_exp} for experimental validation).
Previous works~\citep{9506870,pascanu2013construct} have demonstrated that stacked recurrent structures can learn more intricate representations than feed-forward structures such as MLPs, allowing each layer to potentially represent a different level of abstraction. By retaining the previous state in the recurrent structure~\citep{quadrana2017personalizing}, a greater number of stacks yields better-integrated global high-level representations (e.g., the coarse-grained common patterns of user preferences for recommended items), whereas fewer stacks yield more feature combinations with local details (e.g., the fine-grained associations between users and items).
To better distinguish the hierarchical levels of the features in each stream, we propose a two-stream multilevel stacked recurrent (MSR) structure that leverages the capacity of the network to capture both global and local features. Building on this stacked recurrent structure, we further introduce an acceleration module that substantially improves the inference efficiency of the proposed approach.




Furthermore, although the two-stream structure can extract richer features, as illustrated in Figure~\ref{fig0} (rows 1 and 2), it is still hindered by the presence of ``spurious correlations'', i.e., statistical relationships between two or more variables that appear to be causal but are not. These spurious correlations inherently arise from the subtle connections between noisy features and causal features~\citep{li2022causal}, and they reduce the model's generalization ability~\citep{lu2021invariant}.
For example, in movie recommendation tasks, trending films may receive more clicks simply because they are placed more prominently. However, users' actual preferences might not align with the genres or contents of these trending movies. This creates a spurious correlation between the popularity of certain film types and the genres that users truly appreciate, with movie placement acting as the confounding factor. Recent studies~\citep{mao2023finalmlp,wang2021dcn,guo2017deepfm} have confirmed through parameter analyses that model performance decreases when the interaction order exceeds a certain depth, typically three orders~\citep{wang2023towards}; one of the crucial reasons is that the spurious correlation issue is exacerbated. Previous works~\citep{wang2021dcn,wang2023towards} have used gating mechanisms to assign varying levels of importance to different features. However, without a supporting causal theory~\citep{guan2023knowledge}, such feature selection methods fail to identify the true causal features and instead favor features that arise from spurious correlations. LightDIL~\citep{zhang2023reformulating} divides historical data chronologically into multiple periods, which form a set of environments, and learns stable feature interactions within these environments; however, its effectiveness diminishes when users have limited click histories.

As shown in Figure~\ref{fig0} (row 3), to address the spurious correlation issue, we employ the Laplacian kernel function to project low-dimensional feature interactions into a high-dimensional space, so that a linear transformation in the high-dimensional space corresponds to a nonlinear transformation in the low-dimensional space. We then apply a sample reweighting strategy that learns a weight for each training instance to eliminate spurious correlations. The detailed theory is explained in Section~\ref{sample reweighting}.
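For reference, the kernel mapping mentioned above can be written in a standard form. The notation here is an assumption made for illustration and is not taken from the later sections: $\mathbf{x}$ and $\mathbf{x}'$ denote two feature-interaction vectors, $\sigma > 0$ is a bandwidth parameter, and $\phi$ is the implicit feature map into a high-dimensional space $\mathcal{H}$. A common definition of the Laplacian kernel is
\[
k(\mathbf{x}, \mathbf{x}') = \exp\!\left(-\frac{\|\mathbf{x} - \mathbf{x}'\|_{1}}{\sigma}\right) = \langle \phi(\mathbf{x}), \phi(\mathbf{x}') \rangle_{\mathcal{H}},
\]
so inner products in $\mathcal{H}$ can be evaluated without constructing $\phi$ explicitly.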
By combining the MSR structure with the spurious correlation elimination (SCE) procedure above, the network can remove the spurious correlations concealed within the hierarchical feature spaces, which improves CTR prediction. The contributions are summarized as follows:
\begin{itemize}
    \item We propose RE-SORT, a CTR prediction framework that removes spurious correlations in multilevel feature interactions; this approach leverages the hierarchical causal relationships between items and users to fundamentally enhance the model's generalization ability.
    \item We propose a multilevel stacked recurrent (MSR) structure, which efficiently builds diverse feature spaces to obtain a wide range of multilevel high-order feature representations.
    \item We introduce a spurious correlation elimination (SCE) module, which utilizes Laplacian kernel mapping and sample reweighting methods to eliminate the spurious correlations hidden in multilevel feature spaces. 
    \item The results of extensive experiments conducted on four challenging CTR datasets and our production dataset demonstrate that the proposed RE-SORT achieves state-of-the-art (SOTA) performance in terms of both accuracy and speed. 
\end{itemize}