\textbf{Datasets. } 
We use four publicly available datasets as shown in Table~\ref{tab:data_stats}: \textbf{CiteULike-a}\footnote{http://wanghao.in/CDL.htm} contains interactions between users and scientific articles; 
\textbf{MovieLens}\footnote{https://grouplens.org/datasets/movielens/} includes users' ratings on movies; 
\textbf{Cell Phones} and \textbf{Video Games} contain users' reviews from the Cell Phones \& Accessories and Video Games categories of the \emph{Amazon dataset}\footnote{https://nijianmo.github.io/amazon/index.html}. 
% We present data preprocessing in appendix.
\input{table/data_stats/data_stats}

For the CiteULike-a, Cell Phones and Video Games datasets, we use the accompanying textual content, i.e., title \& abstract for CiteULike-a and item descriptions for the Amazon categories. For Cell Phones, we retain users with at least $8$ interactions and items with at least $5$ interactions; for Video Games, both thresholds are $5$. For MovieLens, we follow \cite{MDCVAE:2022} to extract a subset of users from the ML-10M version. We keep user ratings larger than $3$ as interactions \cite{macridvae:2019} and collect item textual content from IMDb\footnote{https://datasets.imdbws.com/}. For all datasets, we remove stop words, keep only words with frequency higher than $3$ that appear in less than $60\%$ of item texts, and retain the $8k$ most frequent words as in \cite{MDCVAE:2022}. These steps help ensure that even short or noisy item descriptions contribute meaningful information. Moreover, the same steps are applied to all baselines, ensuring a fair comparison. We deliberately keep the pre-processing simple so that the performance gain can be attributed to our proposed alignment mechanism. More advanced text-cleaning methods could potentially further improve our framework.\looseness=-1

We adopt the \emph{strong generalization} setting as in \cite{macridvae:2019} to construct the training, validation and test sets, randomly choosing $80\%$ of users for training and $10\%$ of users for each of the validation and test sets. For the validation and test sets, $20\%$ of each user's interactions are held out as the ground truth. To preserve dataset quality, we only retain items with at least $5$ words in their textual content so that the text carries semantic information. All cold-start items, i.e., those that do not appear in the training set, are discarded since no parameters are associated with them, following common practice in the field.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Baselines. } We compare \ourmethod\ against state-of-the-art models, including
% capable of making recommendations for unseen users in strong generalization setting. 
models that only utilize ratings (\textbf{MacridVAE} \cite{macridvae:2019}, \textbf{RecVAE} \cite{RecVAE:2020}, \textbf{ELSA} \cite{ELSA:2022}, \textbf{VALID} \cite{VALID:2023} and \textbf{FacetVAE} \cite{FacetVAE:2024}) and models that use both ratings and texts (\textbf{MDCVAE} \cite{MDCVAE:2022}, \textbf{TopicVAE} \cite{TopicVAE:2022}, \textbf{ADDVAE} \cite{ADDVAE:2022} and \textbf{SEM-MacridVAE} \cite{SemMacridVAE:2023}). Among these, \textbf{RecVAE}, \textbf{ELSA} and \textbf{MDCVAE} are single-interest models, while \textbf{MacridVAE}, \textbf{TopicVAE}, \textbf{ADDVAE}, \textbf{SEM-MacridVAE}, \textbf{VALID} and \textbf{FacetVAE} are multi-interest models. 
\begin{itemize}[leftmargin=*]
    \item \textbf{MacridVAE} \cite{macridvae:2019} introduces macro- and micro-disentanglement of user preferences via multi-prototype representation and independence regularization.\looseness=-1
    \item \textbf{RecVAE} \cite{RecVAE:2020} combines a composite prior, a rescaled regularization term and alternating training in a VAE-based recommendation model. 
    \item \textbf{MDCVAE} \cite{MDCVAE:2022} regularizes the decoder weights of the user-oriented autoencoder with latent embeddings inferred from textual content.
    \item \textbf{TopicVAE} \cite{TopicVAE:2022} improves the disentanglement of user preferences through attention-based topic extraction from textual content, a topic-guided contrastive loss and a heuristic method for setting the value of the regularization term.
    \item \textbf{ADDVAE} \cite{ADDVAE:2022} leverages two disentangled networks to model users' ratings and user-associated texts, then aligns the disentangled factors from these two modalities using compositional de-attention and regularization.
    \item \textbf{ELSA} \cite{ELSA:2022} improves the state-of-the-art linear autoencoder by factorizing the hidden space into a low-rank plus sparse structure.\looseness=-1
    \item \textbf{SEM-MacridVAE} \cite{SemMacridVAE:2023} exploits semantic knowledge from side information to improve VAE-based disentangled recommendation models. We use the tf-idf item-word matrix, i.e., $\textbf{W} = \{\textbf{w}^i\}_{i=1}^N$, as side information for fair comparison.
    \item \textbf{VALID} \cite{VALID:2023} improves VAE-based disentanglement of user interests via iterative latent attention and implicit differentiation.\looseness=-1
    \item \textbf{FacetVAE} \cite{FacetVAE:2024} disentangles the multi-faceted item space and derives compositional user interests via bi-directional binding.\looseness=-1
\end{itemize}


We follow the strong generalization setting in \cite{macridvae:2019}, i.e., the validation and test sets contain unseen users. Thus, we only include baselines capable of predicting interactions for unseen users. Although SLIM, EASE and SimpleX are also capable of this, they have already been outperformed by RecVAE, ELSA and VALID. We therefore retain only state-of-the-art models as our baselines.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \textbf{Implementation. } We train \ourmethod\ via Adam optimizer and learning rate $\mathrm{0.001}$ on NVIDIA RTX 2080 Ti GPU machine. All model uses embedding size $d = 100$ for fair comparison. The default settings are $K = 4$ and $J = 4$. (More values of $K$ and $J$ are analyzed in subsequent sections. Architecture of fusion network $\zeta: 2d \rightarrow d/2 \rightarrow 1$. Training stops after $30$ epochs without improving performance on validation set. 
% % The final hyper-parameters of \ourmethod\ can be found in the accompanying code.
% We report Recall and NDCG at top 10 and 50 with full-ranking strategy \cite{recsys_eval_setting:2020}, i.e., test item is ranked against all items to avoid sampling bias. 
\textbf{Implementation. }
For all models, we choose hyper-parameters based on performance on the validation set. We then retrain each model and report its performance on the test set, averaged over ten runs. 
For baselines, we follow their original papers and perform grid search over the ranges described therein. 
For \ourmethod, the default settings are as follows: $D = 300$ for MovieLens and Cell Phones and $D = 600$ for CiteULike-a and Video Games, after tuning over \{100, 200, 300, 500, 600\}; embedding size $d = 100$ for all datasets; the dropout rate applied to $\textbf{A}^{uy}$ and $\textbf{A}^{ut}$ is $0.5$; the numbers of rating and text factors are $K = 4$ and $J = 4$, respectively (more values of $K$ and $J$ are analyzed in subsequent sections); $\beta^y$ and $\beta^t$ follow the annealing schedule $\min(\beta_0, \frac{\mathit{update}}{T})$, where $\beta_0 = 1$ for the rating channel and $\beta_0 = 0.2$ for the text channel, $T$ is chosen from $\{1k, 5k, 10k, 20k\}$, and $\mathit{update}$ is the number of parameter updates; $\sigma^y$ and $\sigma^t$ are chosen from $\{0.05, 0.075, 0.1\}$; the search space of $\lambda_t$ and $\lambda_r$ is $\{0.1, 0.2, 0.5, 1, 2, 5\}$; and $\epsilon \in \{0.2, 0.5, 1\}$ in the Sinkhorn algorithm. The architecture of the fusion network $\zeta$ is $2d \rightarrow d/2 \rightarrow 1$. The number of prototype update steps $L^y$ in the rating encoder is chosen from $\{2, 3, 4\}$, while $L^t = 1$. We train \ourmethod\ using the Adam optimizer with learning rate $0.001$ on a machine with an NVIDIA RTX 2080 Ti GPU. Training stops after $30$ epochs without improvement on the validation set. 
We report Recall and NDCG at top 10 and 50 with the full-ranking strategy \cite{recsys_eval_setting:2020}, i.e., each test item is ranked against all items to avoid sampling bias.
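
As a worked illustration of the annealing schedule above, take $T = 10k$, one of the candidate values (chosen here only as an example, not as the tuned value for any dataset). The schedule grows linearly with the number of parameter updates and reaches its cap when $\mathit{update} = \beta_0 T$:
\[
\beta^t = \min\left(0.2, \frac{\mathit{update}}{10{,}000}\right) = 0.2 \quad \mathrm{for} \quad \mathit{update} \geq 2{,}000,
\qquad
\beta^y = \min\left(1, \frac{\mathit{update}}{10{,}000}\right) = 1 \quad \mathrm{for} \quad \mathit{update} \geq 10{,}000 .
\]
Hence, if both channels share the same $T$, the text channel stops annealing five times earlier than the rating channel and keeps a weaker regularization weight afterwards.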
% \emph{For review purpose, we release code and data in the anonymous link \url{https://tinyurl.com/36xtub32}}. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \emph{Due to limited space, we present details of datasets, baselines and hyper-parameters in the supplementary material.}\looseness=-1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{table/recom_result/recom_result}
\subsection{Recommendation Performance}\label{sec:recom_result}
Table \ref{tab:recom_result} reports the recommendation performance. 
\emph{First}, \ourmethod\ achieves significantly higher accuracy than the baselines that use textual content for multi-interest modeling (TopicVAE, SEM-MacridVAE and ADDVAE) on CiteULike-a, Cell Phones and Video Games, demonstrating the advantage of its optimal transport-based alignment and fusion. 
\emph{Second}, \ourmethod\ also outperforms multi-interest models that do not use textual content (MacridVAE, VALID and FacetVAE), underscoring the value of aligning rating and text factors. Additionally, \ourmethod\ surpasses the single-interest models (MDCVAE, RecVAE and ELSA), highlighting the importance of capturing multiple interests. 
\emph{Third}, on MovieLens, \ourmethod\ performs consistently across metrics, whereas some baselines excel only on specific metrics. For example, RecVAE's composite prior aids Recall@10, but \ourmethod\ achieves notably higher Recall@50 and NDCG@50. SEM-MacridVAE and TopicVAE learn item representations from texts and attain accuracy comparable to \ourmethod\ only on the top-10 metrics; on the top-50 metrics, \ourmethod\ is clearly better than both.\looseness=-1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Model Analysis}\label{sec:model_analysis}
We conduct experiments to gain insights into \ourmethod's inner workings. Additional ablation studies are presented in the appendix.
%%%%%%%%%%%%%%%%%%%%
\input{figures/alignment_probs/alignment_probs}
\textbf{Alignment method. } 
In Figure \ref{fig:align_transfer}, we compare three alternatives for deriving the alignment matrix $\pi^u$.\looseness=-1
\begin{itemize}[leftmargin=*]
    \item \emph{Sinkhorn (Sink)} is the Sinkhorn algorithm described in Algorithm~\ref{alg:sinkhorn_alg}.
    \item \emph{Normalization (Norm)} generates $\pi^u$ by normalizing the negative squared distances between disentangled factors from the two modalities, i.e., $\pi^u_{kj} = \frac{\exp(-\|\textbf{z}^{uy}_k - \textbf{z}^{ut}_j\|^2_2 / \epsilon)}{\sum_{k=1}^K\sum_{j=1}^J \exp(-\|\textbf{z}^{uy}_k - \textbf{z}^{ut}_j\|^2_2 / \epsilon)}$.
    \item \emph{Diagonal (Diag)} assumes that the $k^{\mathrm{th}}$ rating factor is aligned with the $k^{\mathrm{th}}$ text factor, i.e., $\pi^u_{kj} = 1 / K$ if $k = j$, otherwise $\pi^u_{kj} = 0$. This approach is only applicable when $K = J$.\looseness=-1
\end{itemize}
First, Sinkhorn outperforms normalization across all datasets, as it converges to the entropy-regularized optimal transport solution \cite{computational_OT:2019}, avoiding skewed alignment matrices that over-concentrate probability mass on highly similar pairs. Second, normalization generally matches or exceeds the accuracy of the diagonal approach, except for NDCG@10 on Cell Phones and Video Games, highlighting the importance of capturing pairwise alignments. The diagonal approach's rigid one-to-one assumption leads to suboptimal performance, underscoring the value of probabilistic alignment.
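
To illustrate the skew that normalization can introduce, consider a toy case with $K = J = 2$ and $\epsilon = 1$ (values chosen purely for illustration, not taken from our experiments). Let $c_{kj} = \|\textbf{z}^{uy}_k - \textbf{z}^{ut}_j\|^2_2$ and suppose $c_{11} = c_{12} = 0$ and $c_{21} = c_{22} = 2$, i.e., the first rating factor is close to both text factors while the second is far from both. Normalization then gives
\[
\pi^u_{11} = \pi^u_{12} = \frac{1}{2 + 2e^{-2}} \approx 0.44,
\qquad
\pi^u_{21} = \pi^u_{22} = \frac{e^{-2}}{2 + 2e^{-2}} \approx 0.06,
\]
so the first rating factor receives about $88\%$ of the total mass. If we instead assume uniform marginals $1/K$ and $1/J$ for the transport plan (an assumption made here only for illustration; the marginals actually used are those specified in Algorithm~\ref{alg:sinkhorn_alg}), every row of a feasible plan must sum to $1/2$, and in this example the entropy-regularized solution is $\pi^u_{kj} = 1/4$ for all $k, j$. The marginal constraints thus prevent a single well-matched factor from absorbing most of the alignment mass.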

%%%%%%%%%%%%%%%%%%%%
\textbf{Interest transfer method.} Figure \ref{fig:align_transfer} reports recommendation accuracy w.r.t. three interest transfer methods.
\begin{itemize}[leftmargin=*]
    \item \emph{Combination (Com)} includes both regularization and mapping \& fusing inside \ourmethod.
    \item \emph{Mapping \& Fusing (Map)} only includes mapping and fusing for interest transfer (no regularization). 
    \item \emph{Regularization (Reg)} only involves regularization for interest transfer (no mapping and fusing).
\end{itemize}
% First, \emph{regularization} and \emph{mapping \& fusing} complement each other to boost \ourmethod's performance, explaining why \emph{combination} achieves higher accuracy than its alternatives.
% Second, \emph{mapping \& fusing} has a stronger effect than \emph{regularization}, demonstrating the significance of 
% mutually transferring supervision signals between ratings and texts. Third, \ourmethod's accuracy reduces when excluding \emph{regularization}-based interest transfer, which confirms its helpfulness. 
First, regularization and mapping \& fusing complement each other, with their combination achieving higher accuracy than either alone. Second, mapping \& fusing has a stronger impact than regularization, highlighting the importance of bidirectional interest transfer between ratings and texts. Third, excluding regularization reduces \ourmethod’s accuracy, confirming its role in enhancing performance.

%%%%%%%%%%%%%%%%%%%%
% \input{table/ablation/fusion_lambda_t}
\input{figures/fusion_method/fusion_method}
\textbf{Fusion method.} 
% A powerful fusion layer would produce expressive representation from two modalities, subsequently boosting recommendation accuracy. 
Figure \ref{fig:fusion_lambda_text} compares our \emph{adaptive fusion} method against \emph{Mean-T} employed in \cite{ADDVAE:2022}.\looseness=-1
\begin{itemize}[leftmargin=*]
    \item \emph{Adaptive} learns the adaptive fusion weight $\rho^{uy}$ (and $\rho^{ut}$) for each user $u$ as in Equations~\ref{eqn:fuse_rating} and~\ref{eqn:fuse_text}.
    % \item \emph{Mean} takes the average of rating (text) factors and their corresponding transformed version in text (rating) space, i.e., $\tilde{\textbf{z}}^{uy}_k = \frac{1}{2}(\textbf{z}^{uy}_k + \hat{\textbf{z}}^{ut}_k)$ and $\tilde{\textbf{z}}^{ut}_j = \frac{1}{2}(\textbf{z}^{ut}_j + \hat{\textbf{z}}^{uy}_j)$.
    \item \emph{Mean-T} (only applicable when $K = J$) computes the average\footnote{In \cite{ADDVAE:2022}, \emph{sum} is used. We empirically found that \emph{sum} and \emph{mean} lead to similar results.} of the two \emph{transformed versions} of the interest factors and shares the resulting factors across both channels, i.e., $\tilde{\textbf{z}}^{uy}_k = \frac{1}{2}(\hat{\textbf{z}}^{uy}_k + \hat{\textbf{z}}^{ut}_k) = \tilde{\textbf{z}}^{ut}_k$ (with $k$ in place of $j$ for text factors). 
\end{itemize}
% Evidently, our adaptive fusion outperforms Mean-T.
% First, learning personalized fusion weights is beneficial as each user’s decisions vary between ratings and texts and thus, adopting equal weights like in Mean-T is less effective. Second, sharing interest representations for ratings and texts, as in Mean-T, is overly restrictive, while \ourmethod’s fusion method offers a greater flexibility in capturing user preferences from both modalities.
Our adaptive fusion outperforms Mean-T, demonstrating two key advantages. First, personalized fusion weights better capture user-specific preferences, as equal weights (Mean-T) fail to account for variability in how users weigh ratings versus texts. Second, Mean-T’s shared representations are overly restrictive, while \ourmethod’s flexible fusion effectively models preferences across modalities.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \input{figures/epsilon/epsilon}
\input{table/epsilon/epsilon}
\textbf{Effect of $\epsilon$ in Algorithm \ref{alg:sinkhorn_alg}. } Table \ref{tab:epsilon_reg} shows the results. The effect of $\epsilon$ is data-dependent. 
On CiteULike-a, Cell Phones and MovieLens, $\epsilon \geq 0.1$ leads to higher recommendation accuracy than smaller values ($\epsilon < 0.1$).
On Video Games, in contrast, $\epsilon < 0.1$ generally results in higher accuracy.
These observations imply that $\epsilon$ should be tuned separately for each dataset.

% $\epsilon$ controls the sparsity of $\pi^u$, i.e., small $\epsilon$ results in highly skewed distribution in $\pi^u$ while large $\epsilon$ leads to roughly uniform distribution in $\pi^u$. 
% Thus, $\epsilon$ also controls the \emph{interpretability} of $\pi^u$. 
% That is, small $\epsilon$ generates a nearly one-to-one mapping between rating and text factors and thus, we can explain one rating factor by the matched text factor. Contrarily, large $\epsilon$ produces roughly one-to-many mapping between rating and text factors, which make it difficult when one would like to explain a rating factor via textual content. Figure \ref{fig:epsilon} shows the results. We observe that the effect of $\epsilon$ is data-dependent. 
% On CiteULike-a, Cell Phones and MovieLens, moderate $\epsilon$, i.e., $\epsilon \geq 0.2$, leads to higher recommendation accuracy than smaller ones, i.e., $\epsilon < 0.2$, indicating that there is a trade-off between recommendation and interpretability as the later one favors small $\epsilon$. Pertaining to Video Games, $\epsilon < 0.2$ generally results in higher accuracy, indicating the correlation between interpretability and accuracy. 
% As $\epsilon$'s value is data-dependent, it requires careful analysis to achieve good performance. 
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \input{figures/case_study/case_study_epsilon}
% \textbf{Effect of $\epsilon$ on alignment matrix $\pi^u$}
% Theoretically, small $\epsilon$ results in sparse $\pi^u$ while large $\epsilon$ leads to roughly uniform $\pi^u$. 
% % Note that sparse alignment is favorable to interpretability as it mimics approximately one-to-one mapping between rating and text factors. 
% To verify, we visualize the alignment probabilities produced by our model \ourmethod\ w.r.t. various $\epsilon$ in Figure ~\ref{fig:epsilon_case_study}. Obviously, small $\epsilon$ results in staggered pattern in alignment distribution while large $\epsilon$ leads to roughly uniform distribution. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \input{supplementary/num_factors/num_factors}
% \input{table/raing_factor/rating_factor}
% \input{table/text_factor/text_factor}
% \textbf{Analysis on the Number of Interest Factors. }\label{sec:num_factors}
% We report recommendation accuracy w.r.t. the numbers of rating factors $K$ and the numbers of text factors $J$ in Table \ref{tab:rating_factor} and Table \ref{tab:text_factor}, respectively. 
% \emph{For rating factors}, setting $K = 3$ or $K = 4$ gives the best accuracy on CiteULike-a while $K \geq 5$ is ideal for Cell Phones. These evidences show that users have varied interests. \ourmethod\ generally performs best on Video Games with $K \geq 4$ while reaches its peak performance on MovieLens with $K = 4$.
% \emph{Pertaining to text factors}, setting $J = 4$ results in the higher accuracy on CiteULike-a and MovieLens. In contrast, \ourmethod's performance on Cell Phones is not highly sensitive to $J$. For Video Games, setting $J \leq 4$ produces better accuracy than larger values.

% It is worth to note that thanks to the pair-wise alignment between interest factors, \ourmethod\ can accommodate users’ distinctive behaviors across modalities, i.e., when $K$ and $J$ differ, while baseline such as \cite{ADDVAE:2022} cannot. As a result, \ourmethod\ offers greater flexibility and is more applicable.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \input{figures/lambda_r/lambda_r}
% \input{figures/lambda_t/lambda_t}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \textbf{Effect of $\lambda_r$. }
% Figure ~\ref{fig:lambda_r} presents \ourmethod's accuracy w.r.t. $\lambda_r$, which controls the effect of regularization term for interest transfer between rating and text factors. First, we observe that setting $\lambda_r$ to $1$ or $0.5$ results in higher accuracy on chosen datasets. Second, the effect of $\lambda_r$ is data-dependent, e.g., while CiteULike-a favors large $\lambda_r$, the remaining datasets requires smaller value, i.e., around $0.5$ and $1$. An excessive value of $\lambda_r$ might cause detrimental effect.
% while $\lambda_r \geq 0.5$ is favorable showing that imposing a regularization term between interest factors from ratings and texts (as in Equation ~\ref{eqn:OT_reg}) works to some degree, $\lambda_r$ should be chosen carefully as an excessive value might cause detrimental effect.\looseness=-1 

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \textbf{Effect of $\lambda_t$. }
% Figure \ref{fig:lambda_t} presents the influence of $\lambda_t$, which controls the effect of text reconstruction objective, on \ourmethod's accuracy. First, setting $\lambda_t > 0$ leads to higher accuracy than setting $\lambda_t = 0$, underscoring the benefit of textual signals. Second, setting $\lambda_t$ to a moderate value, i.e., around $0.5$ or $1$, results in favorable accuracy across datasets.

% First, on each dataset, choosing a proper value of $\lambda_t$ results in higher accuracy than setting $\lambda_t = 0$, showing textual content reconstruction benefits \ourmethod's recommendation accuracy. This further confirms our hypothesis that transferring interest signals between rating and textual modalities is beneficial. Second, the influence of $\lambda_t$ varies across datasets and therefore, $\lambda_t$ should be chosen carefully to obtain higher Recall and NDCG.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \input{table/running_time/running_time}
% \textbf{Efficiency Analysis.} Table \ref{tab:running_time} analyzes the efficiency of \ourmethod\ and two strongest baselines ADDVAE and VALID. For each model, we record the training time per epoch (in second) (averaged over ten runs) and the memory required for training (in GB). There are three key takeaways. First, \ourmethod\ maintains a comparable efficiency level yet achieves higher recommendation accuracy than ADDVAE and VALID. Second, the training time and memory gaps between VALID and \ourmethod\ come from textual content modeling component, i.e., text channel, in \ourmethod\ yet do not appear in VALID. Third, despite both including a textual content modeling module, \ourmethod\ employs multiple prototype updates in rating encoder while ADDVAE does not, which results in difference in efficiency level of these two models. 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \input{figures/prototype_update_rating/prototype_update_rating}
% \input{table/rating_iter/rating_iter}
% \textbf{Effect of $L^y$.}\label{sec:rating_prototype_update}
% Table \ref{tab:rating_iter} shows \ourmethod's recommendation accuracy w.r.t. $L^y$, the number iterations to update rating prototypes in rating encoder $\bm{\mathcal{E}}^y$. Obviously, using more than one prototype update steps leads to higher recommendation accuracy in all chosen datasets, which confirms the effectiveness of updating prototypes in rating encoder. This finding is consistent with \cite{VALID:2023}. Moreover, each data requires a specific value of $L^y$ to achieve favorable accuracy, e.g., $2$ on CiteULike-a, $4$ on Cell Phones and $3$ on Video Games and MovieLens. \looseness=-1

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% % \input{figures/prototype_update_text/prototype_update_text}
% \input{table/text_iter/text_iter}
% \textbf{Effect of $L^t$.}\label{sec:text_prototype_update}
% Table \ref{tab:text_iter} presents the influence of the number iterations $L^t$ to update text prototypes in text encoder $\bm{\mathcal{E}}^t$. The effect of $L^t$ is contrary to that of $L^y$, i.e., using more prototype update steps results in a reduction in \ourmethod's performance. We conjecture that as users' comprehensions of textual content, i.e., words and phrases, are roughly the same, and thus, imposing personalization into word clustering in text encoder via setting $L^t > 1$ causes a detrimental effect. As such, we fix $L^t = 1$ for all datasets through out the paper to maintain both effectiveness and efficiency.
% , i.e., $L^t = 1$ consumes less memory than $L^t > 1$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \input{figures/case_study/pi_u_676}
% \input{figures/case_study/case_study}
% \textbf{Case study. } To better understanding the alignment process in \ourmethod, we visualize an example of alignment matrix $\pi^u$ and decoder outputs of a user on Cell Phones dataset in Figure \ref{fig:case_study} with $K$ and $J$ are both set to $3$ for easier interpretation.
% The alignment matrix reveals a staggered pattern, suggesting a near one-to-one correspondence between interest factors from two modalities. 
% For instance, rating interest factor 3 aligns with text interest factor 1, as both relate to virtual reality devices. Top three items predicted by rating interest factor 3 include VR products, while text interest factor 1 includes relevant terms like \emph{virtual, reality, vr, glasses}.   
% Similarly, rating and text interest factors 2 are matched as both focusing on Samsung Galaxy Note 4 accessories. Rating interest factor 1, relating to Motorola mobile phones, corresponds with text interest factor 3, including relevant terms like \emph{moto, edition, bamboo}. Thus, words predicted by decoder are capable of capturing the semantic of rating interest factors.
% Next, we examine the influence of key hyper-parameters, $\epsilon$ and $\lambda_t$, on the interpretation of alignment matrix and text decoder outputs.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{figures/case_study/case_study_epsilon}
\textbf{Effect of $\epsilon$ on alignment matrix $\pi^u$.}
Theoretically, small $\epsilon$ results in sparse $\pi^u$ while large $\epsilon$ leads to roughly uniform $\pi^u$. 
Figure \ref{fig:epsilon_case_study} shows an illustrative example produced by \ourmethod, which is consistent with this expected influence of $\epsilon$.
% Obviously, small $\epsilon$ results in staggered pattern in alignment distribution while large $\epsilon$ leads to roughly uniform distribution. 
\input{table/case_study/alignment_entropy}
% To further verify, we report the entropy of $\pi^u \in \mathbb{R}^{K \times J}$ in Table \ref{tab:entropy_pi_u}, which is calculated as follows. $\pi^u$ is normalized row-wise so that all elements in each row follow a distribution, i.e., $\ddot{\pi}^u = \frac{\pi^u}{\sum_{j = 1}^J\pi^u_{:j}}$. Similarly, we also perform column-wise normalization as $\overline{\pi}^u = \frac{\pi^u}{\sum_{k = 1}^K\pi^u_{k:}}$. While row-wise normalization produces the distribution of text factors given rating factors, column-wise normalization indicates the distribution of rating factors given text factors. Thus, we take the bidirectional relationship between rating and text interest factors into account. Then, we calculate the entropy as $\mathcal{H}(\ddot{\pi}^u) = \frac{1}{K}\sum_{k=1}^K-\ddot{\pi}^u_{kj}log_2(\ddot{\pi}^u_{kj})$ and $\mathcal{H}(\overline{\pi}^u) = \frac{1}{J}\sum_{k=1}^K-\overline{\pi}^u_{kj}log_2(\overline{\pi}^u_{kj})$. Finally, reported numbers in Table \ref{tab:entropy_pi_u} are averaged over all users $\mathcal{H} = \frac{1}{M}\sum_{u}\frac{1}{2}(\mathcal{H}(\ddot{\pi}^u) + \mathcal{H}(\overline{\pi}^u))$. Evidently, small $\epsilon$ results in lower entropy values, indicating sparser alignment matrices produced by \ourmethod.
To further verify this, we report the entropy of $\pi^u \in \mathbb{R}^{K \times J}$ in Table \ref{tab:entropy_pi_u}. The reported numbers are the average of the row-wise and column-wise entropies of $\pi^u$.
% $\pi^u$ is normalized row-wise to capture rating to text interest distribution, i.e., $\ddot{\pi}^u = \frac{\pi^u}{\sum_{j = 1}^J\pi^u_{:j}}$. Similarly, we also perform column-wise normalization as $\overline{\pi}^u = \frac{\pi^u}{\sum_{k = 1}^K\pi^u_{k:}}$ to capture text to rating interest distribution.
% . While row-wise normalization produces the distribution of text factors given rating factors, column-wise normalization indicates the distribution of rating factors given text factors. Thus, we take the bidirectional relationship between rating and text interest factors into account. 
% Then, we calculate the entropy as $\mathcal{H}(\ddot{\pi}^u) = \frac{1}{K}\sum_{k=1}^K-\ddot{\pi}^u_{kj}log_2(\ddot{\pi}^u_{kj})$ and $\mathcal{H}(\overline{\pi}^u) = \frac{1}{J}\sum_{k=1}^K-\overline{\pi}^u_{kj}log_2(\overline{\pi}^u_{kj})$. Finally, reported numbers in Table \ref{tab:entropy_pi_u} are averaged over all users $\mathcal{H} = \frac{1}{M}\sum_{u}\frac{1}{2}(\mathcal{H}(\ddot{\pi}^u) + \mathcal{H}(\overline{\pi}^u))$. 
Evidently, a smaller $\epsilon$ results in lower entropy values, indicating sparser alignment matrices.
% produced by \ourmethod.
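As a sketch of this computation, assuming that rows of $\pi^u$ index the $K$ rating interest factors, columns index the $J$ text interest factors, $M$ denotes the number of users, and $0\log_2 0 = 0$ by convention, the row-wise and column-wise normalizations are
\[
\ddot{\pi}^u_{kj} = \frac{\pi^u_{kj}}{\sum_{j'=1}^{J}\pi^u_{kj'}}, \qquad
\overline{\pi}^u_{kj} = \frac{\pi^u_{kj}}{\sum_{k'=1}^{K}\pi^u_{k'j}},
\]
the corresponding entropies are
\[
\mathcal{H}(\ddot{\pi}^u) = -\frac{1}{K}\sum_{k=1}^{K}\sum_{j=1}^{J}\ddot{\pi}^u_{kj}\log_2\ddot{\pi}^u_{kj},
\]
\[
\mathcal{H}(\overline{\pi}^u) = -\frac{1}{J}\sum_{j=1}^{J}\sum_{k=1}^{K}\overline{\pi}^u_{kj}\log_2\overline{\pi}^u_{kj},
\]
and the reported value is
\[
\mathcal{H} = \frac{1}{M}\sum_{u}\frac{1}{2}\left(\mathcal{H}(\ddot{\pi}^u) + \mathcal{H}(\overline{\pi}^u)\right).
\]
Under this definition, $\mathcal{H}(\ddot{\pi}^u)$ is $0$ when every row of $\pi^u$ has a single non-zero entry and reaches $\log_2 J$ when every row is uniform; the column-wise term behaves analogously, with maximum $\log_2 K$.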

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{figures/case_study/case_study}
\textbf{Case study. } To better understand the alignment process in \ourmethod, Figure \ref{fig:case_study} visualizes an example of decoder outputs corresponding to the alignment matrix in Figure \ref{fig:epsilon_case_study}(a).
The alignment matrix in Figure \ref{fig:epsilon_case_study}(a) reveals a staggered correspondence between the interest factors of the two modalities.
For instance, rating interest factor 3 aligns with text interest factor 1. This observation is consistent with Figure \ref{fig:case_study}, where the top three items predicted by rating interest factor 3 include VR products, while the top words of text interest factor 1 include relevant terms such as \emph{virtual, reality, vr, glasses}. Similar interpretations can be made for the other factors. This showcases \ourmethod's ability to semantically align and interpret rating factors.
% Similarly, rating and text interest factors 2 are matched as both focusing on Samsung Galaxy Note 4 accessories. Rating interest factor 1, relating to Motorola mobile phones, corresponds with text interest factor 3, including relevant terms like \emph{moto, edition, bamboo}. Thus, words predicted by decoder are capable of capturing the semantic of rating interest factors.
% Next, we examine the influence of key hyper-parameters, $\epsilon$ and $\lambda_t$, on the interpretation of alignment matrix and text decoder outputs.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{table/case_study/text_factor_similarity}
% \textbf{Effect of $\lambda_t$ on text decoder outputs. } The output of text decoder has an influence on the understanding of the relationship between ratings and texts, reflected by the distinctiveness of top words predicted by \ourmethod\ for each text interest factor. More distinctive top words make it easier for humans to interpret the meaning behind each factor, therfore gaining insights into user preferences. To quantify this, we use the top 10 words as representatives of each text interest factor. The similarity between text interest factors is the fraction of shared words between them. For a user $u$, top words representing text interest factors $i, j$ are $T^u_i, T^u_j$. We calculate the similarities between pairs of text interest factors for each user then report the averaged number over all $M$ users in Table \ref{tab:lambda_t_text_factor}. The reported numbers are calculated as $T = \frac{1}{M}\sum_{u}\frac{2}{J(J - 1)}\sum_{i, j: i \neq j}\frac{|T^u_i \cap T^u_j|}{|T^u_i|}$, in which $|\cdot|$ is the cardinality, $J$ is the number of text interest factor per user. Table \ref{tab:lambda_t_text_factor} shows that large $\lambda_t$ results in low similarities between text interest factors, which benefits the understanding of user interests as top predicted words are more distinctive.
\textbf{Effect of $\lambda_t$ on interpretability. }The text decoder's output, particularly the distinctiveness of the top words for each text interest factor, influences the interpretability of the relationship between ratings and texts. Distinctive words make it easier to infer the meaning behind each factor, providing insights into user preferences. To quantify this, we measure the similarity between two text interest factors as the fraction of words shared by their top 10 terms. For each user, we compute pairwise similarities and report the average over all users in Table \ref{tab:lambda_t_text_factor}. The table shows that a larger $\lambda_t$ results in lower similarities between text interest factors, i.e., more distinctive top predicted words, which facilitates the understanding of user interests.
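As a sketch of this measure, let $T^u_i$ denote the set of top 10 words of text interest factor $i$ for user $u$, and assume $J \geq 2$ text interest factors per user and $M$ users; averaging over unordered factor pairs is one possible normalization:
\[
T = \frac{1}{M}\sum_{u}\frac{2}{J(J-1)}\sum_{i<j}\frac{|T^u_i \cap T^u_j|}{|T^u_i|},
\]
where $|\cdot|$ denotes set cardinality. For example, two factors whose top 10 lists share 3 words have a pairwise similarity of $3/10 = 0.3$.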

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \textbf{Efficiency analysis. } We compare the efficiency level w.r.t. training time per epoch and memory consumption of \ourmethod\ against those of two strongest baselines ADDVAE and VALID. Due to limited space, this experiment is presented in the appendix. The key observations are first, \ourmethod\ maintains a comparable efficiency level yet achieves higher recommendation accuracy than ADDVAE and VALID; second, the difference in efficiency level of \ourmethod\ and ADDVAE comes from the design of rating encoder; third, the textual modeling component results in a gap between training time and memory of \ourmethod\ and VALID.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \textbf{Study the values of $K$ and $J$. } 
% We analyze \ourmethod's accuracy w.r.t. various numbers of rating factors $K$ and numbers of text factors $J$, which is shown in the appendix to save space. Key takeaways are first, $K$ and $J$ are data-dependent; second, \ourmethod\ is capable of dealing with the case when the number of user interests between two modalities differ while existing work ADDVAE is unable due to their overly strict assumption.\looseness=-1

% \emph{Due to limited space, we present the more ablative studies the supplementary material: training pseudo code, efficiency analysis, effect of $\epsilon$ in Equation \ref{eqn:regularized_OT}, effect of $K$ and $J$, effect of $\lambda_r$ and $\lambda_t$, number of prototype updates $L^y$ and $L^t$.}\looseness=-1