%\vspace*{-5cm}
\section{Extended related work}
\textbf{VAE-based disentangled representation learning. } Uncovering the hidden explanatory factors behind data yields robust representations and enables modeling of the complex patterns underlying the data \cite{RLReview:2013}. The Variational AutoEncoder (VAE) is a popular method for representation disentanglement. Early works in this direction \cite{betavae:2017, betavae:2018, disen_factorize:2018, isolating_vae:2018, ChallengeDisenRL:2019} focus on dimension-level disentanglement, where each element of the representation vector captures a distinct latent feature. Later, \cite{macridvae:2019, VALID:2023, FacetVAE:2024, DualVAE:2024} extend this line by disentangling user preferences not only at the dimension level but also at the intention level.
% , which results in a new standard for recommendation accuracy. 
Follow-up works incorporate various sources of information to improve the disentanglement of user preferences. \cite{ADDVAE:2022, TopicVAE:2022} employ textual content, while \cite{SemMacridVAE:2023} exploits visual information. \cite{DGVAE:2024, AlignMacridVAE:2024} seek the rich knowledge behind multi-modal data, i.e., textual and visual features. \cite{curcodis:2023} integrates social relationships between users to better disentangle user preferences. Our work follows this line of research yet is distinctive in incorporating \emph{optimal transport} to align disentangled rating and text factors. While we mainly focus on rating and text data in this work, the proposed method is also applicable when multiple modalities are involved, as elaborated in Section \ref{sec:extension}.

%%%%%%%%%%%%%%%%
\textbf{Textual content-aware recommendation. } 
Early methods \cite{ctr:2011, cdl:2015, convmf:2016, gate:2019} leverage deep neural networks to model item textual content, thereby enhancing recommendation performance. Later, VAEs have been widely adopted for this task, in both non-disentangled \cite{VBAE:2023, cvae:2017, MDCVAE:2022} and disentangled \cite{dicer:2020, ADDVAE:2022, TopicVAE:2022} fashions.
% Our work aligns with this category. However, 
What distinguishes our work from these is the introduction of an optimal transport (OT)-based approach to align and fuse interest factors from ratings and textual content, which provides a more flexible and nuanced alignment between interest factors.
% By framing the alignment as an OT problem, our work provides a more flexible and nuanced method for aligning the multiple latent factors that represent user preferences, overcoming the limitations of previous fixed-alignment models.
Recently, pre-trained language models (PLMs), e.g., \cite{BERT:2019}, have been explored to generate text-based item representations for recommendation \cite{UniSRec:2022, BM3:2023, TIGER:2023}. While PLMs offer powerful text encodings, they tend to compress the entire content into a single vector, which ignores the intricate structure and multi-faceted nature of textual data. In contrast, our work focuses on disentangling multiple interest factors from textual content to capture a richer representation of user preferences. Thus, we leave the integration of PLMs into our framework as a direction for future work.
% as current PLM techniques do not align well with our intention of preserving the latent structure behind textual information. 
% Moreover, while some recent works, e.g., 
\cite{DGVAE:2024, AlignMacridVAE:2024} leverage textual and visual data for recommendation, which differs significantly from our setting, particularly in the use of multi-modal features. As a result, these works are not directly comparable with our model, which focuses solely on aligning ratings and textual content.
Additionally, our work is related to hybrid recommender systems \cite{FM:2010, HybridSVD:2019, EASEr:2020, MDCF:2023}, which aim to tackle challenges like the cold-start problem by combining multiple data sources. However, our primary objective in this paper is to discover and align multiple interest factors across modalities in a warm-start setting, where sufficient interaction data is available. Addressing the cold-start problem, though relevant, falls outside the scope of this work.\looseness=-1
% as we aim to focus on improving recommendation accuracy by enhancing cross-modal alignment of interest factors in established user-item interactions.

%%%%%%%%%%%%%%%%%%%
\textbf{Optimal transport and its applications. } 
% Optimal Transport (OT) provides an elegant way to measure `distance` between two probability distributions and transport a point from a distribution to another \cite{computational_OT:2019}. Sinkhorn algorithm \cite{sinkhorn_distance:2013, AutoDiff:2018} is a widely adopted method to compute optimal transport plan 
% , which enables various applications. 
% \cite{OT_reglab:2014, OT_DA:2016} apply OT for domain adaptation. \cite{fusion_OT:2020} fuses different models' layers. \cite{Align_transformer:2021} aligns multiple query and key matrices in multi-head attention. \cite{Sinkformer:2022} improves attention matrix in Transformer. \cite{OTKGE:2022} improves knowledge graph modeling by fusing multimodal data via OT. \cite{ECRTM:2023} designs OT-inspired regularization term to improve topic modeling. \cite{MESH:2023} explores OT for object-centric learning. 
% Our novelty is distinct by aligning and fusing mutually disentangled user interests from ratings and texts for recommendation.\looseness=-1
Optimal Transport (OT) offers an elegant framework to measure the distance between two probability distributions and to transport points from one distribution to another \cite{computational_OT:2019}. A popular method for computing the optimal transport plan is the Sinkhorn algorithm \cite{sinkhorn_distance:2013, AutoDiff:2018}, which is efficient and GPU-friendly and has thus enabled numerous applications across various domains. For example, OT has been utilized in domain adaptation \cite{OT_reglab:2014, OT_DA:2016}, model fusion \cite{fusion_OT:2020}, and attention-based models \cite{Align_transformer:2021, Sinkformer:2022}.
% , where it helps fuse different models' layers. 
% In the context of attention-based architectures, 
% Recently, OT has been employed to improve attention mechanism \cite{Align_transformer:2021}, \cite{Sinkformer:2022}. 
% to align query and key matrices in multi-head attention mechanisms \cite{Align_transformer:2021}, while Sinkformer \cite{Sinkformer:2022} leverages OT to enhance the attention matrix in transformers. 
Additionally, OT has demonstrated its effectiveness in fusing multi-modal knowledge graph data \cite{OTKGE:2022}, enhancing the coherence of topic modeling via regularization \cite{ECRTM:2023}, and improving object-centric learning \cite{MESH:2023}.
% knowledge graph modeling by fusing multi-modal data for richer representations \cite{OTKGE:2022}, as well as in topic modeling, where OT-inspired regularization improves the coherence of topic distributions \cite{ECRTM:2023}. 
% Furthermore, OT has recently been explored for object-centric learning  
% % which involves decomposing scenes into distinct objects 
% \cite{MESH:2023}. 
In recommender systems, OT has also been widely explored, e.g., for aggregating non-local information in graph-based recommendation \cite{GOTNet:2022} or for finding user correspondence in the cross-domain recommendation setting \cite{UDMCF:2024}.
Our work builds on OT but adopts an orthogonal approach. 
Specifically, we apply OT to align and fuse mutually disentangled user interest factors derived from ratings and textual content.
% to improve the accuracy and expressiveness of textual content-aware recommendation models. 
By leveraging OT to perform this alignment in a data-driven manner, our approach allows for a more flexible and personalized representation of user preferences across modalities, leading to higher recommendation accuracy and offering a feasible method to gain insights into the relationship between user interactions and textual content.
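For reference, the entropy-regularized OT problem addressed by the Sinkhorn algorithm can be sketched in its standard form \cite{sinkhorn_distance:2013, computational_OT:2019}. The notation below ($\textbf{a}$, $\textbf{b}$, $\textbf{C}$, $\textbf{P}$, $\textbf{G}$, $\textbf{u}$, $\textbf{v}$, $n$, $m$) is generic and introduced here only for illustration; it does not specify the cost or marginals used in our model. Given marginals $\textbf{a} \in \mathbb{R}_{+}^{n}$ and $\textbf{b} \in \mathbb{R}_{+}^{m}$ that each sum to one, a cost matrix $\textbf{C} \in \mathbb{R}^{n \times m}$, and a regularization strength $\epsilon > 0$, the plan $\textbf{P}^{\star}$ is sought over non-negative matrices $\textbf{P}$ with row sums $\textbf{a}$ and column sums $\textbf{b}$:
\begin{equation*}
    \begin{gathered}
        \textbf{P}^{\star} = \operatorname*{arg\,min}_{\textbf{P}} \ \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} C_{ij} + \epsilon \sum_{i=1}^{n}\sum_{j=1}^{m} P_{ij} \log P_{ij}, \\
        \textbf{u} \leftarrow \textbf{a} \oslash (\textbf{G}\textbf{v}), \qquad \textbf{v} \leftarrow \textbf{b} \oslash (\textbf{G}^T\textbf{u}), \qquad \textbf{G} = \exp(-\textbf{C}/\epsilon),
    \end{gathered}
\end{equation*}
where $\oslash$ and $\exp$ act element-wise. The two scaling updates are alternated until convergence, after which $\textbf{P}^{\star} = diag(\textbf{u})\,\textbf{G}\,diag(\textbf{v})$. Because the updates involve only matrix-vector products, the plan may be rectangular ($n \neq m$) and the whole procedure is differentiable.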
% , distinguishing our method from existing work that focuses primarily on direct alignment or fusion without accounting for the nuanced, multi-dimensional nature of interest factors in different data types. 
% This innovative use of OT not only enhances recommendation accuracy but also offers a feasible method to gain insights into the relationship between user interactions and textual content.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Text Encoder $\mathcal{E}^t$}
\textbf{Text encoder}\label{sec:text_encoder_full}
$\bm{\mathcal{E}}^t$ functions similarly to the rating encoder $\mathcal{E}^y$, but its input is the textual content $\textbf{t}^u$ associated with user $u$.
$\mathcal{E}^t$ also leverages prototypes, $\textbf{m}^t \in \mathbb{R}^{J \times d}$, to cluster words into $J$ groups, each representing one user interest from texts. In general, this clustering can run iteratively for $L^t$ iterations. However, we empirically found that an iterative clustering process inside $\mathcal{E}^t$, i.e., $L^t > 1$, brings no clear improvement. Thus, to maintain efficiency, we set $L^t = 1$. As such, there is no prototype updating inside $\mathcal{E}^t$, and the set of text prototypes is shared among users, i.e., $\textbf{m}^{ut} = \textbf{m}^t$ for every user $u$. We then compute the word-cluster assignment matrix $\textbf{A}^{ut} \in \mathbb{R}^{W \times J}$ as
\begin{equation}\label{eqn:text_prototype_update}
    \begin{gathered}
        \small
        \textbf{A}^{ut} = \eta\left(\frac{\textbf{E}\cdot (\textbf{m}^{ut})^T}{\tau \cdot \|\textbf{E}\|_2 \cdot \|\textbf{m}^{ut}\|_2}\right)
    \end{gathered}
\end{equation}
As in the rating encoder $\bm{\mathcal{E}}^y$, $\eta$ is the Gumbel-Softmax function.
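As a reminder, the Gumbel-Softmax draws a differentiable sample that approximates a draw from a categorical distribution. A minimal sketch of its standard form is given below, under the assumption that $\eta$ acts row-wise over the $J$ clusters. The symbols $\ell_{wj}$, $g_{wj}$, $U_{wj}$ and the relaxation temperature $\tau'$ are introduced here for illustration only, and how $\tau'$ relates to $\tau$ in Equation~\ref{eqn:text_prototype_update} is an implementation choice not specified in this section. Writing $\ell_{wj}$ for the entry of the argument of $\eta$ that corresponds to word $w$ and prototype $j$,
\begin{equation*}
    \eta(\bm{\ell}_{w})_j = \frac{\exp\left((\ell_{wj} + g_{wj})/\tau'\right)}{\sum_{j'=1}^{J} \exp\left((\ell_{wj'} + g_{wj'})/\tau'\right)}, \qquad g_{wj} = -\log(-\log U_{wj}), \qquad U_{wj} \sim \mathrm{Uniform}(0, 1).
\end{equation*}
Without the noise terms $g_{wj}$, this reduces to an ordinary softmax over the $J$ prototypes.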
Next, we estimate the two parameters of the Gaussian distribution for text interest factor $j$ as $\bm{\mu}^{ut}_j = \frac{\textbf{r}^{ut}_j}{\|\textbf{r}^{ut}_j\|_2}$ and $\bm{\sigma}^{ut}_j = \sigma^t \cdot \exp(-\frac{1}{2}\textbf{o}^{ut}_j)$, where
\begin{equation}\label{eqn:text_factor_j_repr}
    \begin{gathered} 
        \small
        (\textbf{r}^{ut}_j, \textbf{o}^{ut}_j) = \textbf{W}'_2 \tanh(\textbf{W}'_1 \, norm(\textbf{A}^{ut}_{:j} \odot \textbf{t}^u) + \textbf{b}'_1) + \textbf{b}'_2
        % \\
        % \bm{\mu}^{ut}_j = \frac{\textbf{r}^{ut}_j}{||\textbf{r}^{ut}_j||_2}; \hspace{2mm} \bm{\sigma}^{ut}_j = \sigma^t \cdot exp(-\frac{1}{2}\textbf{o}^{ut}_j) 
    \end{gathered}
\end{equation}
$\odot$ and $norm(\cdot)$ are the same as in the rating encoder. $\textbf{W}'_1 \in \mathbb{R}^{D \times W}$, $\textbf{b}'_1 \in \mathbb{R}^{D}$, $\textbf{W}'_2 \in \mathbb{R}^{2d \times D}$, and $\textbf{b}'_2 \in \mathbb{R}^{2d}$ are the weight matrices and bias vectors of the text encoder, with shapes following the left-multiplication convention of Equation~\ref{eqn:text_factor_j_repr}.
% Note that these learnable parameters are distinct from those of $\bm{\mathcal{E}}^y$. 
The value of $\sigma^t$ is around $0.1$.
The $j^{th}$ text factor is then sampled as $\textbf{z}^{ut}_j \sim \mathcal{N}(\bm{\mu}^{ut}_j, [diag(\bm{\sigma}^{ut}_j)]^2)$, and this is repeated for all $j = 1, 2, \ldots, J$.
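In practice, sampling of this form is typically implemented with the reparameterization trick, so that gradients can flow through $\bm{\mu}^{ut}_j$ and $\bm{\sigma}^{ut}_j$. A sketch, assuming this standard choice (the noise variable $\bm{\xi}_j$ is notation introduced here for illustration), is
\begin{equation*}
    \textbf{z}^{ut}_j = \bm{\mu}^{ut}_j + \bm{\sigma}^{ut}_j \odot \bm{\xi}_j, \qquad \bm{\xi}_j \sim \mathcal{N}(\textbf{0}, \textbf{I}),
\end{equation*}
where $\odot$ is the element-wise product and $\textbf{I}$ is the $d \times d$ identity matrix.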
Assuming independence between the text factors of user $u$, we have
$q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut}) = \prod_{j=1}^J \mathcal{N}(\bm{\mu}^{ut}_j, [diag(\bm{\sigma}^{ut}_j)]^2)$ as the variational distribution, which is then aligned with the prior distribution $p(\textbf{z}^{ut}) = \mathcal{N}(\textbf{0}, (\sigma^t)^2\textbf{I})$ via the Kullback-Leibler divergence ($D^t_{KL}$)
% . As $p(\textbf{z}^{ut})$ is a factorized distribution, optimizing $D^t_{KL}$ also 
to impose micro-disentanglement.
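For concreteness, under the Gaussian forms stated above this regularizer admits a closed form. The following is a sketch of the standard derivation, assuming the prior is exactly $\mathcal{N}(\textbf{0}, (\sigma^t)^2\textbf{I})$ over all $J \cdot d$ coordinates; the coordinate index $c$ is notation introduced here for illustration. Substituting $\sigma^{ut}_{jc} = \sigma^t \exp(-\frac{1}{2} o^{ut}_{jc})$ into the KL divergence between two diagonal Gaussians gives
\begin{equation*}
    \begin{gathered}
        D^t_{KL} = \frac{1}{2} \sum_{j=1}^{J} \sum_{c=1}^{d} \left[ \exp(-o^{ut}_{jc}) + o^{ut}_{jc} - 1 + \frac{(\mu^{ut}_{jc})^2}{(\sigma^t)^2} \right] \\
        = \frac{1}{2} \sum_{j=1}^{J} \sum_{c=1}^{d} \left[ \exp(-o^{ut}_{jc}) + o^{ut}_{jc} - 1 \right] + \frac{J}{2 (\sigma^t)^2},
    \end{gathered}
\end{equation*}
where the second line uses $\|\bm{\mu}^{ut}_j\|_2 = 1$. Under these assumptions the mean contributes only a constant, so the regularizer effectively penalizes deviations of $\textbf{o}^{ut}_j$ from $\textbf{0}$.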
% , i.e., disentanglement between dimensions of representation sampled from $q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut})$. We add the regularization term $D^t_{KL}(q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut}) || p(\textbf{z}^{ut}))$ into Equation \ref{eqn:text_objective} for optimization.\looseness=-1

\emph{In summary}, the text encoder $\mathcal{E}^t$ produces $J$ text interest factors $\textbf{z}^{ut} = \{\textbf{z}^{ut}_j\}_{j=1}^J$, the assignment matrix $\textbf{A}^{ut}$, and the regularization term $D^t_{KL}(q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut}) \,\|\, p(\textbf{z}^{ut}))$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \section{Data preprocessing}
% For CiteULike-a, Cell Phones and Video Games datasets, we use the accompanying textual content, i.e., title \& abstract for CiteULike-a and item descriptions for Amazon categories. For Cell Phones, we retain users with at least $8$ interactions and items with at least $5$ interactions and for Video Games, these numbers are 5 and 5, respectively. For MovieLens, we follow \cite{MDCVAE:2022} to extract a subset of users from ML-10M version. We keep user ratings larger than $3$ as interactions \cite{macridvae:2019} and collect item textual content from IMDB \footnote{https://datasets.imdbws.com/}. For all datasets, we remove stop words and only keep words with frequency higher than $3$ and appearing in less than $60\%$ of item texts and retain top $8k$ words with highest frequency as in \cite{MDCVAE:2022}. These strategies help ensure that even short or noisy item descriptions contribute meaningful information. Moreover, these steps are employed across baselines, ensuring fair comparison. We keep these pre-processing steps at minimal complexity so that the performance gain is attributed to our proposed aligning mechanism. Employing advanced methods to generate clean text would potentially enhance our proposed framework.\looseness=-1
% % , top $8k$ words with highest frequency are retained to construct vocabulary. 
% % \input{table/data_stats/data_stats}

% We adopt \emph{strong generalization} setting as in \cite{macridvae:2019} to construct training, validation and test sets by randomly choosing $80\%$ of users for training and $10\%$ of users for each validation and test sets. For validation and test sets, $20\%$ of a user interactions is kept as the ground truth. To keep the quality of datasets, we only retain items with at least $5$ words in their textual content so that the textual content brings semantic information. All cold-start items, i.e., those do no appear in training set, are discarded since there is no parameters associating with them, following the common practice in the field.
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \section{Baseline description}
% \begin{itemize}[leftmargin=*]
%     \item \textbf{MacridVAE} \cite{macridvae:2019} introduces macro- and micro-disentanglement of user preferences via multi-prototype representation and independence regularization.\looseness=-1
%     \item \textbf{RecVAE} \cite{RecVAE:2020} proposes composite prior, rescaling regularization term and an alternative training into a novel VAE-based recommendation model. 
%     \item \textbf{MDCVAE} \cite{MDCVAE:2022} regularizes decoder weights of the user-oriented autoencoder by latent embeddings inferred from textual content.
%     \item \textbf{TopicVAE} \cite{TopicVAE:2022} improves disentangling user preferences by designing attention-based topic extraction from textual content, topic-guided contrastive loss and heuristic method to set value of regularization term.
%     \item \textbf{ADDVAE} \cite{ADDVAE:2022} leverages two disentangled networks to model user's ratings and user associated texts then aligns disentangled factors from these two modalities using compositional de-attention and regularization.
%     \item  \textbf{ELSA} \cite{ELSA:2022} improves SOTA linear autoencoder by factorizing hidden space into a low-rank plus sparse structure.\looseness=-1
%     \item \textbf{SEM-MacridVAE} \cite{SemMacridVAE:2023} exploits semantic knowledge from side information to improve VAE-based disentangled recommendation models. We use tf-idf item-word matrix, i.e., $\textbf{W} = \{\textbf{w}^i\}_{i=1}^N$, as side information for fair comparison.
%     \item \textbf{VALID} \cite{VALID:2023} improves VAE-based disentangling user interests by iterative latent attention and implicit differentiation.\looseness=-1
%     \item  \textbf{FacetVAE} \cite{FacetVAE:2024} disentangles multi-faceted item space and derive compositional user interests via bi-directional binding.\looseness=-1
% \end{itemize}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \section{Implementation details}
% For all models, we choose the hyper-parameters based on performance on validation set. Then, we retrain and report performance on test set, which is averaged over ten runs on NVIDIA RTX 2080 Ti GPU machine. 
% Pertaining to baselines, we follow their original papers to choose hyper-parameters by performing grid search in the same range described in those papers. 
% Regarding \ourmethod, the default settings are $D = 300$ for MovieLens and Cell Phones and $D=600$ for CiteULike-a and Video Games after tuning from \{100, 200, 300, 500, 600\}; embedding size $d = 100$ for all datasets; dropout rate applied for $\textbf{A}^{uy}$ and $\textbf{A}^{ut}$ is $0.5$; number of rating and text factors are $K = 4$ and $J = 4$, respectively (more values of $K$ and $J$ are analyzed in subsequenct sections; $\beta^y$ and $\beta^t$ follow annealing process $min(\beta_0, \frac{update}{T})$ where $\beta_0 = 1$ for rating channel and $\beta_0 = 0.2$ for text channel, $T$ is chosen from $\{1k, 5k, 10k, 20k\}$, and $update$ is the number parameter updates; $\sigma^y$ and $\sigma^t$ are chosen from $\{0.05, 0.075, 0.1\}$; the search space of $\lambda_t$ and $\lambda_r$ is $\{0.1, 0.2, 0.5, 1, 2, 5\}$; $\epsilon \in \{0.2, 0.5, 1\}$ in Sinkhorn algorithm. Archiecture of fusion network $\zeta: 2d \rightarrow d/2 \rightarrow 1$. The number of prototype update steps $L^y$ in rating encoder are chosen from $\{2, 3, 4\}$ while $L^t = 1$. We train \ourmethod\ using Adam optimizer with learning rate $0.001$ on  NVIDIA RTX 2080 Ti GPU machine. Training stops after $30$ epochs without improving performance on validation set. 
% We report Recall and NDCG at top 10 and 50 with full-ranking strategy \cite{recsys_eval_setting:2020}, i.e., test item is ranked against all items to avoid sampling bias.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Additional results}
\textbf{Analysis on the Number of Interest Factors. }\label{sec:num_factors}
We report recommendation accuracy w.r.t. the number of rating factors $K$ and the number of text factors $J$ in Table \ref{tab:rating_factor} and Table \ref{tab:text_factor}, respectively. 
\emph{For rating factors}, setting $K = 3$ or $K = 4$ gives the best accuracy on CiteULike-a, while $K \geq 5$ is ideal for Cell Phones. These results indicate that users have varied interests. \ourmethod\ generally performs best on Video Games with $K \geq 4$, while it reaches its peak performance on MovieLens with $K = 4$.
\emph{Pertaining to text factors}, setting $J = 4$ results in higher accuracy on CiteULike-a and MovieLens. In contrast, \ourmethod's performance on Cell Phones is not highly sensitive to $J$. For Video Games, setting $J \leq 4$ produces better accuracy than larger values.

It is worth noting that, thanks to the pair-wise alignment between interest factors, \ourmethod\ can accommodate users' distinctive behaviors across modalities, i.e., cases where $K$ and $J$ differ, whereas baselines such as \cite{ADDVAE:2022} cannot. As a result, \ourmethod\ offers greater flexibility and is more broadly applicable.
\input{table/raing_factor/rating_factor}
\input{table/text_factor/text_factor}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Effect of $\lambda_r$. }
Figure~\ref{fig:lambda_r} presents \ourmethod's accuracy w.r.t. $\lambda_r$, which controls the effect of the regularization term for interest transfer between rating and text factors. First, we observe that setting $\lambda_r$ to $1$ or $0.5$ results in higher accuracy on the chosen datasets. Second, the effect of $\lambda_r$ is data-dependent, e.g., while CiteULike-a favors a large $\lambda_r$, the remaining datasets require smaller values, i.e., around $0.5$ and $1$. An excessive value of $\lambda_r$ may be detrimental.
\input{figures/lambda_r/lambda_r}
% while $\lambda_r \geq 0.5$ is favorable showing that imposing a regularization term between interest factors from ratings and texts (as in Equation ~\ref{eqn:OT_reg}) works to some degree, $\lambda_r$ should be chosen carefully as an excessive value might cause detrimental effect.\looseness=-1 

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Effect of $\lambda_t$. }
Figure \ref{fig:lambda_t} presents the influence of $\lambda_t$, which controls the effect of the text reconstruction objective, on \ourmethod's accuracy. First, setting $\lambda_t > 0$ leads to higher accuracy than setting $\lambda_t = 0$, underscoring the benefit of textual signals. Second, setting $\lambda_t$ to a moderate value, i.e., around $0.5$ or $1$, results in favorable accuracy across datasets.
\input{figures/lambda_t/lambda_t}

% First, on each dataset, choosing a proper value of $\lambda_t$ results in higher accuracy than setting $\lambda_t = 0$, showing textual content reconstruction benefits \ourmethod's recommendation accuracy. This further confirms our hypothesis that transferring interest signals between rating and textual modalities is beneficial. Second, the influence of $\lambda_t$ varies across datasets and therefore, $\lambda_t$ should be chosen carefully to obtain higher Recall and NDCG.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{table/running_time/running_time}
\textbf{Efficiency Analysis.} Table \ref{tab:running_time} analyzes the efficiency of \ourmethod\ and the two strongest baselines, ADDVAE and VALID. For each model, we record the training time per epoch in seconds (averaged over ten runs) and the memory required for training (in GB). There are three key takeaways. First, \ourmethod\ maintains a comparable level of efficiency while achieving higher recommendation accuracy than ADDVAE and VALID. Second, the training time and memory gaps between VALID and \ourmethod\ come from the textual content modeling component, i.e., the text channel, which is present in \ourmethod\ but absent from VALID. Third, although both models include a textual content modeling module, \ourmethod\ employs multiple prototype updates in the rating encoder while ADDVAE does not, which accounts for the difference in efficiency between these two models. Though \ourmethod\ introduces added complexity, this is the cost of modeling richer, user-aware cross-modal relationships. As shown in Table \ref{tab:recom_result}, this complexity leads to clear performance gains.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{table/rating_iter/rating_iter}
\textbf{Effect of $L^y$.}\label{sec:rating_prototype_update}
Table \ref{tab:rating_iter} shows \ourmethod's recommendation accuracy w.r.t. $L^y$, the number of iterations used to update rating prototypes in the rating encoder $\bm{\mathcal{E}}^y$. Evidently, using more than one prototype update step leads to higher recommendation accuracy on all chosen datasets, which confirms the effectiveness of updating prototypes in the rating encoder. This finding is consistent with \cite{VALID:2023}. Moreover, each dataset requires a specific value of $L^y$ to achieve favorable accuracy, e.g., $2$ on CiteULike-a, $4$ on Cell Phones, and $3$ on Video Games and MovieLens. \looseness=-1

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \input{figures/prototype_update_text/prototype_update_text}
\input{table/text_iter/text_iter}
\textbf{Effect of $L^t$.}\label{sec:text_prototype_update}
Table \ref{tab:text_iter} presents the influence of the number of iterations $L^t$ used to update text prototypes in the text encoder $\bm{\mathcal{E}}^t$. The effect of $L^t$ is contrary to that of $L^y$, i.e., using more prototype update steps reduces \ourmethod's performance. We conjecture that users' comprehension of textual content, i.e., words and phrases, is roughly the same, and thus imposing personalization on word clustering in the text encoder by setting $L^t > 1$ has a detrimental effect. As such, we fix $L^t = 1$ for all datasets throughout the paper to maintain both effectiveness and efficiency.