%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{figures/architecture/architecture}
\textbf{Preliminaries and notations. }
Our setting includes $M$ users indexed by $u$, and $N$ items indexed by $i$. 
% The user-item interactions are stored in $\textbf{R} \in \{0, 1\}^{M \times N}$. 
For user $u$, let $\textbf{y}^u \in \{0, 1\}^N$ be their historical interactions with items.
% $\textbf{y}^u$ is the $u^{th}$ row of $\textbf{R}$. 
$\textbf{y}^u_i = 1$ indicates an observed interaction between $u$ and $i$; otherwise $\textbf{y}^u_i = 0$. For item $i$, let $\textbf{w}^i \in \mathbb{R}^W$ be the tf-idf representation of its textual content, where $W$ is the number of words in the vocabulary. Let $\textbf{t}^u \in \mathbb{R}^W$ be the textual vector of user $u$, obtained by averaging over their adopted items as $\textbf{t}^u = \frac{\sum_{i}{\textbf{y}^u_i\textbf{w}^i}}{\sum_{i}\textbf{y}^u_i}$. Let $\textbf{H} \in \mathbb{R}^{N \times d}$ be the embedding matrix of the $N$ items, which serves as the decoder weight of the rating channel in Figure \ref{fig:main_architecture}. The encoder of the rating channel is a two-layer Multilayer Perceptron (MLP).
% used to derive user representations mined from $\textbf{y}^u$. 
Inside the text channel in Figure \ref{fig:main_architecture}, the decoder weight is denoted by $\textbf{E} \in \mathbb{R}^{W \times d}$, which stores the $d$-dimensional embeddings of the $W$ words in the vocabulary. The encoder of the text channel is another two-layer MLP.
% , which is used to derive user representations uncovered from $\textbf{t}^u$. 
Our initial exploration leveraged a BERT-style pre-trained language model (PLM) to generate initial $\textbf{H}$ and $\textbf{E}$ from textual content, but this did not yield favorable recommendation accuracy. Thus, for fair comparison with baselines, we do not include a PLM and leave the integration of pre-trained models such as CLIP or Large Language Models to future work.\looseness=-1

% Our goal is to reveal user preferences underlying rating and textual modalities, denoted by $\textbf{y}^u$ and $\textbf{t}^u$, respectively.
% To achieve this, we seek factorized user representations from $\textbf{y}^u$ and $\textbf{t}^u$, denoted as $\textbf{z}^{uy}$ and $\textbf{z}^{ut}$, respectively. Concretely, $\textbf{z}^{uy} = \{\textbf{z}^{uy}_{k}\}_{k=1}^K$ assuming $K$ rating interest factors underlying $\textbf{y}^u$. 
% Similarly, $\textbf{z}^{ut} = \{\textbf{z}^{ut}_{j}\}_{j=1}^J$ consists of $J$ text interest factors behind $\textbf{t}^u$. Next, we align these rating and text factors via optimal transport. The target is two-fold. For one, aligning and fusing interest factors increases their expressiveness thanks to blending user preferences from two modalities.
% For another, mapping rating factors onto text space improves interpretability as textual content is human-understandable.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Overview of \ourmethod}\label{sec:overview}
Figure \ref{fig:main_architecture} illustrates our model \ourmethod, which discovers
user preferences from ratings $\textbf{y}^u$ and texts $\textbf{t}^u$ for a user $u$.
% Thus, we seek factorized user representations denoted as $\textbf{z}^{uy}$ and $\textbf{z}^{ut}$. 
Concretely, we seek factorized user representations: $\textbf{z}^{uy} = \{\textbf{z}^{uy}_{k}\}_{k=1}^K$ consists of $K$ rating interest factors underlying $\textbf{y}^u$. 
Similarly, $\textbf{z}^{ut} = \{\textbf{z}^{ut}_{j}\}_{j=1}^J$ consists of $J$ text interest factors behind $\textbf{t}^u$. Then, we align these rating and text factors via optimal transport, leveraging cross-modal interest signals to improve performance. Like previous VAE-based multi-interest modeling studies, \ourmethod\ includes three main components: \textbf{a) Encoder $\bm{\mathcal{E}}$} derives $K$ rating interest factors and $J$ text interest factors for each user; \textbf{b) Alignment module $\bm{\mathcal{A}}$} aligns and fuses user interest factors from ratings and texts;
\textbf{c) Decoder $\bm{\mathcal{D}}$} reconstructs the observed user-item ratings and user-associated texts. The key difference in \ourmethod\ lies in its novel adaptation of optimal transport for aligning and fusing cross-modal interest factors, which is elaborated in Section \ref{sec:OT_alignment_module}.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{User Interest Learning}\label{sec:encoder}
%%%%%%%%%%%%%%%%%%%%
\textbf{Rating encoder $\bm{\mathcal{E}}^y$}\label{sec:rating_encoder}. To model multiple user interests, we aim to uncover the structure of each user's interacted items. Inspired by \cite{VALID:2023}, we employ prototype-based clustering to group a user's interacted items into clusters, each capturing one user interest. To this end, $\mathcal{E}^y$ employs a set of $K$ prototypes $\textbf{m}^y \in \mathbb{R}^{K \times d}$, which play the role of cluster centroids. 
The clustering process runs for $L^y$ iterations (indexed by $l$). Each iteration $l = 1, 2, \dots, L^y$ computes the item-cluster assignment matrix $\textbf{A}^{uy}_l \in \mathbb{R}^{N \times K}$ and then updates the $K$ prototypes (indexed by $k$) as
\begin{equation}\label{eqn:prototype_update}
    \begin{gathered}
        \resizebox{0.9\columnwidth}{!}{$
        \textbf{A}^{uy}_{l} = \eta(\frac{\textbf{H}\cdot (\textbf{m}^{uy}_{l})^T}{\tau \cdot ||\textbf{H}||_2 \cdot ||\textbf{m}^{uy}_{l}||_2}) 
        \Longrightarrow
        \textbf{m}^{uy}_{lk} = \sum_{i}\textbf{y}^u_i(\textbf{A}^{uy}_{l})_{ik}\textbf{H}_i
        $}
    \end{gathered}
\end{equation}
where $\textbf{m}^{uy}_l = \{\textbf{m}^{uy}_{lk}\}_{k=1}^K$ and $\textbf{m}^{uy}_1 = \textbf{m}^y$.
We implement $\eta$ as the widely adopted Gumbel-Softmax \cite{gumbel_softmax:2017, Concrete_dist:2017} to ensure fair comparison with baselines. For an item $i$, its assignment scores over the $K$ clusters satisfy $\sum_{k=1}^K(\textbf{A}^{uy}_{l})_{ik}=1$ and $(\textbf{A}^{uy}_{l})_{ik} \geq 0$. $\textbf{A}^{uy}_{l}$ is based on the cosine similarity between $\textbf{H}$ and $\textbf{m}^{uy}_l$, and $\tau$ is a small temperature that concentrates the weights on the most probable prototype.
While iteratively updating $\textbf{m}^{uy}_l$ in Equation \ref{eqn:prototype_update} leads to more informative prototypes than the randomly initialized $\textbf{m}^y$, it creates a recurrent network, which is difficult to train. Thus, we apply implicit differentiation by stopping the gradient ($\mathrm{sg}$) through the prototypes after $L^y-1$ iterations, i.e., $\textbf{m}^{uy}_{L^y-1} = \mathrm{sg}(\textbf{m}^{uy}_{L^y-1})$, and then obtain the assignment matrix $\textbf{A}^{uy}_{L^y}$ as in Equation \ref{eqn:prototype_update}. For simplicity, we omit the index $L^y$ hereafter.
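
As an illustration, one common instantiation of the Gumbel-Softmax relaxation is sketched below. The logits $\ell_{ik}$, the noise $g_{ik}$, and the way the temperature enters are assumptions made for exposition and may differ from the exact implementation. Let $\ell_{ik}$ denote the temperature-scaled cosine similarity between $\textbf{H}_i$ and $\textbf{m}^{uy}_{lk}$ (the argument of $\eta$ in Equation \ref{eqn:prototype_update}), and let $g_{ik} = -\log(-\log U_{ik})$ with $U_{ik} \sim \mathrm{Uniform}(0, 1)$. Then
\[
(\textbf{A}^{uy}_{l})_{ik} = \frac{\exp(\ell_{ik} + g_{ik})}{\sum_{k'=1}^K \exp(\ell_{ik'} + g_{ik'})},
\]
which is non-negative and sums to one over $k$ by construction. Dropping the noise $g_{ik}$ recovers the ordinary softmax over clusters.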

Next, we estimate the parameters of the Gaussian distribution for each interest factor $k$ via the rating encoder's MLP:
\begin{equation}
    (\textbf{r}^{uy}_k, \textbf{o}^{uy}_k) = \textbf{W}_2 \tanh(\textbf{W}_1 \mathrm{norm}(\textbf{A}^{uy}_{:, k} \odot \textbf{y}^u) + \textbf{b}_1) + \textbf{b}_2
\end{equation}
where $\odot$ is element-wise multiplication and $\mathrm{norm}(\textbf{x}) = \textbf{x} / ||\textbf{x}||_2$ normalizes its input to a unit-length vector. $\textbf{W}_1 \in \mathbb{R}^{N \times D}, \textbf{b}_1 \in \mathbb{R}^{D}, \textbf{W}_2 \in \mathbb{R}^{D \times 2d}, \textbf{b}_2 \in \mathbb{R}^{2d}$ are weight matrices and bias vectors. 
Finally, the $k$-th rating interest factor is sampled as $\textbf{z}^{uy}_k \sim \mathcal{N}(\bm{\mu}^{uy}_k, [\mathrm{diag}(\bm{\sigma}^{uy}_k)]^2)$, where
$\bm{\mu}^{uy}_k = \frac{\textbf{r}^{uy}_k}{||\textbf{r}^{uy}_k||_2}$, $\bm{\sigma}^{uy}_k = \sigma^y \cdot \exp(-\frac{1}{2}\textbf{o}^{uy}_k)$, and $\sigma^y$ is around 0.1 \cite{macridvae:2019}.
Assuming independence between the rating factors of user $u$, we have
$q(\textbf{z}^{uy}|\textbf{y}^u, \textbf{A}^{uy}) = \prod_{k=1}^K \mathcal{N}(\bm{\mu}^{uy}_k, [\mathrm{diag}(\bm{\sigma}^{uy}_k)]^2)$
as the variational distribution, which is aligned with the prior distribution $p(\textbf{z}^{uy}) = \mathcal{N}(\textbf{0}, (\sigma^y)^2\textbf{I})$ via the Kullback-Leibler divergence $D^y_{KL}$. Following the common practice in prior studies \cite{macridvae:2019, TopicVAE:2022}, we omit the VAE prior during evaluation for stability and comparability.
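
For reference, this KL term admits a closed form under the parameterization above. The derivation below is a sketch that assumes the prior factorizes over the $K$ factors, each being $\mathcal{N}(\textbf{0}, (\sigma^y)^2\textbf{I}_d)$, and writes $o^{uy}_{kc}$ for the $c$-th entry of $\textbf{o}^{uy}_k$ (an index introduced here for exposition). Since $(\sigma^{uy}_{kc})^2 / (\sigma^y)^2 = \exp(-o^{uy}_{kc})$ and $||\bm{\mu}^{uy}_k||_2 = 1$, the contribution of factor $k$ is
\[
\frac{1}{2}\sum_{c=1}^d \big(\exp(-o^{uy}_{kc}) + o^{uy}_{kc} - 1\big) + \frac{1}{2(\sigma^y)^2},
\]
and $D^y_{KL}$ is the sum of these contributions over $k = 1, \dots, K$. The last term is constant because the means are unit-normalized, so only $\textbf{o}^{uy}_k$ receives gradients from $D^y_{KL}$. Each summand is minimized at $o^{uy}_{kc} = 0$, i.e., at $\bm{\sigma}^{uy}_k = \sigma^y\textbf{1}$.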

\emph{In summary}, rating encoder $\mathcal{E}^y$ produces $K$ rating interest factors $\textbf{z}^{uy}$=$\{\textbf{z}^{uy}_k\}_{k=1}^K$, assignment matrix $\textbf{A}^{uy}$ and regularization term $D^y_{KL}(q(\textbf{z}^{uy}|\textbf{y}^u, \textbf{A}^{uy}) || p(\textbf{z}^{uy}))$.

%%%%%%%%%%%%%%%%%%%%
\textbf{Text encoder $\bm{\mathcal{E}}^t$}\label{sec:text_encoder} clusters words into $J$ groups, each representing one user interest from texts.
$\bm{\mathcal{E}}^t$ functions similarly to the rating encoder $\mathcal{E}^y$, but accepts different inputs: user $u$'s textual content $\textbf{t}^u$, prototypes $\textbf{m}^t \in \mathbb{R}^{J \times d}$, the word embedding matrix $\textbf{E} \in \mathbb{R}^{W \times d}$, and the number of clustering iterations $L^t$.
To save space, we present the details in the appendix. 

\emph{In summary,} $\bm{\mathcal{E}}^t$ produces $J$ text interest factors $\textbf{z}^{ut}$=$\{\textbf{z}^{ut}_j\}_{j=1}^J$, assignment matrix $\textbf{A}^{ut}$, regularization term $D^t_{KL}(q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut}) || p(\textbf{z}^{ut}))$ with $ p(\textbf{z}^{ut}) = \mathcal{N}(\textbf{0}, (\sigma^t)^2\textbf{I})$.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Interest Factor Alignment}\label{sec:OT_alignment_module}
% Our goal is to align rating and text interest factors of each user to leverage cross-modal interest signals, improving recommendation accuracy. For example, user interest towards phone case discovered from texts should be aligned and fused with phone case interest mined from ratings. 
% However, this alignment is unavailable in data. Thus, it requires to effectively discover interest alignment without direct supervision. In addition, the transference of interest signals between these factors is also crucial to fully utilize cross-modal interest signals. To tackle, we regard rating and text interest factors as supporting points of two discrete distributions and frame their alignment as an optimal transport (OT) problem.
% This setup enables to adaptively learn the alignment between rating and text interest factors from data, avoiding the rigid one-to-one correspondence which might lead to sub-optimal performance. 
% Furthermore, the computed alignment also facilitates mutual transference between cross-modal interest factors, effectively improving the quality of user interest representations. Besides improving recommendation accuracy, finding the correspondence between rating and text interest factors enables to gain insights into the relationship between user ratings and textual content.\looseness=-1
Our goal is to align and fuse user interest factors derived from ratings and texts to enhance recommendation accuracy. For instance, 
% a user’s interest in phone cases inferred from text should align with the corresponding interest mined from ratings. 
the encoder extracts the headphone interest from ratings and the alignment module aligns this headphone interest to its counterpart from texts. Similarly, a user’s interest in phone cases inferred from ratings should be aligned with the corresponding interest mined from texts.
However, such alignments are not available in advance, which calls for a data-driven approach. To address this, we frame the alignment as an optimal transport (OT) problem, treating rating and text interest factors as the support points of two discrete distributions. This formulation allows the model to adaptively learn probabilistic correspondences between rating and text interest factors, avoiding rigid one-to-one mappings that risk suboptimal performance. The OT-derived alignment also enables mutual transference of interest signals between modalities, refining user interest representations. Beyond improving accuracy, this approach provides interpretable insights into the relationship between user ratings and textual content.

%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Optimal Transport-derived Alignment Matrix}\label{sec:coupling_prob}
% Existing works assume a fixed one-to-one alignment between rating interest factors $\textbf{z}^{uy}$ and text interest factors $\textbf{z}^{ut}$. However, we postulate that this assumption might lead to suboptimal performance. Alternatively, we propose using optimal transport (OT) to learn a data-driven alignment between interest factors.
Following the OT setting, we regard rating factors $\{\textbf{z}^{uy}_k\}_{k=1}^K$ and text factors $\{\textbf{z}^{ut}_j\}_{j=1}^J$ as the support points of two discrete distributions, with probability weights $p^y_k$ and $p^t_j$, respectively.
These weights lie on probability simplexes, i.e., $\sum_{k=1}^Kp^y_k = 1$ and $\sum_{j=1}^Jp^t_j = 1$. As the true distribution of $\textbf{z}^{uy}$ (and $\textbf{z}^{ut}$) is not available, we assume uniform distributions by setting $p^y_k = 1/K\ \forall{k}$ and $p^t_j = 1/J\ \forall{j}$. Let $\pi^u$ be the alignment matrix between rating and text factors, which belongs to the set of admissible couplings $\mathcal{P}^u = \{\pi^u \in \mathbb{R}^{K \times J}_+ \mid \pi^u\textbf{1}_{J} = p^y, (\pi^u)^T\textbf{1}_{K} = p^t\}$, where
$\textbf{1}_K$ and $\textbf{1}_J$ are $K$- and $J$-dimensional all-ones vectors. 
We solve the tractable entropy-regularized optimal transport problem \cite{sinkhorn_distance:2013} for $\pi^u$:
\begin{equation}\label{eqn:regularized_OT}
    \pi^u = \argmin_{\pi^u \in \mathcal{P}^u} \langle \pi^u, \textbf{S}^u \rangle_F - \epsilon \cdot \mathrm{Entropy}(\pi^u)
\end{equation}
Equation~\ref{eqn:regularized_OT} minimizes the total cost of transporting mass from the rating factors to the text factors of user $u$, resulting in the optimal alignment matrix $\pi^u$.
The first term is the Frobenius inner product between $\pi^u$ and the cost matrix $\textbf{S}^u \in \mathbb{R}^{K \times J}$, where $\textbf{S}^u_{kj} = ||\textbf{z}^{uy}_k - \textbf{z}^{ut}_j||_2^2$ and $\langle \pi^u, \textbf{S}^u \rangle_F = \sum_{k,j}\pi^u_{kj}\textbf{S}^u_{kj}$. The second term, $\mathrm{Entropy}(\pi^u) = -\sum_{k, j}{\pi^u_{kj}\log(\pi^u_{kj})}$, is the entropy of $\pi^u$, which is added to make the problem tractable. $\epsilon$ is a hyper-parameter that controls the regularization strength: a small $\epsilon$ results in a skewed $\pi^u$, while a large $\epsilon$ leads to a relatively uniform $\pi^u$.
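
The two extremes of $\epsilon$ make this trade-off concrete. The limits below are standard properties of entropy-regularized OT \cite{computational_OT:2019}, stated here for the uniform marginals assumed above:
\[
\begin{gathered}
\pi^u \xrightarrow{\ \epsilon \to \infty\ } p^y (p^t)^T = \frac{1}{KJ}\textbf{1}_K\textbf{1}_J^T, \\
\pi^u \xrightarrow{\ \epsilon \to 0\ } \argmin_{\pi \in \mathcal{P}^u} \langle \pi, \textbf{S}^u \rangle_F .
\end{gathered}
\]
The second limit is the unregularized OT plan (the maximum-entropy one if the minimizer is not unique). In the special case $K = J$ with a unique minimizer, this plan is $1/K$ times a permutation matrix, i.e., exactly the rigid one-to-one mapping that a larger $\epsilon$ relaxes.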
 
To efficiently solve Equation \ref{eqn:regularized_OT} for $\pi^u$, we employ the Sinkhorn algorithm \cite{sinkhorn_distance:2013}, which alternately updates two scaling vectors $\textbf{u}$ and $\textbf{v}$ until convergence, as presented in Algorithm \ref{alg:sinkhorn_alg}. 
This approach is differentiable and efficient, as it relies on matrix multiplications that are well supported on GPUs.
Since the Sinkhorn algorithm is theoretically proven to converge to the optimal transport plan \cite{computational_OT:2019}, we stop the gradient through $\pi^u$ once it is obtained from Algorithm \ref{alg:sinkhorn_alg} to improve efficiency. Empirically, we found that this practice speeds up training while preserving accuracy. 
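
For reference, the standard Sinkhorn scaling updates take the following form. The kernel $\textbf{G}$ and the initialization are notation introduced for this sketch, and Algorithm \ref{alg:sinkhorn_alg} may differ in details such as the stopping criterion. With $\textbf{G} = \exp(-\textbf{S}^u / \epsilon)$ taken element-wise and $\textbf{v}$ initialized to $\textbf{1}_J$, each iteration performs
\[
\textbf{u} \leftarrow p^y \oslash (\textbf{G}\textbf{v}), \qquad \textbf{v} \leftarrow p^t \oslash (\textbf{G}^T\textbf{u}),
\]
where $\oslash$ is element-wise division, and the plan is recovered as $\pi^u = \mathrm{diag}(\textbf{u})\,\textbf{G}\,\mathrm{diag}(\textbf{v})$. Each update enforces one of the two marginal constraints in $\mathcal{P}^u$ exactly, and one iteration costs $O(KJ)$ operations per user.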

\input{algorithm/sinkhorn}
%%%%%%%%%%%%%%%%%%%%%%%%
\subsubsection{Transference between interest factors}\label{sec:transformation_fusion}
To enable cross-modal interest transfer and enhance user representations, we propose two approaches. First, we introduce an alignment probability-guided regularization term, where the OT-derived alignment matrix $\pi^u$ guides the learning of connections between rating and text interest factors. Second, we employ a barycentric mapping strategy, projecting rating factors into the text space (and vice versa). This facilitates bidirectional interest transfer, refining interest representations and improving recommendation accuracy.

\textbf{Alignment probability-guided regularization }\label{sec:reg_transfer} optimizes a regularization term guided by $\pi^u$ as follows
% as the prior information to guide the optimization of following regularization term
\begin{equation}\label{eqn:OT_reg}
    \small
    \mathcal{L}^{OT}_u = \sum_{k=1}^K\sum_{j=1}^J \pi^u_{kj} \cdot ||\textbf{z}^{uy}_k - \textbf{z}^{ut}_j||_2^2 
\end{equation}
% Minimizing Equation ~\ref{eqn:OT_reg} enforces interests to be transferred from rating factors to text factors and vice versa. 
Thanks to $\pi^u$, the optimization focuses on transferring interest between the factors that are most likely to be aligned.
% , alleviating negative effect of noisy interest transfer between two factors that are not probably aligned. 
% $\mathcal{L}^{OT}_u$ is included in Equation ~\ref{eqn:final_objective} for optimization. 
Note that while regularization-based interest transfer has been explored in \cite{cdl:2015, cvae:2017, ADDVAE:2022}, both in non-disentangled and disentangled fashions, none of these is guided by alignment probabilities. 
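
To see how $\pi^u$ steers the transfer, consider the gradient of Equation \ref{eqn:OT_reg} with respect to a rating factor. The sketch below treats $\pi^u$ as a constant (consistent with the stop-gradient applied to $\pi^u$) and assumes the row-marginal constraint $\sum_{j}\pi^u_{kj} = 1/K$ holds exactly:
\[
\frac{\partial \mathcal{L}^{OT}_u}{\partial \textbf{z}^{uy}_k} = 2\sum_{j=1}^J \pi^u_{kj}(\textbf{z}^{uy}_k - \textbf{z}^{ut}_j) = \frac{2}{K}\Big(\textbf{z}^{uy}_k - K\sum_{j=1}^J \pi^u_{kj}\textbf{z}^{ut}_j\Big).
\]
Hence each rating factor is pulled toward a $\pi^u$-weighted average of the text factors, with weights $K\pi^u_{kj}$ that sum to one. The symmetric expression holds for $\textbf{z}^{ut}_j$ with $J$ in place of $K$.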
% $\mathcal{L}^t_u$ is empirically shown to be beneficial to \ourmethod's accuracy in Section ~\ref{sec:model_analysis}.\looseness=-1

\textbf{Mapping and fusing. }\label{sec:map_transfer} 
% To better transfer user interests between modalities, we not only involve Equation ~\ref{eqn:OT_reg} but also supervision signals from two modalities. 
% We fuse rating (text) factors with text (rating) counterparts so they could capture both modality signals, requiring a mutual mapping between these two.\looseness=-1
To capture interest signals across modalities, we fuse rating and text factors via barycentric mapping \cite{MappingDOT:2016, OT_DA:2016}.

\underline{\emph{Barycentric Mapping}}. 
% Barycentric strategy \cite{MappingDOT:2016, OT_DA:2016} offers an elegant way to map interest factors. 
Note that each entry $\pi^u_{kj}$ of the alignment matrix indicates how much of the probability mass of a rating factor $\textbf{z}^{uy}_k$ should be transferred to the text factor $\textbf{z}^{ut}_j$.
% how much probability mass of $\textbf{z}^{uy}_k$ to be transferred to $\textbf{z}^{ut}_j$. 
Thus, using $\pi^u$, we can map rating factors onto text space via solving $\hat{\textbf{z}}^{uy}_k = \argmin_{\textbf{s}^t \in \mathbb{R}^d} \sum_{j} \pi^u_{kj}c(\textbf{s}^t, \textbf{z}^{ut}_j)$, where $\hat{\textbf{z}}^{uy}_k$ is the transformation of $\textbf{z}^{uy}_k$ in text space and $c(\cdot, \cdot)$ is the cost function. Following \cite{OT_DA:2016}, the solution for $\hat{\textbf{z}}^{uy}_k$ is
\begin{equation}\label{eqn:transport_rating}
    \hat{\textbf{z}}^{uy}_k = \mathrm{diag}(\pi^u_k\textbf{1}_J)^{-1}\pi^u_k\textbf{z}^{ut}
\end{equation}
where $\textbf{z}^{ut} = \{\textbf{z}^{ut}_j\}_{j=1}^J \in \mathbb{R}^{J \times d}$. We apply Equation~\ref{eqn:transport_rating} for all $k = 1, 2, \dots, K$ to obtain $\{\hat{\textbf{z}}^{uy}_k\}_{k=1}^K$.
Similarly, we compute $\hat{\textbf{z}}^{ut}_j$, the transformation of text factor $\textbf{z}^{ut}_j$ onto the rating space:
 % We compute $\hat{\textbf{z}}^{ut}_j$ as
\begin{equation}\label{eqn:transport_text}
    \hat{\textbf{z}}^{ut}_j = \mathrm{diag}((\pi^u)^T_j\textbf{1}_K)^{-1}(\pi^u)^T_j\textbf{z}^{uy}
\end{equation}
where $\textbf{z}^{uy} = \{\textbf{z}^{uy}_k\}_{k=1}^K \in \mathbb{R}^{K \times d}$. We obtain $\{\hat{\textbf{z}}^{ut}_j\}_{j=1}^J$ by applying Equation~\ref{eqn:transport_text} for all $j = 1, 2, \dots, J$.
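
As a sanity check on Equation \ref{eqn:transport_rating}, the following sketch assumes the squared Euclidean cost $c(\textbf{s}, \textbf{z}) = ||\textbf{s} - \textbf{z}||_2^2$, matching the definition of $\textbf{S}^u$. Setting the gradient of the barycentric objective to zero gives
\[
\begin{gathered}
2\sum_{j}\pi^u_{kj}(\textbf{s}^t - \textbf{z}^{ut}_j) = \textbf{0} \\
\Longrightarrow\; \hat{\textbf{z}}^{uy}_k = \frac{\sum_{j}\pi^u_{kj}\textbf{z}^{ut}_j}{\sum_{j}\pi^u_{kj}} = K\sum_{j}\pi^u_{kj}\textbf{z}^{ut}_j,
\end{gathered}
\]
where the last step uses the uniform marginal $\sum_j \pi^u_{kj} = 1/K$. Each mapped factor is therefore a convex combination of the factors of the other modality. As a toy example (values chosen purely for illustration), with $K = J = 2$ and $\pi^u = \big(\begin{smallmatrix} 0.4 & 0.1 \\ 0.1 & 0.4 \end{smallmatrix}\big)$, we get $\hat{\textbf{z}}^{uy}_1 = 0.8\,\textbf{z}^{ut}_1 + 0.2\,\textbf{z}^{ut}_2$ and $\hat{\textbf{z}}^{ut}_2 = 0.2\,\textbf{z}^{uy}_1 + 0.8\,\textbf{z}^{uy}_2$.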

\underline{\emph{Adaptively Fusing.}} We fuse $\{\textbf{z}^{uy}_k\}_{k=1}^K$ with their transformed versions $\{\hat{\textbf{z}}^{uy}_k\}_{k=1}^K$ to create the input for the rating decoder, which enables rating signals to be explicitly transferred to the text space via $\{\hat{\textbf{z}}^{uy}_k\}_{k=1}^K$. As each user relies on ratings and texts in their own way when making decisions, we design an adaptive fusion layer as
\begin{equation}\label{eqn:fuse_rating}
    \tilde{\textbf{z}}^{uy}_k = \textbf{z}^{uy}_k + \rho^{uy}_k \cdot \hat{\textbf{z}}^{uy}_k, \hspace{2mm} \forall{k = 1, 2, ..., K} 
\end{equation}
where $\rho^{uy}_k = \log(1 + \exp(\zeta([\textbf{z}^{uy}_k; \hat{\textbf{z}}^{uy}_k])))$ is the fusion weight, $[\cdot\,; \cdot]$ denotes concatenation, and
$\zeta: \mathbb{R}^{2d} \rightarrow \mathbb{R}$ is a neural network. 
Similarly, a fusion layer is applied to text factors
\begin{equation}\label{eqn:fuse_text}
    \tilde{\textbf{z}}^{ut}_j = \textbf{z}^{ut}_j + \rho^{ut}_j \cdot \hat{\textbf{z}}^{ut}_j, \hspace{2mm} \forall{j = 1, 2, ..., J} 
    % \hspace{2mm} \text{with} \hspace{2mm} \rho^{ut} = log(1 + exp(\textbf{W}^T[\textbf{z}^{ut}_j; \hat{\textbf{z}}^{ut}_j]))
\end{equation}
where $\rho^{ut}_j = \log(1 + \exp(\zeta([\textbf{z}^{ut}_j; \hat{\textbf{z}}^{ut}_j])))$ and $\zeta$ is shared with the rating fusion layer. By this design, $\rho^{uy}$ and $\rho^{ut}$ are dynamically learned for each individual user.
Then, $\tilde{\textbf{z}}^{uy} = \{\tilde{\textbf{z}}^{uy}_k\}_{k=1}^K$ and $\tilde{\textbf{z}}^{ut} = \{\tilde{\textbf{z}}^{ut}_j\}_{j=1}^J$ are passed to the rating and text decoders, respectively.\looseness=-1
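
The fusion weight is a softplus of the score produced by $\zeta$, which has the following properties (a brief worked note; the shorthand $a$ for the scalar output of $\zeta$ is introduced here for exposition):
\[
\begin{gathered}
\rho(a) = \log(1 + \exp(a)) > 0, \qquad \rho(0) = \log 2 \approx 0.693, \\
\lim_{a \to -\infty}\rho(a) = 0, \qquad \rho(a) \approx a \ \text{ for large } a.
\end{gathered}
\]
Thus the fused factor falls back to the original factor when $a$ is very negative, and the contribution of the mapped factor grows roughly linearly for large $a$. Because $\rho$ is unbounded above, Equations \ref{eqn:fuse_rating} and \ref{eqn:fuse_text} are additive residual updates, not convex combinations.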
%%%%%%%%%%%%%%%%%%%%%%
\subsection{Decoder}\label{sec:decoder}
\textbf{Rating decoder} $\bm{\mathcal{D}}^y$ of rating channel accepts user $u$'s fused rating factors $\tilde{\textbf{z}}^{uy} = \{\tilde{\textbf{z}}^{uy}_k\}_{k=1}^K$
% , item assignment score matrix $\textbf{A}^{uy}$, item embedding matrix $\textbf{H}$, and temperature $\tau$ 
as input. 
$\bm{\mathcal{D}}^y$ predicts the probability of an interaction between a user $u$ and an item $i$ as the weighted sum of rating factors' predictions
\begin{equation}\label{eqn:pred_rating}
    \begin{gathered}
        \footnotesize
         p(\textbf{y}^u_i) = \frac{\sum_{k=1}^K\textbf{A}^{uy}_{ik} \cdot \exp(s({\tilde{\textbf{z}}^{uy}_k, \textbf{H}_i}) / \tau)}{\sum_{i'=1}^N \sum_{k=1}^K\textbf{A}^{uy}_{i'k} \cdot \exp(s({\tilde{\textbf{z}}^{uy}_k, \textbf{H}_{i'}}) / \tau)} 
    \end{gathered}
\end{equation}
where $s(\cdot, \cdot)$ is cosine similarity.
The learning objective combines a cross-entropy loss, which matches the predicted interaction probabilities $p(\textbf{y}^u)$ with the observed interactions $\textbf{y}^u$, with the KL divergence term from the rating encoder $\bm{\mathcal{E}}^y$, weighted by $\beta^y$:
\begin{equation}\label{eqn:rating_objective}
    \resizebox{0.95\columnwidth}{!}{$
    \mathcal{L}^y_u = -\sum_{i=1}^N\textbf{y}^u_i \ln p(\textbf{y}^u_i) +\beta^y \cdot D^y_{KL}(q(\textbf{z}^{uy}|\textbf{y}^u, \textbf{A}^{uy}) || p(\textbf{z}^{uy}))
    $}
\end{equation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\noindent
\textbf{Text decoder} $\bm{\mathcal{D}}^t$ of text channel has user $u$'s fused text factors $\tilde{\textbf{z}}^{ut} = \{\tilde{\textbf{z}}^{ut}_j\}_{j=1}^J$ as input.
% , word-cluster assignment matrix $\textbf{A}^{ut}$, word embedding matrix $\textbf{E}$, and temperature $\tau$. 
$\bm{\mathcal{D}}^t$ predicts the probability of a word $w$ appearing in textual content associated with user $u$ as the weighted sum of text factors' predictions
\begin{equation}
    \begin{gathered}
         \small
         p(\textbf{t}^u_w) = \frac{\sum_{j=1}^J\textbf{A}^{ut}_{wj} \cdot \exp(s({\tilde{\textbf{z}}^{ut}_j, \textbf{E}_w}) / \tau)}{\sum_{w'=1}^W \sum_{j=1}^J\textbf{A}^{ut}_{w'j} \cdot \exp(s({\tilde{\textbf{z}}^{ut}_j, \textbf{E}_{w'}}) / \tau)}
    \end{gathered}
\end{equation}
where $s(\cdot, \cdot)$ is cosine similarity. Similarly, the learning objective combines a cross-entropy term, which matches the predicted probabilities $p(\textbf{t}^u)$ with the observed textual information $\textbf{t}^u$, with the KL divergence term from the text encoder $\bm{\mathcal{E}}^t$, weighted by $\beta^t$:
\begin{equation}\label{eqn:text_objective}
    % \begin{gathered}
    \resizebox{0.95\columnwidth}{!}{$
    \mathcal{L}^t_u = -\sum_{w=1}^W\textbf{t}^u_w \ln p(\textbf{t}^u_w) + \beta^t \cdot D^t_{KL}(q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut}) || p(\textbf{z}^{ut}))
    $}
    % \end{gathered}
\end{equation}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Final learning objective.} Given a batch of users $\mathcal{B}$, \ourmethod\ minimizes 
$\mathcal{L} = \frac{1}{|\mathcal{B}|}\sum_{u \in \mathcal{B}} \left(\mathcal{L}^y_u + \lambda_t \cdot \mathcal{L}^t_u + \lambda_r \cdot \mathcal{L}^{OT}_u\right)$\label{eqn:final_objective}, 
where $|\mathcal{B}|$ is the batch size and $\lambda_t$ and $\lambda_r$ are hyper-parameters.
Algorithm \ref{alg:pseudo_code} presents the training procedure of \ourmethod.
\input{supplementary/algorithm/training}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Extension}\label{sec:extension}
While our method focuses on two modalities, it can be easily extended to multiple modalities. Suppose there is a set of user-associated modalities $\mathcal{T}$ (e.g., text, image, audio) in addition to the rating modality $y$. For each modality $m \in \mathcal{T} \cup \{y\}$, an encoder $\mathcal{E}^m$ (Section \ref{sec:encoder}) is employed to discover $K^m$ interest factors $\{\textbf{z}^{um}_{k}\}_{k=1}^{K^m}$ for each user $u$. Then, the OT-based alignment module $\mathcal{A}$ (Section \ref{sec:OT_alignment_module}) fuses interest factors from $m$ with those from $y$ to obtain $\tilde{\textbf{z}}^{um}$. Each modality $m$ has a decoder $\mathcal{D}^m$ that accepts $\tilde{\textbf{z}}^{um}$ as input and reconstructs the respective modality input. The learning objective becomes $\mathcal{L} = \mathcal{L}^{recon}_y + \sum_{m \in \mathcal{T}}(\lambda_m \cdot \mathcal{L}^{recon}_m + \lambda_{rm} \cdot \mathcal{L}^{OT}_{ym})$, where $\mathcal{L}^{recon}$ is the reconstruction loss as in Equations~\ref{eqn:rating_objective} and~\ref{eqn:text_objective}, while $\mathcal{L}^{OT}_{ym}$ regularizes the interest factors of the two modalities $m$ and $y$ as in Equation \ref{eqn:OT_reg}. These losses are weighted by $\lambda_m$ and $\lambda_{rm}$.