\section{Technical Appendix}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Implementation}
Algorithm \ref{alg:pseudo_code} presents the training procedure of \ourmethod.
For review purposes, we release \textbf{code, data and related materials} (e.g., data pre-processing scripts, final hyper-parameter configurations, and software specifications) at the anonymized link \emph{https://tinyurl.com/4fesktb2}.
\input{supplementary/algorithm/training}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Text Encoder} \label{sec:text_encoder_full}
Let $\bm{\mathcal{E}}^t$ be the encoder of the text channel in Figure \ref{fig:main_architecture}, which accepts the textual content $\textbf{t}^u$ associated with user $u$ as input.
To discover $J$ interest factors behind $\textbf{t}^u$, we leverage prototypes stored in $\textbf{m}^t \in \mathbb{R}^{J \times d}$.
Unlike for $\bm{\mathcal{E}}^y$, we empirically find that iteratively updating $\textbf{m}^t$ does not yield clear improvements. Thus, for efficiency, we set $\textbf{m}^{ut} = \textbf{m}^t$ for all users $u$, i.e., for textual content modeling, all users share the same set of prototypes. We first group $W$ words into $J$ clusters in a soft manner, producing the word assignment score matrix $\textbf{A}^{ut} \in \mathbb{R}^{W \times J}$ as\looseness=-1

\begin{equation}\label{eqn:text_prototype_update}
    \begin{gathered}
        \small
        \textbf{A}^{ut} = \eta\left(\frac{\textbf{E}\cdot (\textbf{m}^{ut})^T}{\tau \cdot ||\textbf{E}||_2 \cdot ||\textbf{m}^{ut}||_2}\right)
    \end{gathered}
\end{equation}
As in the rating encoder $\bm{\mathcal{E}}^y$, $\eta$ is the Gumbel-Softmax \cite{gumbel_softmax:2017, Concrete_dist:2017}.
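As a hedged sketch of how Equation \ref{eqn:text_prototype_update} can be read entry-wise (not necessarily the exact implementation), let $\textbf{e}_w$ denote the $w$-th row of $\textbf{E}$ and $\textbf{m}^{ut}_j$ the $j$-th prototype. Both row notations, the scores $s_{wj}$ and the noise variables $g_{wj}$ are introduced here for illustration only, and we assume that the norms are taken per row and that $\eta$ normalizes over the $J$ prototypes:
\begin{equation*}
    \begin{gathered}
        s_{wj} = \frac{\textbf{e}_w \cdot \textbf{m}^{ut}_j}{\tau \cdot ||\textbf{e}_w||_2 \cdot ||\textbf{m}^{ut}_j||_2}, \hspace{2mm} g_{wj} = -\log(-\log u_{wj}), \hspace{2mm} u_{wj} \sim \mathrm{Uniform}(0, 1)
        \\
        \textbf{A}^{ut}_{wj} = \frac{\exp(s_{wj} + g_{wj})}{\sum_{j'=1}^{J} \exp(s_{wj'} + g_{wj'})}
    \end{gathered}
\end{equation*}
Under this reading, $s_{wj}$ is a scaled cosine similarity bounded in $[-1/\tau, 1/\tau]$, and each row of $\textbf{A}^{ut}$ sums to one, so every word is softly distributed over the $J$ prototypes.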

Next, we estimate the two parameters $\bm{\mu}^{ut}_j$ and $\bm{\sigma}^{ut}_j$ of a Gaussian distribution for each text interest factor $j$ as follows:
\begin{equation}\label{eqn:text_factor_j_repr}
    \begin{gathered} 
        \small
        (\textbf{r}^{ut}_j, \textbf{o}^{ut}_j) = \textbf{W}_2 \tanh(\textbf{W}_1 norm(\textbf{A}^{ut}_j \odot \textbf{t}^u) + \textbf{b}_1) + \textbf{b}_2
        \\
        \bm{\mu}^{ut}_j = \frac{\textbf{r}^{ut}_j}{||\textbf{r}^{ut}_j||_2}; \hspace{2mm} \bm{\sigma}^{ut}_j = \sigma^t \cdot \exp\left(-\frac{1}{2}\textbf{o}^{ut}_j\right)
    \end{gathered}
\end{equation}
in which $\odot$ is element-wise multiplication and $norm(\textbf{x}) = \textbf{x} / ||\textbf{x}||_2$ normalizes its input to a unit-length vector for stable computation. $\textbf{W}_1 \in \mathbb{R}^{W \times D}, \textbf{b}_1 \in \mathbb{R}^{D}, \textbf{W}_2 \in \mathbb{R}^{D \times 2d}, \textbf{b}_2 \in \mathbb{R}^{2d}$ are weight matrices and bias vectors. Note that these learnable parameters are distinct from those of $\bm{\mathcal{E}}^y$. The value of $\sigma^t$ is around $0.1$, following \cite{macridvae:2019}.
The $j$-th text factor is then sampled as $\textbf{z}^{ut}_j \sim \mathcal{N}(\bm{\mu}^{ut}_j, [diag(\bm{\sigma}^{ut}_j)]^2)$, and this is repeated for all $j = 1, 2, \dots, J$.
Assuming independence between the text factors of user $u$, we have
$q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut}) = \prod_{j=1}^J \mathcal{N}(\bm{\mu}^{ut}_j, [diag(\bm{\sigma}^{ut}_j)]^2)$, which is called the variational distribution.
$q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut})$ is matched to the prior distribution $p(\textbf{z}^{ut}) = \mathcal{N}(\textbf{0}, (\sigma^t)^2\textbf{I})$ via the Kullback-Leibler divergence ($D^t_{KL}$). As $p(\textbf{z}^{ut})$ is a factorized distribution, optimizing $D^t_{KL}$ also imposes micro-disentanglement, i.e., disentanglement between the dimensions of a representation sampled from $q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut})$. We add the regularization term $D^t_{KL}(q(\textbf{z}^{ut}|\textbf{t}^u, \textbf{A}^{ut}) || p(\textbf{z}^{ut}))$ to Equation \ref{eqn:text_objective} for optimization.\looseness=-1
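
For reference, this KL term admits a closed form under the definitions above. The expression below is the standard divergence between a diagonal Gaussian and an isotropic Gaussian, instantiated with $\bm{\sigma}^{ut}_j = \sigma^t \cdot \exp(-\frac{1}{2}\textbf{o}^{ut}_j)$. The scalar index $k$ over the $d$ dimensions of a factor is notation introduced here, and whether the implementation keeps the constant term is not specified in this appendix.
\begin{equation*}
    \begin{gathered}
        D^t_{KL} = \frac{1}{2} \sum_{j=1}^{J} \sum_{k=1}^{d} \left[ \frac{(\sigma^{ut}_{jk})^2 + (\mu^{ut}_{jk})^2}{(\sigma^t)^2} - 1 - \log \frac{(\sigma^{ut}_{jk})^2}{(\sigma^t)^2} \right]
        \\
        = \frac{1}{2} \sum_{j=1}^{J} \sum_{k=1}^{d} \left[ \exp(-o^{ut}_{jk}) + o^{ut}_{jk} - 1 \right] + \frac{J}{2 (\sigma^t)^2}
    \end{gathered}
\end{equation*}
The second line uses $(\sigma^{ut}_{jk})^2 / (\sigma^t)^2 = \exp(-o^{ut}_{jk})$ and $||\bm{\mu}^{ut}_j||_2 = 1$, so the mean contributes only a constant and the gradient of the KL term acts on $\textbf{o}^{ut}_j$ alone. Each summand $\exp(-o) + o - 1$ is non-negative and vanishes at $o = 0$, i.e., when $\sigma^{ut}_{jk} = \sigma^t$.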

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Data Pre-processing}
In this work, we leverage four publicly available datasets: CiteULike-a, MovieLens, and two categories of the Amazon dataset (Cell Phones and Video Games), which are widely adopted in prior works \cite{MDCVAE:2022, ADDVAE:2022, TopicVAE:2022, SemMacridVAE:2023}.
For the CiteULike-a, Cell Phones and Video Games datasets, we use the accompanying textual content, i.e., title \& abstract for CiteULike-a and item descriptions for Cell Phones and Video Games. For Cell Phones, we retain users with at least $8$ interactions and items with at least $5$ interactions; for Video Games, these thresholds are $5$ and $5$, respectively. For MovieLens, we follow \cite{MDCVAE:2022} to extract a subset of users from the ML-10M version. We keep user ratings larger than $3$ as interactions, following \cite{macridvae:2019}. We collect item textual content for MovieLens from IMDB\footnote{https://datasets.imdbws.com/}. For all datasets, we remove stop words and only keep words with frequency higher than $3$ that appear in fewer than $60\%$ of item texts. Following \cite{MDCVAE:2022}, the top $8k$ words with the highest frequency are retained to construct the vocabulary.

We adopt the \emph{strong generalization} setting, following \cite{macridvae:2019}, to construct the training, validation and test sets by randomly choosing $80\%$ of users for training and $10\%$ of users each for the validation and test sets. For the validation and test sets, $20\%$ of each user's interactions are held out as ground truth (test data). To maintain dataset quality, we only retain items with at least $5$ words in their textual content, so that the textual content carries semantic information. All cold-start items, i.e., those that do not appear in the training set, are discarded since there are no parameters associated with them, following common practice in the field.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Baseline Description}
We compare \ourmethod\ against a series of comparable state-of-the-art recommendation models capable of handling unseen users in the strong generalization setting.
\begin{itemize}[leftmargin=*]
    \item \textbf{MacridVAE} \cite{macridvae:2019} introduces macro- and micro-disentanglement of user preferences via multi-prototype representation and independence regularization.\looseness=-1
    \item \textbf{RecVAE} \cite{RecVAE:2020} combines a composite prior, a rescaled regularization term and alternating training in a novel VAE-based recommendation model.
    \item \textbf{MDCVAE} \cite{MDCVAE:2022} regularizes decoder weights of the user-oriented autoencoder by latent embeddings inferred from textual content.
    \item \textbf{TopicVAE} \cite{TopicVAE:2022} improves the disentanglement of user preferences by designing attention-based topic extraction from textual content, a topic-guided contrastive loss and a heuristic method to set the weight of the regularization term.
    \item \textbf{ADDVAE} \cite{ADDVAE:2022} leverages two disentangled networks to model a user's ratings and the user's associated texts, then aligns the disentangled factors from these two modalities using compositional de-attention and regularization.
    \item \textbf{ELSA} \cite{ELSA:2022} improves on state-of-the-art linear autoencoders by factorizing the hidden space into a low-rank plus sparse structure.
    \item \textbf{SEM-MacridVAE} \cite{SemMacridVAE:2023} exploits semantic knowledge from side information to improve VAE-based disentangled recommendation models. We use the tf-idf item-word matrix, i.e., $\textbf{W} = \{\textbf{w}^i\}_{i=1}^N$, as side information for fair comparison.
    \item \textbf{VALID} \cite{VALID:2023} improves disentangling user preferences under VAE framework by employing iterative latent attention and implicit differentiation.
    \item \textbf{FacetVAE} \cite{FacetVAE:2024} disentangles the multi-faceted item space and derives compositional user interests via a bi-directional binding block.
\end{itemize}

We follow the strong generalization setting of MacridVAE \cite{macridvae:2019}, i.e., the validation and test sets contain users who do not appear in the training set. Thus, we only include baselines capable of predicting interactions for users unseen during training. Some other baselines capable of recommending items to unseen users, e.g., SLIM, EASE and SimpleX, are already outperformed by our strong baselines RecVAE, ELSA and VALID; we therefore retain only state-of-the-art models as baselines.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Hyper-parameter Configurations}
For the baselines, we follow their original papers to choose hyper-parameters, i.e., we use grid search over the ranges described in those papers. Hyper-parameters are selected based on performance on the validation set. We then retrain all baselines with the selected hyper-parameters and report results on the test set. Reported numbers are averaged over ten runs.

Regarding \ourmethod, the default settings are as follows: $D = 300$ for MovieLens and Cell Phones and $D = 600$ for CiteULike-a and Video Games; the embedding size is $d = 100$ for all datasets; the dropout rate applied to $\textbf{A}^{uy}$ and $\textbf{A}^{ut}$ is $0.5$; the numbers of rating and text factors are $K = 4$ and $J = 4$, respectively (more values of $K$ and $J$ are analyzed in subsequent sections); $\beta^y$ and $\beta^t$ follow the annealing process $\min(\beta_0, \frac{update}{T})$, where $\beta_0 = 1$ for the rating channel and $\beta_0 = 0.2$ for the text channel, $T$ is chosen from $\{1k, 5k, 10k, 20k\}$, and $update$ is the number of parameter updates; $\sigma^y$ and $\sigma^t$ are chosen from $\{0.05, 0.075, 0.1\}$; the search space of $\lambda_t$ and $\lambda_r$ is $\{0.1, 0.2, 0.5, 1, 2, 5\}$; and $\epsilon \in \{0.2, 0.5, 1\}$ in the Sinkhorn algorithm. The architecture of the fusion network $\zeta$ is $2d \rightarrow d/2 \rightarrow 1$. The numbers of prototype update steps $L^y$ in the rating encoder and $L^t$ in the text encoder are chosen from $\{1, 2, 3, 4\}$. We train \ourmethod\ using the Adam optimizer with learning rate $0.001$ on a machine with an NVIDIA RTX 2080 Ti GPU. Training stops after $30$ epochs without improvement on the validation set. The final hyper-parameters of \ourmethod\ can be found in the accompanying code.
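
As a worked instance of the annealing schedule above, take $T = 10k$, which is one of the listed candidates and is chosen here only for illustration. The function name $\beta(update)$ is notation introduced for this example:
\begin{equation*}
    \beta(update) = \min\left(\beta_0, \frac{update}{T}\right), \hspace{2mm} \beta(update) = \beta_0 \hspace{1mm} \text{for all} \hspace{1mm} update \geq \beta_0 \cdot T
\end{equation*}
With $T = 10k$, the rating channel ($\beta_0 = 1$) increases its KL weight linearly over the first $10{,}000$ parameter updates, whereas the text channel ($\beta_0 = 0.2$) reaches its cap after $0.2 \times 10{,}000 = 2{,}000$ updates and stays constant afterwards. Similarly, with $d = 100$, the fusion network $\zeta$ maps $200 \rightarrow 50 \rightarrow 1$ dimensions.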

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{figures/epsilon/epsilon}
\subsection{Effect of $\epsilon$ in Equation \ref{eqn:regularized_OT} on Recommendation Performance}
$\epsilon$ controls the sparsity of $\pi^u$: a small $\epsilon$ results in a highly skewed distribution in $\pi^u$, while a large $\epsilon$ leads to a roughly uniform distribution in $\pi^u$.
Thus, $\epsilon$ also controls the \emph{interpretability} of $\pi^u$.
That is, a small $\epsilon$ generates a nearly one-to-one mapping between rating and text factors, so that one rating factor can be explained by its matched text factor. In contrast, a large $\epsilon$ produces a roughly one-to-many mapping between rating and text factors, which makes it difficult to explain a rating factor via textual content. Figure \ref{fig:epsilon} shows the results. We observe that the effect of $\epsilon$ is data-dependent.
On CiteULike-a, Cell Phones and MovieLens, a moderate $\epsilon$, i.e., $\epsilon \geq 0.2$, leads to higher recommendation accuracy than smaller values, i.e., $\epsilon < 0.2$, indicating a trade-off between recommendation accuracy and interpretability, as the latter favors small $\epsilon$. On Video Games, $\epsilon < 0.2$ generally results in higher accuracy, suggesting that interpretability and accuracy go hand in hand on this dataset.
As the best value of $\epsilon$ is data-dependent, it should be tuned carefully to achieve good performance.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{figures/case_study/case_study_epsilon}
\subsection{Effect of $\epsilon$ on Alignment Probabilities $\pi^u$}
Theoretically, a small $\epsilon$ results in a sparse $\pi^u$ while a large $\epsilon$ leads to a roughly uniform $\pi^u$ in Equation \ref{eqn:optimal_plan}. Note that sparse alignment is favorable to interpretability, as it approximates a one-to-one mapping between rating and text factors. To verify this, we visualize the alignment probabilities produced by \ourmethod\ w.r.t. various values of $\epsilon$ in Figure~\ref{fig:epsilon_case_study}. As expected, a small $\epsilon$ results in a staggered pattern in the alignment distribution, while a large $\epsilon$ leads to a roughly uniform distribution.
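
For intuition, this behavior follows from the standard entropy-regularized optimal transport problem solved by the Sinkhorn algorithm. The sketch below uses generic notation, namely a cost matrix $\textbf{C} \in \mathbb{R}^{K \times J}$, marginals $\textbf{a}$ and $\textbf{b}$, the feasible set $U(\textbf{a}, \textbf{b})$ and scaling vectors $\textbf{u}$, $\textbf{v}$. All of these are assumed here for illustration and may differ from the exact form of Equation \ref{eqn:regularized_OT}:
\begin{equation*}
    \begin{gathered}
        \pi^\star = \arg\min_{\pi \in U(\textbf{a}, \textbf{b})} \langle \pi, \textbf{C} \rangle - \epsilon H(\pi), \hspace{2mm} H(\pi) = -\sum_{k=1}^{K} \sum_{j=1}^{J} \pi_{kj} \log \pi_{kj}
        \\
        \pi^\star_{kj} = u_k \cdot \exp\left(-\frac{C_{kj}}{\epsilon}\right) \cdot v_j
    \end{gathered}
\end{equation*}
where the positive vectors $\textbf{u}$ and $\textbf{v}$ are obtained by the Sinkhorn iterations so that $\pi^\star$ satisfies both marginal constraints. As $\epsilon \rightarrow 0$, $\pi^\star$ approaches an optimal plan of the unregularized problem, which is typically sparse; as $\epsilon \rightarrow \infty$, $\pi^\star$ approaches the independent coupling $\textbf{a}\textbf{b}^T$, which is uniform when both marginals are uniform.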

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{supplementary/num_factors/num_factors}
\subsection{Analysis on the Number of Interest Factors}\label{sec:num_factors}
Thanks to the pair-wise alignment between interest factors, \ourmethod\ can handle the case where the number of rating interest factors differs from the number of text interest factors. As users might behave differently in different modalities, this capability makes \ourmethod\ more widely applicable. We report recommendation accuracy w.r.t. various numbers of rating factors $K$ and text factors $J$ in Figure~\ref{fig:K_J_comp_supp}. There are four data-dependent observations.
First, on CiteULike-a, setting $K = 4$ generally results in better recommendation accuracy. An excessive number of rating factors, e.g., $K \geq 6$, has a detrimental effect. Increasing the number of text factors $J$, given $K = 4$, also has a negative effect.
Second, on Cell Phones, at least $4$ rating factors are required to model user interests.
Adding more rating factors, $K > 6$, is generally beneficial.
While increasing the number of rating factors helps, adding more text factors generally brings little benefit.
Third, on MovieLens, setting $K = 4$ generally achieves better accuracy. Regarding $J$, using an excessive number of text factors, i.e., $J > 6$, clearly hurts performance, and setting $J = 4$ offers both accuracy and efficiency for \ourmethod.
Fourth, on Video Games, at least $K = 4$ rating factors are required, and increasing $K$ further is generally helpful. In contrast, an excessive number of text factors degrades performance.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{figures/lambda_r/lambda_r}
\input{figures/lambda_t/lambda_t}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Effect of $\lambda_r$ on \ourmethod's Performance}
Figure~\ref{fig:lambda_r} presents \ourmethod's accuracy w.r.t. $\lambda_r$, which controls the weight of the regularization term for interest transfer between rating and text factors. First, we observe that setting $\lambda_r$ to $1$ or $0.5$ results in higher accuracy w.r.t. Recall@10 and NDCG@10 on the chosen datasets. Second, while the preference for $\lambda_r \geq 0.5$ shows that imposing a regularization term between interest factors from ratings and texts (as in Equation~\ref{eqn:OT_reg}) works to some degree, $\lambda_r$ should be chosen carefully, as an excessive value might have a detrimental effect.\looseness=-1

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Effect of $\lambda_t$ on \ourmethod's Performance}
Figure~\ref{fig:lambda_t} presents the influence of $\lambda_t$, which controls the weight of the text reconstruction objective, on \ourmethod's accuracy. First, on each dataset, choosing a proper value of $\lambda_t$ results in higher accuracy than setting $\lambda_t = 0$, showing that textual content reconstruction benefits \ourmethod's recommendation accuracy. This further supports our hypothesis that transferring interest signals between the rating and textual modalities is beneficial. Second, the influence of $\lambda_t$ varies across datasets; therefore, $\lambda_t$ should be chosen carefully to obtain higher Recall and NDCG.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{table/running_time/running_time}
\subsection{Efficiency Analysis}
Table \ref{tab:running_time} analyzes the efficiency of \ourmethod\ and the two strongest baselines, ADDVAE and VALID. For each model, we record the training time per epoch (in seconds, averaged over ten runs) and the memory required for training (in GB). There are three key takeaways. First, \ourmethod\ maintains a comparable efficiency level yet achieves higher recommendation accuracy than ADDVAE and VALID. Second, the training time and memory gaps between VALID and \ourmethod\ stem from the textual content modeling component, i.e., the text channel, which is present in \ourmethod\ but absent from VALID. Third, although both models include a textual content modeling module, \ourmethod\ employs multiple prototype updates in the rating encoder while ADDVAE does not, which explains the difference in efficiency between these two models.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{figures/prototype_update_rating/prototype_update_rating}
\subsection{Number of Rating Prototype Update Steps}\label{sec:rating_prototype_update}
Figure~\ref{fig:rating_prototype_update} shows \ourmethod's recommendation accuracy w.r.t. $L^y$, the number of prototype update steps in the rating encoder $\bm{\mathcal{E}}^y$. Using more than one prototype update step leads to higher recommendation accuracy on all chosen datasets, which supports the design of our rating encoder. This finding is consistent with \cite{VALID:2023}. Moreover, the best $L^y$ is data-dependent: the values of $L^y$ that achieve favorable recommendation accuracy on CiteULike-a, Cell Phones and MovieLens are $2$, $4$ and $3$, respectively.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\input{figures/prototype_update_text/prototype_update_text}
\subsection{Number of Text Prototype Update Steps}\label{sec:text_prototype_update}
Figure~\ref{fig:text_prototype_update} presents the influence of $L^t$, the number of prototype update steps in the text encoder $\bm{\mathcal{E}}^t$, on \ourmethod's performance. The effect of $L^t$ is contrary to that of $L^y$, i.e., involving more prototype update steps reduces \ourmethod's performance. We conjecture that, because users' comprehension of textual content, i.e., words and phrases, is roughly the same, imposing personalization on word clustering in the text encoder by setting $L^t > 1$ has a detrimental effect. As such, we fix $L^t = 1$ for all datasets throughout the paper to maintain both effectiveness and efficiency, since $L^t = 1$ consumes less memory than $L^t > 1$.