%%%%%%%%%%%%%%%%%%%%%%%
\textbf{VAE-based disentangled representation learning} aims at uncovering latent explanatory factors, enabling robust modeling of complex data patterns \cite{RLReview:2013}. Early works \cite{betavae:2017, betavae:2018, disen_factorize:2018, isolating_vae:2018, ChallengeDisenRL:2019} focused on disentangling the representation vector so that each dimension encodes a distinct feature. Recent advances extend this idea to disentangle user preference factors at both the dimension and intention levels \cite{macridvae:2019, VALID:2023, FacetVAE:2024, DualVAE:2024}. To enhance disentanglement, researchers have incorporated auxiliary data, e.g., textual content \cite{TopicVAE:2022}, visual information \cite{SemMacridVAE:2023}, multi-modal features \cite{AlignMacridVAE:2024}, and social relationships \cite{curcodis:2023}.
While these works share our goal of disentangling user preferences, our key distinction lies in leveraging optimal transport (OT) to align rating and text interest factors probabilistically. 
% Although we mainly focus on ratings and texts, our framework is extensible to multi-modal settings, as discussed in Section \ref{sec:extension}.

% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% \textbf{Textual content-aware recommendation. } 
% Early methods \cite{ctr:2011, cdl:2015, convmf:2016, gate:2019} leverage deep neural networks to model item textual content, thereby enhancing recommendation performance. Later, VAE has been widely adopted for this task, both in non-disentangled \cite{VBAE:2023, cvae:2017, MDCVAE:2022} and disentangled fashions \cite{dicer:2020, ADDVAE:2022, TopicVAE:2022}.
% What distinguishes our work from these is the introduction of an optimal transport (OT)-based approach to align and fuse interest factors from ratings and textual content, which provides a more flexible and nuanced alignment between interest factors.
% Recently, pre-trained language models (PLMs), e.g., \cite{BERT:2019}, have been explored to generate text-based item representations for recommendation \cite{UniSRec:2022, BM3:2023, TIGER:2023}. While PLMs offer powerful text encodings, they tend to compress the entire content into a single vector, which ignores the intricate structure and multi-faceted nature of textual data. In contrast, our work focuses on disentangling multiple interest factors from textual content to capture a richer representation of user preferences. Thus, we leave the integration of PLMs into our framework as a direction for future work.
% \cite{DGVAE:2024, AlignMacridVAE:2024} leverage textual and visual data for recommendation tasks, which differs significantly from ours, particularly in their use of multi-modal features. As a result, this work is not directly comparable with our model, which focuses solely on aligning ratings and textual content.
% Additionally, our work is related to hybrid recommender systems \cite{FM:2010, HybridSVD:2019, EASEr:2020, MDCF:2023}, which aim to tackle challenges like the cold-start problem by combining multiple data sources. However, our primary objective in this paper is to discover and align multiple interest factors across modalities in a warm-start setting, where sufficient interaction data is available. Addressing the cold-start problem, though relevant, falls outside the scope of this work.\looseness=-1
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\textbf{Text-aware recommendation} improves performance by integrating item textual content via neural networks \cite{ctr:2011, cdl:2015, convmf:2016, gate:2019}. VAEs have since been widely adopted, both in non-disentangled \cite{VBAE:2023, cvae:2017, MDCVAE:2022} and disentangled forms \cite{dicer:2020, ADDVAE:2022, TopicVAE:2022}. Our work differs by introducing optimal transport to probabilistically align rating and text interest factors. Pre-trained language models (PLMs) have also been adopted for text-based recommendation \cite{UniSRec:2022, TIGER:2023}. However, PLMs tend to compress an item's text into a single vector, overlooking its multi-faceted structure. Instead, we focus on disentangling multiple interest factors from texts. Other methods \cite{DGVAE:2024, AlignMacridVAE:2024} leverage multi-modal data (e.g., text and images), which differs from our focus on aligning ratings and texts. Although our work is related to hybrid recommendation \cite{FM:2010, HybridSVD:2019, EASEr:2020, MDCF:2023}, we follow the warm-start setting; addressing the cold-start problem is beyond our scope.

% %%%%%%%%%%%%%%%%%%%
% \textbf{Optimal transport and its applications. } 
% Optimal Transport (OT) offers an elegant framework to measure the distance between two probability distributions and facilitates the transformation of points from one distribution to another \cite{computational_OT:2019}. The popular method for computing optimal transport plan is Sinkhorn algorithm \cite{sinkhorn_distance:2013, AutoDiff:2018}, which offers efficient and GPU-friendly framework and thus, has enabled numerous applications across various domains. For example, OT has been utilized in domain adaptation \cite{OT_reglab:2014, OT_DA:2016}, model fusion \cite{fusion_OT:2020} and attention-based models \cite{Align_transformer:2021, Sinkformer:2022}.
% Additionally, OT has demonstrated its effectiveness in fusing multi-modal knowledge graph data \cite{OTKGE:2022}, enhancing the coherence of topic modeling via regularization \cite{ECRTM:2023}, and improving object-centric learning \cite{MESH:2023}.
% In recommender systems, OT has also been widely explored, e.g., aggregating non-local information in graph-based recommendation \cite{GOTNet:2022}, or finding user correspondence in cross-domain recommendation setting \cite{UDMCF:2024}.
% Our work builds on OT but adopts an orthogonal approach. 
% Specifically, we apply OT to align and fuse mutually disentangled user interest factors derived from ratings and textual content.
% By leveraging OT to perform this alignment in a data-driven manner, our approach allows for a more flexible and personalized representation of user preferences across modalities, leading to higher recommendation accuracy and offering a feasible method to gain insights into the relationship between user interactions and textual content.
\textbf{Optimal Transport (OT)} offers a principled framework for measuring the distance between two probability distributions and for transporting points from one distribution to the other \cite{computational_OT:2019}. The Sinkhorn algorithm \cite{sinkhorn_distance:2013, AutoDiff:2018} has enabled OT applications in domain adaptation \cite{OT_reglab:2014, OT_DA:2016}, model fusion \cite{fusion_OT:2020}, attention \cite{Align_transformer:2021, Sinkformer:2022}, multi-modal knowledge fusion \cite{OTKGE:2022}, topic modeling \cite{ECRTM:2023}, and object-centric learning \cite{MESH:2023}. In recommender systems, OT has been applied to graph-based aggregation \cite{GOTNet:2022} and cross-domain user correspondence \cite{UDMCF:2024}. Our work diverges by using OT to probabilistically align cross-modal interest factors. This approach not only improves recommendation accuracy but also provides interpretable insights into the relationship between user interactions and textual content, a novel application of OT in this domain.