User-item interactions are driven by many hidden factors. 
The Variational Autoencoder (VAE) offers an elegant framework for discovering multiple preference factors. Current studies range from disentangling user interests from rating data alone \cite{macridvae:2019, VALID:2023, FacetVAE:2024} to mining interest factors from both rating data and side information such as textual content \cite{TopicVAE:2022}, visual content \cite{SemMacridVAE:2023}, social relationships \cite{curcodis:2023}, and multi-modal data \cite{AlignMacridVAE:2024}.

Preference signals extracted from side information, such as textual content, can complement those derived from user ratings. Since rating data contains only user and item IDs, which lack semantic depth, incorporating textual content yields more expressive user and item representations. This is especially beneficial for users with limited interactions, as textual content offers additional insights into their preferences.
Moreover, text-based interest factors naturally offer interpretability of user preferences, as humans can understand their meaning.
% Therefore, we look into modeling and aligning interest factors from ratings and texts.

% \cite{ADDVAE:2022} pioneered aligning cross-modal interest factors for text-aware recommendation, extended subsequently by \cite{DGVAE:2024} in multi-modal recommendation. However, these works fixed a one-to-one correspondence to align rating- and text- preference factors, posing a couple of limitations.
% First, fixing the alignment between cross-modal preference factors is overly restrictive. This approach does not account for variations of alignment across users, leading to suboptimal performance.
% Second, they underutilize cross-modal interest signals by neglecting the variability of textual content's influence on user preferences. Thus, these models risks underperforming by failing to capture the nuances of personalized interest transference.
\cite{ADDVAE:2022} pioneered aligning cross-modal interest factors for text-aware recommendation, later extended to multi-modal settings by \cite{DGVAE:2024}. However, these works impose a fixed one-to-one correspondence between rating- and text-based preference factors, leading to two key limitations. 
% First, rigid alignments ignore user-specific variations, resulting in suboptimal performance. Second, they underutilize cross-modal signals by neglecting the personalized influence of textual content on preferences. 
First, the rigid one-to-one alignment of interest factors is shared across all users and thus ignores user-specific variations; e.g., some users may exhibit many-to-one or one-to-many interest correlations. Second, rating- and text-based factors are fused uniformly (e.g., by simple averaging), which assumes that both modalities are equally important for every user. However, users vary in how much they rely on textual versus rating signals when interacting with items.
These shortcomings hinder the ability of existing models to capture nuanced, adaptive interest transference across modalities.

We propose \ourmethod, short for \underline{B}arycentric \underline{A}lig\underline{N}ment of Mutually \underline{D}isentangled Interest Factors with \underline{V}ariational \underline{A}uto\underline{E}ncoder, a novel VAE framework that leverages optimal transport (OT) to address these gaps. 
\ourmethod\ learns soft, user-dependent alignments via OT, allowing more flexible and personalized cross-modal interactions, and adapts the fusion weights per user to capture each user's reliance on rating versus textual signals.
First, \ourmethod\ uncovers user preference factors from ratings and texts via unsupervised prototype learning. Second, \ourmethod\ reframes cross-modal preference factor alignment as an OT problem: interest factors from each modality are treated as distributions, and the Sinkhorn algorithm computes a probabilistic transport plan, i.e., the alignment matrix. This matrix serves dual roles: 1) to compute a regularization term that aligns cross-modal preference factors, avoiding rigid correspondence assumptions, and 2) to adaptively transfer interest signals across modalities via barycentric mapping. By integrating OT, \ourmethod\ transfers preference signals across modalities in a way that adapts to each user.
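As a generic illustration of the transport plan and the barycentric mapping (the notation here is assumed for exposition only and is not the formal definition of \ourmethod), let $\mathbf{a}$ and $\mathbf{b}$ be probability vectors over $K$ rating-based factors $\{\mathbf{z}_i\}$ and $K'$ text-based factors $\{\mathbf{t}_j\}$, and let $C_{ij}$ denote a cost of matching $\mathbf{z}_i$ to $\mathbf{t}_j$. The Sinkhorn algorithm solves the entropy-regularized problem
\[
\mathbf{P}^{\star} = \arg\min_{\mathbf{P} \in U(\mathbf{a},\mathbf{b})} \; \sum_{i,j} P_{ij} C_{ij} + \varepsilon \sum_{i,j} P_{ij} \left( \log P_{ij} - 1 \right),
\]
where $U(\mathbf{a},\mathbf{b}) = \{\mathbf{P} \geq 0 : \mathbf{P}\mathbf{1} = \mathbf{a},\ \mathbf{P}^{\top}\mathbf{1} = \mathbf{b}\}$ is the set of admissible transport plans and $\varepsilon > 0$ controls how diffuse the plan is. A barycentric mapping then carries each rating-based factor into the text space as a weighted average,
\[
\hat{\mathbf{z}}_i = \frac{1}{a_i} \sum_{j} P^{\star}_{ij}\, \mathbf{t}_j .
\]
Since the $i$-th row of $\mathbf{P}^{\star}$ sums to $a_i$, the weights $P^{\star}_{ij}/a_i$ sum to one, which permits soft many-to-one and one-to-many correspondences. When $K = K'$ and both marginals are uniform, a fixed one-to-one alignment corresponds to the special case in which $\mathbf{P}^{\star}$ is a scaled permutation matrix.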

% Firstly, \ourmethod\ simultaneously uncovers multiple user preference factors from both ratings and texts through unsupervised prototype-based representation learning. 

% Secondly, to overcome the challenge of aligning these preference factors across modalities, we frame this problem as an optimal transport (OT) task. Specifically, we treat the discovered interest factors from ratings and texts as the supporting points of two discrete distributions, and we seek to align them by computing the optimal transport plan. By leveraging the Sinkhorn algorithm, which solves an entropic regularized OT problem in our formulation, \ourmethod\ efficiently learns the alignment matrix between the factors in a data-driven manner, avoiding the suboptimal performance might stem from hiring a fixed alignment. 

% Thirdly, \ourmethod\ employs two complementary methods for transferring user preference signals between ratings and texts. The first method integrates an alignment probability-guided regularization term into the model's learning objective. This regularization encourages \ourmethod\ to align the interest factors from both modalities in a way that reflects their inherent relationships. The second method utilizes a barycentric mapping strategy, which transfers information between modalities by projecting factors from the rating space to the text space (and vice versa). This projection enables \ourmethod\ to reconstruct ratings using textual information and vice versa, allowing mutual supervision between the two modalities. By leveraging this cross-modal supervision, \ourmethod\ ensures that both rating and text signals contribute to refining user and item representations, ultimately improving recommendation accuracy.
% % Through these innovative approaches, \ourmethod\ not only addresses the limitations of previous models but also sets a new standard for textual content-aware recommendation systems. 
% Extensive experiments on real-world datasets demonstrate that \ourmethod\ consistently outperforms competitive baselines, showcasing its ability to capture and align user preferences more effectively across different datasets.

\textbf{Contributions.}
Our contributions are threefold. First, we bridge a gap in text-aware recommendation through the novel use of optimal transport (OT) for preference alignment. Second, we propose \ourmethod, which 1) leverages OT to adaptively align rating- and text-based interest factors, and 2) utilizes barycentric mapping and OT-guided regularization for cross-modal interest transference. Third, we validate \ourmethod's effectiveness through extensive experiments on real-world datasets, demonstrating its superiority over existing models.
In addition, we provide qualitative analysis to offer insight into the inner workings of our proposed optimal transport-based alignment of preference factors.