\section{Rebuttal}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reviewer BiEo}
Dear Reviewer BiEo,

We appreciate your thoughtful comments. The core motivation of our work is to address two limitations in prior works, which we believe are key bottlenecks for their performance.

(1) Rigid Alignment of Interest Factors:
Prior methods impose fixed one-to-one alignments between interest factors (e.g., aligning the first text-based factor with the first rating-based factor), shared across all users. This ignores user-specific variations—some users may exhibit many-to-one or one-to-many interest correlations. Our model instead learns soft, user-dependent alignments via an optimal transport-based approach, allowing more flexible and personalized cross-modal interactions. As shown in Figure 2, this flexibility leads to significantly better performance than rigid alignment (e.g., Diagonal), validating the importance of addressing this limitation.
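
To make the contrast with rigid alignment concrete, the following is a minimal illustrative sketch and not our actual implementation. It assumes entropic regularization solved with Sinkhorn iterations, uniform marginals over factors, and a cosine cost; the names \texttt{sinkhorn\_plan} and \texttt{cosine\_cost}, the toy factors, and all hyperparameters are hypothetical. It shows how a soft transport plan can recover a non-diagonal correspondence between rating-based and text-based factors that a fixed diagonal alignment would miss.

```python
import numpy as np


def cosine_cost(x, y):
    """Pairwise cost 1 - cosine similarity between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Entropic OT plan with uniform marginals (assumed here for illustration)."""
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # mass on rating-based factors
    b = np.full(m, 1.0 / m)  # mass on text-based factors
    kernel = np.exp(-cost / eps)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        # Alternate scaling so that row and column sums match a and b.
        u = a / (kernel @ v)
        v = b / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]


rng = np.random.default_rng(0)
rating_factors = rng.normal(size=(3, 8))
# Toy text factors: a permuted, slightly noisy copy of the rating factors,
# so the true correspondence is NOT the diagonal one.
text_factors = rating_factors[[2, 0, 1]] + 0.05 * rng.normal(size=(3, 8))

cost = cosine_cost(rating_factors, text_factors)
plan = sinkhorn_plan(cost)            # soft, data-dependent alignment
diagonal_plan = np.eye(3) / 3.0       # rigid one-to-one alignment

print(np.round(plan, 3))
print("OT cost:", (plan * cost).sum(), "diagonal cost:", (diagonal_plan * cost).sum())
```

In this toy setting the plan concentrates its mass on the permuted pairs, whereas the diagonal plan pairs unrelated factors. In a per-user setting the cost matrix, and hence the plan, would differ from user to user.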

(2) Uniform Treatment of Modalities:
Prior methods also treat rating- and text-based factors equally (e.g., simple averaging), assuming a universal importance of both modalities. However, users vary in how much they rely on textual versus rating signals. Our model adapts the fusion weights per user, capturing this personalized modality preference. As shown in Figure 3, this adaptive fusion yields clear improvements over fixed-weight baselines like ADDVAE.

(3) Regarding the use of item text:
We do not use static item-level texts. Instead, we aggregate the text of the items a user has interacted with to create a user-specific text representation (via tf-idf averaging). This representation reflects a user’s personal preferences from a textual perspective and justifies treating text and rating inputs symmetrically in our architecture (Figure 1). Their complementary roles help mitigate the sparsity and limitations of using ID-based signals alone. This setting is also adopted by baselines such as ADDVAE, which ensures a fair comparison. We also include baselines that leverage item-level text (MDCVAE, TopicVAE). As shown in Table 1, all of them are outperformed by our proposed model.
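
The aggregation step can be sketched as follows. This is a simplified illustration under stated assumptions rather than our exact preprocessing pipeline: whitespace tokenization, a smoothed idf, L2-normalized item vectors, and the helper names \texttt{tfidf\_matrix} and \texttt{user\_text\_profile} are all choices made for this example, and the item texts are invented.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(item_texts, stop_words=frozenset()):
    """Return (matrix, vocab) with one L2-normalized tf-idf row per item."""
    docs = [
        [w for w in text.lower().split() if w not in stop_words]
        for text in item_texts
    ]
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: j for j, w in enumerate(vocab)}
    n_docs = len(docs)
    doc_freq = Counter(w for doc in docs for w in set(doc))
    matrix = np.zeros((n_docs, len(vocab)))
    for i, doc in enumerate(docs):
        for w, count in Counter(doc).items():
            # Smoothed idf (an assumption; other idf variants work similarly).
            idf = math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0
            matrix[i, index[w]] = count * idf
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against empty documents
    return matrix / norms, vocab


def user_text_profile(item_tfidf, interacted_items):
    """Average the tf-idf rows of the items a user has interacted with."""
    return item_tfidf[list(interacted_items)].mean(axis=0)


item_texts = [
    "wireless noise cancelling headphones",
    "mechanical keyboard with rgb",
    "vr headset with motion controllers",
    "the studio headphones",
]
item_tfidf, vocab = tfidf_matrix(item_texts, stop_words={"the", "with"})

# A user who interacted with items 0 and 3 (both headphones).
profile = user_text_profile(item_tfidf, [0, 3])
print(vocab[int(profile.argmax())])
```

Because the profile is built from the user's own interaction history, two users see different text inputs even though the underlying item descriptions are shared.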

(4) Architecture and Item Prototypes:
We use a VAE-based framework to ensure fair comparison with baselines (e.g., ADDVAE, MDCVAE, TopicVAE, VALID). While alternative architectures are possible, our improvements stem from the alignment and fusion mechanisms, which are the focus of this work. As for item prototypes, clustering is essential for modeling multi-interest users; it is a proven strategy in prior work (MacridVAE, VALID, etc.) and is validated again in our experiments (Table 1).

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reviewer EQzS}
Dear Reviewer EQzS,

Thank you for highlighting these important points. We address the key concerns below:

(1) On Novelty and Complexity:
While BANDVAE builds upon the VAE foundation, its core novelty lies in introducing user-specific alignment across modalities via optimal transport. Prior works like Tran \& Lauw [2022] and Ma et al. [2019b] did not address the challenge of aligning user interests in a personalized fashion when these interests are expressed differently across ratings vs. texts. Our use of OT-based alignment and adaptive fusion tackles this gap explicitly. Though the model introduces added complexity, this is the cost of modeling richer, user-aware cross-modal relationships. As shown in Tables 2 and 8, this complexity leads to clear performance gains.

(2) On Handling Sparse or Noisy Texts:
To mitigate noise and sparsity, we apply standard text-cleaning techniques during preprocessing, such as frequency-based filtering via tf-idf and stop-word removal. We also aggregate the text of a user’s interacted items to form a denser user-level text signal. These strategies help ensure that even short or noisy item descriptions contribute meaningful information. Moreover, these steps are applied to all baselines, ensuring a fair comparison. We deliberately keep the preprocessing simple so that the performance gain can be attributed to our proposed alignment mechanism. Employing more advanced methods to generate cleaner text could further enhance our framework.

(3) On Multi-Modal Generalization:
While our current focus is on ratings and text, the OT-based alignment and fusion framework could be extended to other modalities such as images or audio. We acknowledge that image and audio data present unique challenges, specifically the lack of explicit semantic units comparable to words in the text domain. However, emerging methods like Slot Attention [Locatello et al., 2020] offer a promising direction for extracting such units from images. Integrating these advances is a natural next step toward extending BANDVAE to other modalities.

[1] Locatello et al. Object-Centric Learning with Slot Attention. NeurIPS 2020.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reviewer zbYj}
Dear Reviewer zbYj,

We appreciate your insightful suggestion on using pre-trained multi-modal models like CLIP or LLMs for item embeddings. While such embeddings could enrich item representations, our focus is on evaluating the effectiveness of optimal transport-based alignment for cross-modal preference modeling, rather than improving raw input quality.

To ensure a fair and controlled comparison, we follow existing baselines that rely on raw textual features rather than pre-trained embeddings. Moreover, embeddings from CLIP or LLMs compress content into a single vector, which can obscure disentangled interest factors—making them less compatible with our multi-interest VAE framework. Integrating these embeddings meaningfully would require substantial architectural adaptation, which we consider an exciting direction for future work.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\subsection{Reviewer fs9n}
Dear Reviewer fs9n,

We appreciate your helpful suggestions. We agree that providing stronger intuition behind our design choices would enhance clarity, and we will revise Section 3 to include a brief task overview, motivating examples, and intuitive guidance throughout.

(1) Design Motivation and Intuition with an Example:
Consider a user $u$ interested in VR devices, headphones, and keyboards. BANDVAE is designed to identify and align these interest factors from both ratings $\mathbf{r}^u$ and associated textual content $\mathbf{t}^u$. The encoder extracts multiple modality-specific interest vectors, and the optimal transport (OT) module then learns personalized alignments (for example, matching the headphone factor from ratings to its counterpart from text) rather than relying on rigid one-to-one alignments (as in ADDVAE). This flexibility is captured by the $\pi$-guided term in Eq. (4), which encourages nuanced, user-specific cross-modal alignment.

The A-guidance in Eq. (9) integrates item clustering: if an item strongly belongs to a certain prototype (e.g., `VR devices'), its predicted score is amplified accordingly. This encourages specialization among interest factors and improves recommendation precision.

Barycentric mapping and adaptive fusion are introduced to address the varied importance of modalities: some users prioritize collaborative signals (ratings), while others rely more on item descriptions (texts). These components dynamically adjust the fusion of modalities based on user-specific preferences, which is validated by our ablation results in Figure 2 and Figure 3.
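
As a rough illustration of these two components, consider the following sketch. The notation is shorthand introduced for this response and is not copied from the paper's equations. We assume $K$ factors per modality, write $\mathbf{z}^{u}_{r,k}$ and $\mathbf{z}^{u}_{t,l}$ for the rating- and text-based interest factors of user $u$, let $\pi^{u} \in \mathbb{R}_{\ge 0}^{K \times K}$ be the user's transport plan, and take the fusion weight $w^{u} \in [0,1]$ to be a scalar for simplicity. A standard barycentric projection first maps the text factors onto the rating factors, and the weight then fuses the two views:
\[
\hat{\mathbf{z}}^{u}_{t,k} = \frac{\sum_{l=1}^{K} \pi^{u}_{kl}\, \mathbf{z}^{u}_{t,l}}{\sum_{l=1}^{K} \pi^{u}_{kl}},
\qquad
\mathbf{z}^{u}_{k} = w^{u}\, \mathbf{z}^{u}_{r,k} + \left(1 - w^{u}\right) \hat{\mathbf{z}}^{u}_{t,k}.
\]
Under this simplified view, fixing $\pi^{u}$ to a scaled identity matrix and $w^{u} = 1/2$ for every user corresponds to rigid one-to-one alignment with simple averaging, whereas letting both quantities vary per user gives the flexible behavior described above.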

(2) Inference and VAE Prior: 
The inference process mirrors training: we compute predicted scores via Eq. (9) and recommend the top items unseen by the user. As in RecVAE, MacridVAE, VALID, TopicVAE, ADDVAE, etc., we omit the VAE prior during evaluation for stability and comparability.

We will revise the main section to include a brief VAE background and inference explanation for better self-containment, and add a running example to make the methodology more accessible.