DyGMamba: Efficiently Modeling Long-Term Temporal Dependency on Continuous-Time Dynamic Graphs with State Space Models

25 Sept 2024 (modified: 05 Feb 2025), Submitted to ICLR 2025, CC BY 4.0
Keywords: dynamic graph, state space model
Abstract:

Learning useful representations for continuous-time dynamic graphs (CTDGs) is challenging, due to the concurrent need to span long node interaction histories and grasp nuanced temporal details. In particular, two problems emerge: (1) Encoding longer histories requires more computational resources, making it crucial for CTDG models to maintain low computational complexity to ensure efficiency; (2) Meanwhile, more powerful models are needed to identify and select the most critical temporal information within the extended context provided by longer histories. To address these problems, we propose a CTDG representation learning model named DyGMamba, originating from the popular Mamba state space model (SSM). DyGMamba first leverages a node-level SSM to encode the sequence of historical node interactions. Another time-level SSM is then employed to exploit the temporal patterns hidden in the historical graph, where its output is used to dynamically select the critical information from the interaction history. We validate DyGMamba experimentally on the dynamic link prediction task. The results show that our model achieves state-of-the-art in most cases. DyGMamba also maintains high efficiency in terms of computational resources, making it possible to capture long temporal dependencies with a limited computation budget.

Supplementary Material: zip
Primary Area: learning on graphs and other geometries & topologies
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5146

Paper Decision

Decision by Program Chairs, 22 Jan 2025, 05:28 (modified: 05 Feb 2025, 05:44)
Decision: Reject

Meta Review of Submission 5146 by Area Chair 5yym

Meta Review by Area Chair 5yym, 17 Dec 2024, 00:36 (modified: 05 Feb 2025, 05:18)
Metareview:

The paper employs the Mamba architecture to address dynamic representation learning on graphs, and is arguably one of the first papers to do so. The experiments are extensive, with a good range of datasets and SOTA baselines. The writing is generally detailed and clear.

However, reviewers raise some concerns about the novelty of the approach. As the paper utilizes an existing Mamba architecture, the contribution of this paper should be better clarified in terms of how specifically the proposed approach leverages the unique characteristics of dynamic graphs. On a related note, reviewers also find the motivation for using Mamba to model dynamic graphs somewhat unconvincing, and remain doubtful whether it is necessary to use Mamba to model very long historical sequences.

Additional Comments On Reviewer Discussion:

Only one reviewer, rZap, is positive about this paper, and they did not champion it despite being explicitly asked by the AC.

The other reviewers, zXNQ, TJxG, and Gzn2, appear unconvinced by the rebuttals and kept their ratings, while Reviewer 1QFu lowered their rating after the rebuttal. The main concerns can be found above in the meta review, and more than one reviewer echoed those points.


Revision 2.0 Updated

Official Comment by Authors, 28 Nov 2024, 12:11 (modified: 28 Nov 2024, 12:12)
Comment:

We thank the reviewers again for the further discussions they initiated. We have added more experiments to address the further concerns stated by Reviewers TJxG and Gzn2. We have also highlighted the latest changes with a yellow background.

The key change:

We have added Appendix L, in lines 1230-1349 of revision 2.0, to include DyGMamba's performance on the 6 DTDG datasets mentioned in DyGLib. We compare it with DyGFormer to show its effectiveness in Tables 18, 19, 20 and 21. We show that: (1) DyGMamba can outperform DyGFormer on these DTDG datasets in almost all cases; (2) DyGMamba benefits from a larger $k$ when the number of sampled neighbors increases (e.g., on Can. Parl, where the number of sampled neighbors is 2048, the optimal value of $k$ grows to 100), implying that our choice of Mamba for temporal pattern modeling is reasonable.

We still want to emphasize that we design DyGMamba to model CTDGs. To address the reviewers' concerns, we present experiments on DTDGs in our revision 2.0. Our results show that even though our model is not designed for DTDGs, it still excels at modeling them. We hope the reviewers could understand, and we are looking forward to further discussion.


Revision Updated

Official Comment by Authors, 23 Nov 2024, 23:56
Comment:

We thank all reviewers for their effort in reviewing. We have updated our submission. We have also highlighted our changes with a yellow background in our new revision. The line numbers stated below refer to the new revision, not the first version.

Here are the key changes:

  1. Line 129. We have fixed the dimension typo from our first version.
  2. Lines 134-144 and 1143-1161, Appendix H. We have provided more details about the SSM operation over vector sequences. We have also explained in detail what Single-Input Single-Output means.
  3. Lines 175-176 and 284-285. We have added an explanation of what the critical information is and how it can be selected.
  4. Lines 264-265 and 1163-1171, Appendix I. We have explained why we use an SSM to model temporal patterns.
  5. Tables 3 and 4, ablation study, lines 404-408 and 412-416. We have added ablations C and D to verify the importance of dynamic information selection with edge-specific temporal patterns. Variants C and D were developed based on Reviewer Gzn2's suggestion in the second point of Question 3. They show that modeling temporal patterns improves dynamic information selection and helps the model achieve better performance.
  6. Table 5. We have also provided the performance of Variants C and D on the synthetic datasets, again verifying that temporal pattern modeling contributes substantially to CTDG modeling.
  7. Lines 510 and 1192-1218, Appendix J. Beyond Enron, we have run the analysis "Impact of Patch Size on Scalability and Performance" on MOOC to address the concern of Reviewer TJxG (Weakness 5). We find that the conclusions drawn on Enron also hold on MOOC.

Here are other supplementary and minor changes:

  1. Lines 108-112. We have included more related work.
  2. Lines 225-228. We have explained what the Broadcast function is.
  3. Lines 258-260. We have explained batch processing in more detail.
  4. Lines 510 and 1173-1191, Appendix J. We have further explained the different performance trends among the models in Fig. 3a.
  5. Lines 528-529 and 1129-1142, Appendix G.4. We have provided a more detailed derivation of the complexities of DyGFormer and DyGMamba.
  6. Lines 530 and 1220-1229, Appendix K. We have added a discussion of potential limitations and solutions.
  7. Lines 927-932, Appendix C.1. We have explained how we choose the value of $\gamma$.

Official Review of Submission 5146 by Reviewer zXNQ

Official Review by Reviewer zXNQ, 06 Nov 2024, 08:30 (modified: 12 Nov 2024, 16:18)
Summary:

This paper studies how the Mamba model can be applied to continuous-time dynamic graphs, focusing especially on the dynamic link prediction problem. The research problem is interesting, and investigating a new neural architecture for graph representation learning is exciting. A detailed review of the pros and cons can be seen in the following sections.

Soundness: 2: fair
Presentation: 2: fair
Contribution: 2: fair
Strengths:
  1. The research problem is interesting, and the link prediction is a pragmatic and important application.

  2. The study has the potential to solve the dynamic representation learning and applications from the Mamba perspective.

  3. The paper's organization is not hard to follow.

Weaknesses:
  1. [Minor] The preliminary section is not informative. Ideally, the preliminaries should pave the way for illustrating the proposed method. So far, the illustration of Mamba is not clear, and its connection with the proposed method is not well aligned. Also, the dimensions in Eq. (1a) seem inconsistent for the matrix computation.

  2. For Eq. (4a) the computation procedure for $Broadcast_{4d}$ is missing. Also, in Eq (4d), the computation procedure for $SSM_{A,B,C}$ is also missing. Only the sentence "similar to Eq.2" is inadequate, for example, what is the relation between $H_{\theta}^{t}$ with $p_{\tau}$ and $q_{\tau}$?

  3. The motivation and the necessity of using Mamba for dynamic graph learning are not that clear. Section 3.2 and Section 3.3 pay much attention to introducing the equation. However, the motivation and intuition for those equations are missing. For example, for the proposed "Dynamic Information Selection with Temporal Patterns", what is being selected and what is being filtered? Correspondingly, the ablation study design is reasonable, but the explanation is not adequate.

  4. The experimental design follows the SOTA, i.e., DyGFormer, which makes it easy to evaluate the performance of the proposed DyGMamba. Why are the selected datasets and sampling methods truncated from the full version of DyGFormer? For example, 6 of the 13 given datasets are not included, and one sampling method is not fully explored.

  5. Between lines 526 to 529, more derivation for the complexity conclusion will be appreciated.

  6. Overall, the theoretical contribution seems incremental.

Questions:

Please consider the raised concerns in the above section.

Flag For Ethics Review: No ethics review needed.
Rating: 3: reject, not good enough
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
Code Of Conduct: Yes

Response to Reviewer zXNQ: Part A

Official Comment by Authors, 18 Nov 2024, 11:29 (modified: 18 Nov 2024, 16:22)
Comment:

Thank you for the review. We have clarified some potential misunderstandings and have corrected some technical details. We will release our revision and update the pdf. Please see our response below for details.

Weakness 1

Please note that our introduction follows the original Mamba paper [1]. In [1], the authors first introduce S4 with their Eqs. 1, 2 and 4, which are the same as Eqs. 1a and 2 in our paper. Mamba's authors then mention in their Section 3.2 that the difference between Mamba and S4 is that Mamba introduces input-dependent parameters that are used to achieve the selection mechanism, which also corresponds to lines 138-140 of our submission. We have explicitly specified how the Mamba parameters are computed given the input data in our Eqs. 4 and 6. To better align the contents, we will add more clarification in the revision.

As for the dimension confusion in Eq. 1a, we really appreciate that you have spotted this. The dimension of $\mathbf{B}$ and $\mathbf{C}$ should be $\mathbb{R}^{d_1 \times 1}$ and $\mathbb{R}^{1 \times d_1}$, respectively. We will correct the dimension error in the revision.
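
To make the corrected shapes concrete, and assuming Eq. 1a follows the standard S4 form with a scalar input $q(\tau)$ and a scalar output $r(\tau)$, the equation with consistent dimensions reads

$$\frac{\mathrm{d}\mathbf{z}(\tau)}{\mathrm{d}\tau} = \mathbf{A}\mathbf{z}(\tau) + \mathbf{B}q(\tau), \qquad r(\tau) = \mathbf{C}\mathbf{z}(\tau),$$

where $\mathbf{A} \in \mathbb{R}^{d_1 \times d_1}$, $\mathbf{B} \in \mathbb{R}^{d_1 \times 1}$, $\mathbf{C} \in \mathbb{R}^{1 \times d_1}$ and $\mathbf{z}(\tau) \in \mathbb{R}^{d_1 \times 1}$, so that $\mathbf{A}\mathbf{z}(\tau)$ and $\mathbf{B}q(\tau)$ are both of size $d_1 \times 1$ and $r(\tau) = \mathbf{C}\mathbf{z}(\tau)$ is a scalar.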

Weakness 2

Please note that all these notations are inherited from the original Mamba paper [1], including what Broadcast($\cdot$) is (last paragraph of page 5 of [1]) and the SSM$(\cdot)$ operation (Algorithms 1 and 2 of [1]). Additionally, other related Mamba papers also use the same notations, such as Vision Mamba [2] (e.g., its Algorithm 1).

To further clarify, Broadcast$_{4d}(x)$ is an operation that takes an input $x \in \mathbb{R}^{l \times 1}$ and expands it into a matrix of dimension $\mathbb{R}^{l \times 4d}$ by copying $x$ $4d$ times. We use this notation following [1] (as in the first sentence of the last paragraph on its page 5).
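
As a minimal NumPy sketch of this operation (purely illustrative; the values of $l$ and $d$ are hypothetical):

```python
import numpy as np

l, d = 8, 16                    # hypothetical sequence length and model dimension
x = np.random.randn(l, 1)       # input of shape (l, 1)
x_b = np.tile(x, (1, 4 * d))    # Broadcast_{4d}: copy the single column 4d times
assert x_b.shape == (l, 4 * d)
```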

As for SSM$(\cdot)$, we think the confusion mainly comes from our omission of how the representation of the complete sequence is computed, as our Eq. 2 only specifies how the next state is computed. We are adding the detailed form of the SSM operation in our revision. Basically, it is the same as Eqs. 3a and 3b in [1] (also the same as Eq. 3 in [2]), performing a convolution along the whole sequence given the SSM parameters $\overline{\mathbf{A}}$, $\overline{\mathbf{B}}$ and $\mathbf{C}$. Please also notice that in lines 136-141 we have explained that the SSM is implemented in a SISO fashion when the input dimension (called a channel in [1]) is greater than 1, strictly following [1] (its page 4, "Structure and Dimensions."). This means that if we input a sequence of high-dimensional vectors into an SSM, the SSM processes each input dimension in parallel with the same set of parameters. In our Eq. 2, $p_\tau$ and $q_\tau$ are one-dimensional inputs and each denotes one element in a sequence. $\mathbf{H}^t_{\theta}$ is an input matrix, where each row denotes an input vector; it can be viewed as a sequence of vector inputs arranged in matrix form. We believe that we have provided sufficient explanation in our paper, but for better readability, we are putting a more detailed explanation in our revision.
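
To make the SISO processing concrete, here is a minimal NumPy sketch (illustration only, not our implementation) that runs the discretized recurrence of our Eq. 2 independently over each channel of a multi-dimensional input with one shared set of parameters. For simplicity the parameters are fixed rather than input-dependent as in Mamba, and all names and shapes are hypothetical:

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, Q):
    """Discretized SSM run channel-wise (SISO): for each of the D channels of Q,
    z_tau = A_bar z_{tau-1} + B_bar q_tau and r_tau = C z_tau, with shared parameters.
    A_bar: (d1, d1), B_bar: (d1, 1), C: (1, d1), Q: (L, D) -> R: (L, D)."""
    L, D = Q.shape
    d1 = A_bar.shape[0]
    R = np.zeros((L, D))
    for ch in range(D):                         # each channel is treated as a scalar sequence
        z = np.zeros((d1, 1))
        for tau in range(L):
            z = A_bar @ z + B_bar * Q[tau, ch]  # state update for this channel
            R[tau, ch] = (C @ z).item()         # scalar output per step
    return R

# toy usage with arbitrary parameters
d1, L, D = 4, 10, 3
R = ssm_scan(0.9 * np.eye(d1), 0.1 * np.ones((d1, 1)), np.ones((1, d1)), np.random.randn(L, D))
```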


Response to Reviewer zXNQ: Part B

Official Comment by Authors, 18 Nov 2024, 11:40 (modified: 18 Nov 2024, 13:14)
Comment:

Weakness 3

We have clearly stated our motivation in our abstract and introduction. Let us re-clarify it step-by-step:

  1. Recent methods for dynamic graph representation learning such as DyGFormer and CTAN have found that modeling long-term temporal information is necessary and contributes greatly. So we follow this finding and wish to propose a model that can achieve that. This corresponds to lines 43-53 of our submission.

  2. The current state-of-the-art DyGFormer incorporates long-term temporal information by modeling very long historical node interaction sequences. Due to the high complexity of the Transformer in sequence modeling, DyGFormer requires much more computational resources as the sequence length increases. This poses a challenge for efficiently modeling very long node interaction sequences (mentioned as the first challenge in lines 52-53). Due to SSMs' low complexity in sequence modeling, we choose the Mamba SSM as an alternative to improve efficiency in long sequence modeling (lines 57-60).

  3. Another challenge arises when we model long node interaction sequences. Long sequences introduce more information, placing a burden on models to distinguish useful temporal information from redundant parts (lines 53-56). Note that the claim "long-term temporal information is useful in dynamic graph modeling" does not contradict the claim "we need to select the important part and discard the redundant information". To address this, we propose to use the learned temporal patterns to help select information. We have used an example to illustrate our motivation for leveraging temporal patterns to achieve this (lines 61-69). Temporal patterns can help to prioritize information selection from relevant node interactions. The relevance of interactions is represented with weights ($\beta_\theta$ in Eq. 7) which are derived based on temporal patterns.

Weakness 4

Please note that only 7 datasets from DyGLib are continuous-time dynamic graphs (CTDGs), and we have considered all of them. CTDGs are completely different from discrete-time dynamic graphs (DTDGs). We have explained their differences in lines 35-42 and have explicitly specified that we only consider CTDG modeling in our title, abstract and everywhere else.

We have also mentioned and explained why we do not consider the historical negative sampling evaluation in the inductive setting in lines 327-332 and Appendix D. We wish you could pay attention to our contents more carefully.

Weakness 5

Thank you for the suggestion, we will include more detailed complexity analysis in our appendix.

Weakness 6

We appreciate and respect your opinion. But we have to emphasize that, from our perspective, ICLR should not be a platform where only papers delivering substantial theoretical contributions are valued. We admit that our submission is a more technical paper rather than a theoretical one. Our contributions have been clearly outlined in lines 70-77. We believe these contributions are enough for our work to appear at ICLR, and they also meet the interest of the community. We hope you could understand.

[1] Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." arXiv preprint arXiv:2312.00752 (2023).

[2] Zhu, Lianghui, et al. "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model." Forty-first International Conference on Machine Learning.

Replying to Response to Reviewer zXNQ: Part B

Official Comment by Reviewer zXNQ

Official Comment by Reviewer zXNQ, 03 Dec 2024, 06:50
Comment:

Thanks for the reply. After reading the reply and other reviews, I tend to keep my current rating.


Official Review of Submission 5146 by Reviewer TJxG

Official Review by Reviewer TJxG, 04 Nov 2024, 08:42 (modified: 12 Nov 2024, 16:18)
Summary:

This paper proposes a novel continuous-time dynamic graph (CTDG) representation model named DyGMamba to address the challenges of learning long-term temporal dependencies. DyGMamba consists of a node-level state space model (SSM) and an edge-level SSM, which aim to learn node-level and edge-level representations, respectively. The outputs of the node-level SSM and the edge-level SSM are combined for future link prediction. The authors conduct extensive experiments to evaluate DyGMamba.

Soundness: 2: fair
Presentation: 2: fair
Contribution: 1: poor
Strengths:
  1. The experiments are extensive.
Weaknesses:
  1. The novelty of this work is low. Some important modules are highly similar to existing works. Specifically:

    • In Section 3.1, The "Encode Neighbor Features" and " Patching Neighbors" sections are highly similar to DyGFormer [1].
    • The main network architecture is taken from the original Mamba [2], without incorporating the structural information of the dynamic graph (see Eqs. 5, 6 and 7).
  2. The authors claim that applying Mamba for dynamic graph learning is to address the challenge of computation complexity in capturing very long-term temporal dependencies. This raises several concerns:

    • Why should we learn from very long history? Intuitively, very long history should be discarded since it has a minor impact on the current event. In addition, in Fig 3 (a), the performance of DyGFormer and variant A is not increasing as sequence length increases. This supports that the model performance does not necessarily increase as sequence length increases.

    • In addition, I think DyGFormer has addressed the problem of learning long-term dependencies by its patching technique. It can learn very long history if its patch size is large enough.

  3. What is "critical temporal information" mentioned in Line 54? How to define it? Is there evidence that DyGMamba can indeed capture this?

  4. The performance improvement of DyGMamba is marginal (Table 1 and 2). Most improvements are within 0.5%.

  5. In Fig. 3, why are the experiments only conducted on Enron? More datasets should be included.

[1] Yu L, Sun L, Du B, et al. Towards better dynamic graph learning: New architecture and unified library[J]. Advances in Neural Information Processing Systems, 2023, 36: 67686-67700.

[2] Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces[J]. arXiv preprint arXiv:2312.00752, 2023.

Questions:

See the weakness.

Flag For Ethics Review: No ethics review needed.
Rating: 3: reject, not good enough
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Code Of Conduct: Yes

Response to Reviewer TJxG: Part A

Official Comment by Authors, 18 Nov 2024, 13:11 (modified: 18 Nov 2024, 18:20)
Comment:

Thank you for the review. We have clarified some potential misunderstandings and have corrected some technical details. We will release our revision and update the pdf. Please see our response below for details.

Before providing our detailed response, we would like to point out that while your review indicates a confidence level of 5, it contains serious misunderstandings (see our response to Weakness 2). We believe the review should reflect greater responsibility, especially given the stated confidence of 5. We feel that such a high-confidence review should avoid these errors. We appreciate your effort in reviewing and hope our response can help you and other reviewers better judge our work.

Weakness 1

We appreciate and respect your opinion. Please consider this issue from the following aspects:

  1. We kindly ask you to refer to various recent publications that apply Mamba to different domains. For example, Vision Mamba [1] transforms image patches into a sequence and directly uses Mamba to encode it, similar to the idea of ViT [2]. These publications have gained great attention from the related communities, and we believe that they share the same type of "novelty" as our work. In our opinion, making Mamba work in a new domain (in our paper, the domain of dynamic graph reasoning) is already a noticeable contribution. We have also demonstrated with extensive experiments (Section 4 and Appendix G) that DyGMamba can achieve strong performance and high efficiency simultaneously.

  2. Besides, we have proposed a new module based on Mamba which dynamically selects the critical information from node interaction histories, and it is particularly designed to improve continuous-time dynamic graph (CTDG) modeling. We have demonstrated with experiments (Tables 3, 4 and 5) that this dynamic information selection module is important for our model. We believe that this is a great contribution to the domain of dynamic graph reasoning.

  3. DyGFormer's neighbor feature encoding is effective, so we encode neighbors based on it. Besides, DyGFormer's patching improves model efficiency because the sequence length is divided by the patch size. To ensure a fair efficiency comparison, we preserve patching and maintain identical input interaction sequence lengths for both DyGMamba and DyGFormer (as indicated in lines 319-322). We have also demonstrated with experiments (lines 468-504 and Fig. 3) how patching affects DyGMamba. We have shown that patching hurts DyGMamba's performance and parameter efficiency. So when we care more about performance and parameter efficiency, patching is not needed for DyGMamba, which is different from DyGFormer.

  4. Finally, we want to clarify that several recent CTDG methods, such as DyGFormer and GraphMixer, only consider the node interaction sequences. This has been proven to be effective. Since a CTDG is a stream of events, modeling interaction sequences is also a natural choice. We follow this line of work when developing DyGMamba and think it should not be a point of criticism.


Response to Reviewer TJxG: Part B

Official Comment by Authors, 18 Nov 2024, 13:28 (modified: 18 Nov 2024, 13:33)
Comment:

Weakness 2

Why should we learn from very long history? Intuitively, very long history should be discarded since it has a minor impact on the current event.

Longer histories provide more hints for prediction, and models should be able to distinguish which part of the history contributes the most. For example, consider a periodic event that occurs infrequently and is surrounded by numerous other events between each occurrence. A model focusing on short-term history cannot capture this periodicity, while a model that learns from long-term history can probably capture it.

In Fig 3 (a), the performance of DyGFormer and variant A is not increasing as sequence length increases. This supports that the model performance does not necessarily increase as sequence length increases.

There is a serious misunderstanding here. The amount of considered temporal information is decided by the number of sampled historical node interactions $\rho$, while the sequence length $\rho/p$ depends on both $\rho$ and the patch size $p$. Sequence length alone cannot accurately reflect the amount of temporal information considered. In Fig. 3a, all models share the same amount of temporal information. We use this figure to show that decreasing the patch size makes DyGMamba perform better, while leading to inferior results for DyGFormer and Variant A. Also, the claim that "capturing long-term temporal dependencies helps dynamic graph reasoning" has been well discussed in DyGFormer [3]. Please refer to [3] for detailed explanations.

In addition, I think DyGFormer has addressed the problem of learning long-term dependencies by its patching technique. It can learn very long history if its patch size is large enough.

It is not reasonable to say patching "has addressed" this issue. First, increasing the patch size substantially increases the number of model parameters (shown in Fig. 3b). Imagine we now need to model extremely long historical node interaction sequences. In this case, patching introduces significantly more parameters, demanding more training data and epochs for effective model optimization. This raises the concern that using a very large patch size for modeling long sequences could hinder optimization, making patching less likely to be an optimal solution. Besides, we show in Fig. 3b that, given the same patch size, DyGMamba consumes far fewer parameters than DyGFormer. This means that when we model extremely long histories, DyGMamba is a better choice than DyGFormer, since fewer parameters make DyGMamba easier to optimize. Finally, given a fixed computational budget, DyGMamba can maintain a smaller patch size than DyGFormer. For example, in Fig. 3c and 3d, DyGMamba with $p = 2$ requires 4152 MB of GPU memory and 190s of per-epoch training time, while DyGFormer with $p = 4$ requires 4442 MB and 192s. This implies that with a limited computational budget, we can tune DyGMamba with a smaller patch size, which further lowers the difficulty of model optimization. To summarize, we agree that patching is effective, but it is far from completely addressing the problem, and DyGMamba provides a more flexible choice.


Response to Reviewer TJxG: Part C

Official Comment by Authors, 18 Nov 2024, 15:05
Comment:

Weakness 3

  1. Our example in lines 59-69 has answered your question. It indicates that in dynamic graph reasoning, some historical node interactions may be misleading for prediction. So the "critical temporal information" here means the historical interactions that should be focused on more and that can lead to the correct prediction.

  2. We have also explained how DyGMamba selects critical information in lines 268-275. Given a pair of nodes $u$ and $v$, we wish to build a connection between them based on their historical temporal pattern. The temporal neighbors of $u$ and $v$ are encoded separately in the node-level SSM. Before dynamic information selection, each node has an encoded neighbor sequence, where each element in this sequence corresponds to an encoded temporal neighbor. To build a connection between $u$ and $v$, we use the learned temporal pattern to help compute a weighted sum of encoded temporal neighbors for each node and take this sum as the input to the prediction head. For example, to compute a weighted sum $\mathbf{h}^t_u$ for node $u$, we first compute an $\alpha_u$ based on the encoded temporal pattern $\mathbf{h}^t_{u,v}$ and $v$'s temporal neighbors $\mathbf{H}^t_v$ (Eq. 7a, 7b, 7c). $\alpha_u$ is then used to compute a score with each row of $u$'s encoded neighbors ($\mathbf{H}^t_u$). The scores are normalized into a weight vector $\beta_u$ that is applied to compute a weighted sum of $u$'s temporal neighbors, i.e., $\mathbf{h}^t_u$. Please note that the weights in $\beta_u$ are decided by the temporal pattern as well as $v$'s encoded neighbors. This builds a strong connection between $u$ and $v$. Each number in $\beta_\theta$ (Eq. 7d) denotes the importance of one of $\theta$'s temporal neighbors. In this sense, "critical temporal information" means the temporal neighbors assigned greater weights in $\beta_\theta$. A toy sketch of this selection step is given below.
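
For illustration only, here is a minimal NumPy sketch of the selection step described above from node $u$'s perspective. The projection `W` and the exact way $\alpha_u$ is formed are hypothetical placeholders standing in for Eq. 7a-7c; the equations in the paper define the precise operations:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def select_for_u(h_pattern, H_u, H_v, W):
    """Toy sketch of dynamic information selection for node u.
    h_pattern: (d,)  encoded temporal pattern of the (u, v) pair,
    H_u, H_v:  (n, d) encoded temporal neighbors of u and v,
    W:         (d, d) hypothetical projection standing in for Eq. 7a-7c."""
    alpha_u = W @ (h_pattern + H_v.mean(axis=0))  # placeholder: alpha_u depends on the pattern and v's neighbors
    scores = H_u @ alpha_u                        # one score per temporal neighbor of u
    beta_u = softmax(scores)                      # normalized weights (Eq. 7d): importance of each neighbor
    return beta_u @ H_u                           # weighted sum h_u^t fed to the prediction head

# toy usage
d, n = 8, 5
h_u = select_for_u(np.random.randn(d), np.random.randn(n, d), np.random.randn(n, d), np.random.randn(d, d))
```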

Weakness 4

We have clearly specified in our title and introduction that we are aiming to propose an efficient model that can capture long-term temporal dependencies. We kindly ask you to pay more attention to the efficiency analysis in this work. As for performance, we are also achieving state-of-the-art. We have provided all implementation details in Section 4.1 and appendices. We have also provided our code for your validation. We believe that this is not a weakness and hope you could understand.

Weakness 5

Enron is relatively small compared with the other two long-range temporally dependent datasets. The whole Fig. 3 consists of 12 experiments even with one random seed, so we chose Enron as a representative.

[1] Zhu, Lianghui, et al. "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model." Forty-first International Conference on Machine Learning.

[2] Dosovitskiy, Alexey. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

[3] Yu, Le, et al. "Towards better dynamic graph learning: New architecture and unified library." Advances in Neural Information Processing Systems 36 (2023): 67686-67700.


Response to authors

Official Comment by Reviewer TJxG, 22 Nov 2024, 06:26 (modified: 22 Nov 2024, 06:41)
Comment:

I appreciate the authors' detailed response to my questions. However, after viewing the authors' responses to all reviewers and the updated manuscript, many of my concerns are not addressed. I restate some of my concerns as follows.

  • Motivation (initial Weakness 2). The motivation of this work does not hold. The main motivation of this work is stated in Paragraph 2 of the Introduction: "How to develop a model that is scalable in modeling very long-term historical interactions?" However, in my opinion, it is unnecessary to develop a model that is capable of modeling very long-term historical interactions, based on the following experimental evidence. As stated by the authors, the amount of considered temporal information is decided by the number of sampled historical node interactions. I notice the following experimental results: 1) From Fig. 6 of your updated PDF, on the Enron dataset, it can be seen that the performance of TGN, CAWN and CTAN drops as the number of sampled neighbors increases when the neighbor number is larger than 64. 2) From Fig. 2 of the DyGFormer paper [1], on the LastFM and MOOC datasets, the performance of DyRep, TGN, TCL and DyGFormer does not always increase as the number of sampled neighbors increases. This experimental evidence indicates that for many dynamic graph models, modeling very long interaction history will hurt the performance. Therefore, the motivation of this paper does not hold.

  • Model (initial Weakness 1). Given that the aforementioned motivation does not hold, it is unnecessary to introduce the corresponding module, namely Mamba, into dynamic graph modeling. Therefore, introducing Mamba into dynamic graph models is a trivial combination, which does not address meaningful problems in dynamic graph learning. In addition, most of the essential modules of this work, including the Encode Neighbor Features, Patching Neighbors and SSM block, are ported over from existing works.

  • The experimental performance (initial Weakness 4). My initial concern still holds. The performance improvement of DyGMamba compared to baselines is marginal. The authors claim in the rebuttal that "we are aiming to propose an efficient model that ...". Does it mean “Your work doesn't care about performance improvement, only on efficiency”

  • About efficiency. If the authors want to demonstrate the efficiency of DyGMamba when modeling long-term temporal dependency, they should compare the efficiency metrics (e.g., training time, GPU memory) of DyGMamba with all baseline methods, not only DyGFormer (including the experiments of Fig. 3 and 4).

  • Limited experiments (initial Weakness 5). After rebuttal, the authors do not include the experiments on new datasets. Since the conditions of different datasets are very different, the conclusions based on Fig.3 made by authors do not convince me.

[1] Yu L, Sun L, Du B, et al. Towards better dynamic graph learning: New architecture and unified library[J]. Advances in Neural Information Processing Systems, 2023, 36: 67686-67700.

Replying to Response to authors

Response to Further Concerns

Official Comment by Authors, 22 Nov 2024, 13:58
Comment:

Thank you very much for initiating further discussion! We totally respect your opinion and wish to give you more clarification.

it is unnecessary to develop a model that is capable of modeling very long-term historical interactions

It is not that modeling very long-term historical interactions is unnecessary; the question is how to do it effectively. As you have mentioned, most previous models do not have the ability to improve their performance given long-term historical information. This is due to previous methods' inability to capture long-term temporal information, and it cannot prove that more temporal information is useless. Please also note that in Fig. 6, DyGMamba and DyGFormer can benefit from much longer histories. This proves that it is critical to develop more advanced models to effectively leverage long-term information. We believe this point has been well discussed in DyGFormer. Please have a look at their discussion, especially the second paragraph of Sec. 5.4 regarding its Fig. 2.

Given that the aforementioned motivation does not hold, it is unnecessary to introduce the corresponding module, namely Mamba, into dynamic graph modeling. Therefore, introducing Mamba into dynamic graph models is a trivial combination, which does not address meaningful problems in dynamic graph learning.

We understand your concern. But still, we believe we have justified our motivation in the last paragraph, so we think introducing Mamba into dynamic graph modeling addresses a meaningful problem: effectively and efficiently modeling long-term temporal information.

In addition, most of the essential modules of this work, including the Encode Neighbor Features, Patching Neighbors and SSM block, are ported over from existing works.

We have indeed proposed a new dynamic information selection module based on a time-level SSM, in addition to introducing Mamba into dynamic graph reasoning. We have clearly stated our motivation and have paid a lot of attention to justifying the merit it brings (e.g., the experiments in ablation A, on synthetic datasets, and in the efficiency analysis in Fig. 3). Also, we still believe that the inclusion of SSMs into dynamic graph representation learning is well-motivated, reasonable, and not trivial (as discussed in our paper and throughout our rebuttal).

Additionally, we kindly ask you to pay attention to another paper accepted at last year's ICLR: FreeDyG: Frequency Enhanced Continuous-Time Dynamic Graph Model for Link Prediction (https://openreview.net/forum?id=82Mc5ilInM). If you take a closer look, you will find that it is a light combination of DyGFormer and GraphMixer, accompanied by a newly proposed frequency-enhancing layer. FreeDyG leverages the same neighbor sampling and neighbor encoding as DyGFormer, and borrows GraphMixer's mixing layer for information fusion among sampled neighbors. As stated in their paper and rebuttal, their contribution is the proposal of a Node Interaction Frequency Encoding (which is similar to DyGFormer's Neighbor Co-occurrence Encoding), as well as the frequency-enhancing layer, which employs the Fourier transform to enhance the hidden states of encoded one-hop temporal neighbors. We appreciate their work and believe that our work is at least as good as theirs, considering the type and amount of contribution to the community.

Does it mean “Your work doesn't care about performance improvement, only on efficiency”

We think there is a misunderstanding here. We mentioned in the rebuttal that we wish to develop an efficient model. That does not mean that we do not care about performance. Our main goal is to develop a model that is both efficient and effective. We highlighted efficiency in the rebuttal because your focus was biased towards performance, and we wished to draw more of your attention to efficiency. Again, as for performance, we have already achieved state-of-the-art. Even on the datasets where we do not perform the best, we achieve the best overall rank across all 10 baselines on 7 datasets. We respect your opinion and hope to address and alleviate your concerns.

they should compare the efficiency metrics (e.g., training time, GPU memory) of DyGMamba with all baseline methods,

We are afraid you have missed important parts in our paper. We have already provided complete efficiency statistics among all baselines and DyGMamba in Sec. 4.3 and Appendix G.1. Please have a closer look.

After rebuttal, the authors do not include the experiments on new datasets. Since the conditions of different datasets are very different, the conclusions based on Fig.3 made by authors do not convince me.

The rebuttal is still ongoing. We have been working on our revision and will present your requested experiments directly in the pdf. Sorry for causing misunderstanding.


Official Comment by Reviewer TJxG

Official Comment by Reviewer TJxG, 28 Nov 2024, 08:54 (modified: 28 Nov 2024, 08:57)
Comment:

I appreciate the authors' further response and additional experiments. Below are my current primary concerns:

  • I still question the necessity of utilizing very long historical information. This is because DyGFormer, due to its use of the Transformer model, is already capable of effectively leveraging long-term temporal information. However, when the neighbor sequence becomes particularly long, performance tends to decline (see Fig. 2 of the DyGFormer paper). In addition, DyGFormer tested performance with a sequence length of 2^11, while the maximum length used in the DyGMamba experiments demonstrating performance variation with sequence length is 2^8, as seen in Figure 6. Therefore, it is questionable whether the performance of DyGMamba will increase with very long historical information (far longer than that of DyGFormer). I am not suggesting that the authors add more experiments, but expressing my concern.

  • In terms of model design, it's undeniable that this paper primarily utilizes existing modules (e.g., Mamba and encodings from DyGFormer) as the main components of the model. The only novel module seems to be the dynamic information selection module (Eq. 6). Also, I think it is unfair to say "your contribution is larger than FreeDyG". The amount of contribution of two papers is hard to compare.

Due to above concerns, I decide to maintain my original score.

Replying to Official Comment by Reviewer TJxG

Response to Further Concerns: Part B

Official Comment by Authors, 28 Nov 2024, 14:37
Comment:

Thank you very much for clarifying your current concerns! We hope our following response can at least mitigate them.

I still question the necessity of utilizing very long historical information. This is because DyGFormer, due to its use of the Transformer model, is already capable of effectively leveraging long-term temporal information. However, when the neighbor sequence becomes particularly long, performance tends to decline (see Fig. 2 of the DyGFormer paper).

As you have pointed out, DyGFormer's performance starts to drop when the neighbor sequence becomes particularly long (i.e., > 512 on LastFM and > 2048 on Can. Parl). 512 on LastFM and 2048 on Can. Parl are already very large numbers in temporal neighbor encoding, already indicating that the model is using very long historical information. The reason for calling 512 a large number is that, as shown in Table 7 of DyGFormer (arXiv version, Appendix), previous models such as GraphMixer and TGAT only consider at most 30 temporal neighbors when they are used to reason over the same datasets. The strong capability of DyGFormer in modeling very long historical information lets it benefit from much longer sequences and achieve better performance. Intuitively, 512 and 2048 may not seem huge; however, compared with previous models that can only process at most 30, these numbers are very big. So we think it is reasonable to follow DyGFormer and claim that it is important to utilize very long historical information.

In addition, DyGFormer tested performance with a sequence length of 2^11, while the maximum length used in the DyGMamba experiments demonstrating performance variation with sequence length is 2^8, as seen in Figure 6. Therefore, it is questionable whether the performance of DyGMamba will increase with very long historical information (far longer than that of DyGFormer). I am not suggesting that the authors add more experiments, but expressing my concern.

Thank you for raising this issue. We have included our new experiments on additional datasets in revision 2.0, Appendix L. These include a comparison between DyGMamba and DyGFormer on Can. Parl, where both models sample 2048 ($2^{11}$) temporal neighbors as historical information. We find that DyGMamba clearly outperforms DyGFormer by a substantial margin under every setting. This shows that the performance of DyGMamba increases with very long historical information. We hope this can address your concern.

In terms of model design, it's undeniable that this paper primarily utilizes existing modules (e.g., Mamba and encodings from DyGFormer) as the main components of the model. The only novel module seems to be the dynamic information selection module (Eq. 6). Also, I think it is unfair to say "your contribution is larger than FreeDyG". The amount of contribution of two papers is hard to compare.

We have clearly outlined the motivation behind our model design and explained, step by step, how it addresses the challenges in dynamic graph representation learning. We fully understand and respect your perspective and deeply appreciate the effort you have put into helping us refine our paper. Our intention is not to compel agreement with our views but to engage in a constructive and respectful discussion. We hope our response has provided a clearer understanding of our work, and we welcome any additional comments or suggestions for further experiments that you believe would enhance our paper. Thank you again for your thoughtful engagement.


Official Comment by Reviewer TJxG

Official Comment by Reviewer TJxG, 29 Nov 2024, 02:33 (modified: 29 Nov 2024, 02:34)
Comment:

Thanks for your response. I further clarify my suggestion as follows. If you wish to demonstrate the advantage and necessity of DyGMamba in learning extremely long historical information over DyGFormer, I suggest reporting how DyGMamba's performance varies with sequence length. The sequence length here should be much longer than the best configurations of DyGFormer. For instance, since DyGFormer performs best on the LastFM dataset when the sequence length is 512, you could report the model's performance at lengths of 2, 4, ..., 512, 1024, 2048, 4096, and observe whether the performance monotonically increases.

Replying to Official Comment by Reviewer TJxG

Response to Further Concerns: Part C

Official Comment by Authors, 02 Dec 2024, 11:59
Comment:

If you wish to demonstrate the advantage and necessity of DyGMamba in learning extremely long historical information over DyGFormer, I suggest reporting how DyGMamba's performance varies with sequence length. The sequence length here should be much longer than the best configurations of DyGFormer.

Great suggestion! We believe we have already included related experiments in our first and subsequent versions to demonstrate this. In addition, we think scaling the sequence length to an extremely large number is not the only way to demonstrate the advantage and necessity of DyGMamba in learning very long historical information over DyGFormer. Let's clarify this from three aspects:

  1. As mentioned in our last response, methods prior to DyGFormer consider at most 30 temporal neighbors. So we think the numbers of temporal neighbors we have considered on Enron (256), LastFM (512) and even Can. Parl (2048) are already large enough to be called "very long". DyGMamba's strong performance with these large numbers of sampled neighbors has proven the necessity of modeling long historical information.

  2. In Appendix G.2, Fig. 6, we have provided a performance comparison on Enron between DyGMamba and DyGFormer when the sequence length scales from 1 to 256, with a fixed patch size and an increasing number of sampled temporal neighbors. DyGMamba consistently outperforms DyGFormer at every sequence length, and its performance improves steadily as the sequence grows longer. This means that with an increasing number of sampled neighbors, DyGMamba can better utilize more temporal information, showing its advantage over DyGFormer.

  3. In Fig. 3 and Fig. 7, we have shown that as the sequence length increases, DyGMamba's performance improves while DyGFormer's worsens. In this experiment, sequence lengths are changed by changing the patch size, not by increasing the number of sampled temporal neighbors. This also shows the advantage of DyGMamba in temporal sequence modeling on growing sequences.

To summarize, based on our observations, we think DyGMamba's advantage comes from its consistently better performance compared with DyGFormer under the same experimental settings (the same number of sampled temporal neighbors and patch size), as well as its consistently improving performance given an increasing amount of temporal information. The necessity comes from the capability of DyGMamba and DyGFormer to benefit from much longer neighbor sequences compared with previous methods. And let's not forget the superior efficiency of DyGMamba, which is one of our greatest motivations in model design.

Thank you for the suggestion. We hope our explanation can give you a better understanding of our considerations.


Official Review of Submission 5146 by Reviewer rZap

Official Review by Reviewer rZap, 04 Nov 2024, 07:45 (modified: 12 Nov 2024, 16:18)
Summary:

This paper proposes a Mamba model for representation learning on continuous-time dynamic graphs. It has a node-level SSM to encode node interactions over time and a time-level SSM to encode edge-specific temporal information. Both representations are interleaved for dynamic information selection.

Soundness: 4: excellent
Presentation: 3: good
Contribution: 3: good
Strengths:

This is arguably the first Mamba model for dynamic graph representation learning, and I think the long context modeling ability of Mamba is suitable for temporal graph learning. I appreciate the designs of two types of SSM blocks that consider both node-level and edge-level information, both of which encode critical information about temporal patterns. The proposed DyGMamba has a satisfactory performance by outperforming baseline methods with higher accuracy in link prediction, shorter training time (per epoch and in total), and less memory usage compared to DyGFormer.

Weaknesses:

The overall design of DyGMamba makes sense to me, as it basically employs SSM block for node features and edge features. I have a few questions regarding the experiments in this paper.

  • All datasets used in this paper do not have node features. Based on TGAT, I assume the authors are using all-zero vectors as node features. I wonder how DyGMamba would perform when there are node features, e.g., on GDELT dataset.

  • Meanwhile, the ablation studies in Tables 3 and 4 are interesting. As you shift from the transductive setting to the inductive setting, Variant A has slightly worse performance than Variant B consistently on LastFM, Enron, and MOOC. Could the authors offer theoretical insights on why this happens? I do not see any specific patterns for these 3 datasets in the statistics in Table 6. Is there any specific design of the node-level SSM or edge-level SSM that would cause this performance change? If yes, are there any insights that could be offered to users when deploying this model on other data?

Questions:

See weaknesses part.

Flag For Ethics Review: No ethics review needed.
Rating: 8: accept, good paper
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Code Of Conduct: Yes

Response to Reviewer rZap

Official Comment by Authors, 18 Nov 2024, 15:14
Comment:

Thank you for the review and we really appreciate your recognition. Please take a look at our response and feel free to propose more follow-up questions. According to other reviewers' suggestions, we have been working on our new revision. We will update the pdf once completed. We hope our response could answer your questions and confirm your judgment.

Weakness 1

We believe node features can be used as the initialization of node representations. As for GDELT, do you mean the temporal knowledge graph, or another dataset? To our understanding, continuous-time dynamic graphs (CTDGs) are different from temporal knowledge graphs (TKGs), since TKGs are discrete-time dynamic graphs. Our focus in this paper is CTDGs, so we do not consider TKGs. Besides, TKGs are normally studied separately in the knowledge graph community because they are relational graphs, so we think they are out of our scope. We follow previous works studying general dynamic graphs, e.g., DyGFormer and GraphMixer, and only focus on the popular datasets mentioned by them. If we have misunderstood your question, could you please correct us?

Weakness 2

Thank you for pointing this out. We believe that directly comparing Variant A and Variant B may not be particularly meaningful because it is not a controlled comparison. Variant A is restrained from performing dynamic information selection by removing the time-level SSM, while still keeping the node-level SSM. Variant B removes the node-level SSM, while still keeping the time-level SSM. Directly comparing these two variants cannot tell whether the node-level SSM or the time-level SSM contributes to the performance change.

If we have to directly compare Variant A with Variant B, the phenomenon you mentioned implies that, to reason over long-range temporally dependent datasets, i.e., LastFM, Enron and MOOC, the time-level SSM is more important than the node-level SSM. By contrast, if we look at the results in Tables 3 and 4 regarding the other datasets, we find that Variant A performs better than Variant B. This implies that, to reason over datasets that are not long-range temporally dependent, the node-level SSM is more important than the time-level SSM. We explain the reason for these findings as follows. DyGMamba considers many more temporal neighbors when it is implemented on long-range temporally dependent datasets. More temporal neighbors introduce more temporal information, causing greater difficulty in distinguishing the useful information from the redundant parts. In this case, the time-level SSM is more important because it is the critical part that enables dynamic information selection. On the datasets where DyGMamba does not consider many temporal neighbors, the importance of the time-level SSM becomes lower, and the node-level SSM becomes relatively more critical. To summarize, if we want to deploy DyGMamba to new datasets, we need to rely more on the time-level SSM and dynamic information selection if the datasets are long-range temporally dependent. Our suggestion is to keep both the time-level and node-level SSMs because, according to our ablation studies, both modules are important.

Replying to Response to Reviewer rZap

Official Comment by Reviewer rZap

Official Comment by Reviewer rZap, 21 Nov 2024, 19:32
Comment:

I really appreciate the authors' efforts in addressing my concerns -- all my concerns are addressed. I will maintain my evaluation as of now, but I will also check others' reviews to see if there is any legitimate concern during internal discussion.


Official Review of Submission 5146 by Reviewer 1QFu

Official Review by Reviewer 1QFu, 01 Nov 2024, 13:41 (modified: 22 Nov 2024, 14:07)
Summary:

This paper studies the link prediction problem on continuous-time dynamic graphs. The main technique this paper used is the Mamba state space model (SSM).

Soundness: 2: fair
Presentation: 3: good
Contribution: 2: fair
Strengths:

S1. The introduction to the proposed method is detailed and useful for readers to understand every module and the architecture.
S2. The baseline methods used in this paper are state-of-the-art and recent.
S3. This paper includes 7 datasets with 3 different negative sampling strategies (NSS), which makes the experimental section comprehensive.

Weaknesses:

W1. It seems that the main body of the model is based on a mature model, Mamba, e.g., Eqs. 4a-4d and Eqs. 6a-6d. It looks like there are existing works using Mamba on graphs, e.g., [1], which undermines the novelty of the proposed method.

W2. Some of the model designs are not reasonable or questionable to me. I will detail them as follows.

W2.1 Lines 182-190 introduce the counting of interaction frequency. However, it is still pretty vague why the frequency features are formed like that in lines 184 and 185. E.g., why are there 5 rows in F_u^t, and why do both F_u^t and F_v^t have 2 columns? In addition, why does the encoding of the 2 columns of F_u^t use a shared MLP (in line 189)? In other words, why not use an MLP from R^2 to R^{d_f}?

W2.2 Lines 191 and 192 mention that it can save computational resources. If I understood correctly, by such patching, the time complexity to encode X_u^t will increase because the matrix has more columns (features), but the SSM module can be faster because there are fewer rows (lines 198 and 199), right?

W2.3 The section named “Dynamic Information Selection with Temporal Patterns” (line 255) seems to be the core module of the proposed method, which is related to the “edge-specific temporal pattern”. However, I think the edge information has been included in the Node-Level SSM Block (Eq. 3). Then, why repeatedly use the edge information, and what is the rationale for this?

W3. Regarding the experimental results, it looks like under the “random” negative sampling strategy (NSS) the proposed method performs very well, but under the historical and inductive NSS settings, the proposed method can only beat about half of the SOTAs.

W4. The writing of this paper is somewhat sloppy. I detail the issues as follows. W4.1 As some modules of this paper are modified from the standard S4 and Mamba SSM, I checked the dimensions and the introduction of the backbone model between lines 125 and 140. However, a lot of them are confusing. (i) In Eq. 1a, do the terms Az(\tau) and Bq(\tau) have the same dimensions? It looks like Bq(\tau) is of shape 1 x d1 but the matrix A is of shape d1 x d1, so Az(\tau) is of shape d1 x (?), which cannot have the same shape as Bq(\tau). The same problem applies to r(\tau), which is a scalar, but from the second equation in Eq. 1a, r(\tau)=Cz(\tau), and the shape of the matrix C is d1 x 1; in other words, it is not clear whether r(\tau) is a scalar or a vector.

W4.2 In line 136, I think d_2 was not mentioned in the previous content. Also, why is the model in a SISO fashion when d_2 > 1? I am not familiar with this backbone model, but this statement is not intuitive to me.

W4.3 The end of Line 161 should be |Nei_u^t| but not Nei_u^t.

[1] Behrouz, Ali, and Farnoosh Hashemi. "Graph mamba: Towards learning on graphs with state space models." Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024.

Questions:

Please check the questions mentioned in the weaknesses.

Flag For Ethics Review: No ethics review needed.
Rating: 3: reject, not good enough
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.
Code Of Conduct: Yes

Response to Reviewer 1QFu: Part A

Official Comment by Authors, 18 Nov 2024, 15:49
Comment:

Thank you for the review. We have clarified some potential misunderstandings below. We notice that you are not fully confident in your review; we hope our explanation helps you better understand the details of our paper and reach a more accurate judgment. We are working on a new revision that addresses the concerns raised in the reviews, and we will upload it once it is complete. Please first see our responses to your comments below.

Weakness 1

Please note that Graph Mamba [1] is only for static graphs and cannot be directly used on dynamic graphs. Our method is the first one specifically designed for dynamic graphs. In short, being the first to apply Mamba to the domain of dynamic graph reasoning is a novelty; furthermore, the proposed dynamic information selection module is novel and effective in modeling long-term temporal information on dynamic graphs. Please also refer to our response to Reviewer TJxG's Weakness 1, where we re-clarify our contributions. We hope this improves your impression of our paper.

Weakness 2.1

We apologize for causing confusion. Let's still consider the example in lines 180-186. Taking node $u$ for instance, the nodes that interacted with $u$ (arranged in chronological order), taken from its interaction sequence $\mathcal{S}_u^t$, are {$a, v, a$}. So $\tilde{F}_u^t$ has four rows, where the first three correspond to the nodes appearing in {$a, v, a$} and the last row corresponds to $v$, which appears as the opposite node in the to-be-predicted link $(u, v, t)$ (the potential interaction mentioned in lines 144-145). The same applies to $\tilde{F}_v^t$, and this is why it has five rows in our example. Each row in $\tilde{F}_u^t$/$\tilde{F}_v^t$ has two elements: the first element is the frequency of the row-corresponding node in $\mathcal{S}_u^t$/$\mathcal{S}_v^t$, and the second element is its frequency in $\mathcal{S}_v^t$/$\mathcal{S}_u^t$. For example, the first node $a$ in {$a, v, a$} appears in {$a, v, a$} twice and in {$b, b, u, a$} once, so the first row of $\tilde{F}_u^t$ is $[2,1]$. Please refer to [2] [3] for more details; they show that such frequency encoding is useful, so we also adopt it in our work. The shared MLP is just a design choice: our motivation is to design an efficient model, and a shared MLP helps keep the number of parameters small.
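
To make the construction concrete, here is a minimal sketch of this encoding (our own illustration, not the actual DyGMamba code; counting frequencies from the raw sequences, rather than from the sequences with the appended node, is an assumption):

```python
# Illustrative sketch of the node interaction frequency encoding described above.
from collections import Counter
import numpy as np

def frequency_features(seq_self, seq_other, opposite_node):
    # Rows follow the chronological sequence, with the opposite node of the
    # to-be-predicted link appended as the last row.
    rows = list(seq_self) + [opposite_node]
    cnt_self, cnt_other = Counter(seq_self), Counter(seq_other)
    # Column 1: frequency in the node's own sequence;
    # column 2: frequency in the other node's sequence.
    return np.array([[cnt_self[n], cnt_other[n]] for n in rows])

S_u, S_v = ["a", "v", "a"], ["b", "b", "u", "a"]
F_u = frequency_features(S_u, S_v, opposite_node="v")  # 4 rows; first row is [2, 1]
F_v = frequency_features(S_v, S_u, opposite_node="u")  # 5 rows
```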

Weakness 2.2

Your understanding is mostly correct. Increasing the number of columns of the matrix does not lead to higher time complexity, but decreasing the number of rows shortens the input sequence fed into the SSM, which saves computational resources.
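
As an illustration of this point (a minimal sketch with assumed shapes, not the actual model code):

```python
# Sketch of the patching trade-off discussed above: grouping P neighbors per
# patch widens the feature dimension but shortens the sequence the SSM scans.
import torch

L, d, P = 256, 64, 8                    # sequence length, feature dim, patch size
X = torch.randn(L, d)                   # one feature row per sampled neighbor

X_patched = X.reshape(L // P, P * d)    # more columns, fewer rows: (32, 512)
proj = torch.nn.Linear(P * d, d)        # per-patch projection runs in parallel over patches
X_in = proj(X_patched)                  # (32, 64) sequence fed to the SSM

# The sequential scan now runs over L / P = 32 steps instead of 256.
```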

Weakness 2.3

Please note that the node-level SSM encodes the information brought by each node's temporal neighbors, while the time-level SSM only considers the information corresponding to a to-be-predicted link. For example, suppose we want to predict a link between $u$ and $v$ at $t$. The node-level SSM encodes all the most recent temporal neighbors that interacted with $u$ or $v$ before $t$, while the time-level SSM only considers the historical interactions between $u$ and $v$. Moreover, the time-level SSM only models time intervals, while the node-level SSM also considers node features. In lines 59-69 we give an example explaining why it is beneficial to model temporal patterns based on such time intervals. The "edge-specific" in "edge-specific temporal pattern" refers to the to-be-predicted link $(u,v,t)$, rather than $u$'s or $v$'s temporal neighbors.

Weakness 3

We kindly ask you to pay attention to the average rank marked in our result tables. Our model consistently achieves the best overall rank among all methods. In many cases, even when DyGMamba does not rank first, it is still within the top 3.


Response to Reviewer 1QFu: Part B

Official Comment by Authors, 18 Nov 2024, 15:52 (modified: 19 Nov 2024, 12:28)
Comment:

Weakness 4.1

We apologize for the incorrect dimensions. The dimensions of $\mathbf{B}$ and $\mathbf{C}$ should be $\mathbb{R}^{d_1 \times 1}$ and $\mathbb{R}^{1 \times d_1}$, respectively. In this way $\mathbf{A}\mathbf{z}(\tau)$ and $\mathbf{B}q(\tau)$ share the same dimension $\mathbb{R}^{d_1 \times 1}$, and $r(\tau)$ is a scalar. We also noticed this, and it will be fixed in our revision.
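
For concreteness (assuming Eq. 1a is the standard continuous-time SSM form, as the discussion above suggests), the corrected dimensions give

$$\frac{\mathrm{d}\mathbf{z}(\tau)}{\mathrm{d}\tau} = \mathbf{A}\mathbf{z}(\tau) + \mathbf{B}q(\tau), \qquad r(\tau) = \mathbf{C}\mathbf{z}(\tau),$$

with $\mathbf{A}\in\mathbb{R}^{d_1\times d_1}$, $\mathbf{B}\in\mathbb{R}^{d_1\times 1}$, $\mathbf{C}\in\mathbb{R}^{1\times d_1}$, $\mathbf{z}(\tau)\in\mathbb{R}^{d_1\times 1}$, and $q(\tau), r(\tau)\in\mathbb{R}$, so that $\mathbf{A}\mathbf{z}(\tau)$ and $\mathbf{B}q(\tau)$ both lie in $\mathbb{R}^{d_1\times 1}$ and $r(\tau)$ is a scalar.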

Weakness 4.2

$d_2$ is the dimension of the input. In Eqs. 1 and 2, $q(\tau)\in \mathbb{R}$ and $q_\tau\in \mathbb{R}$ are 1-dimensional inputs, so $d_2=1$. When $d_2 > 1$, $q(\tau)\in \mathbb{R}^{d_2}$ and $q_\tau\in \mathbb{R}^{d_2}$ become vectors. SISO means that, given an input vector, an SSM uses the same set of parameters to process each element of this vector in parallel and then collects the per-dimension outputs to form the processed vector. We strictly follow [4] (its page 4, "Structure and Dimensions.") in introducing SISO in our paper. For better readability, we will introduce $d_2$ more carefully and put a more detailed explanation of SISO into the appendix of our revision.
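
As a simplified illustration of the SISO idea (a time-invariant toy SSM, not Mamba's selective scan):

```python
# Toy illustration of SISO processing: one set of discretized SSM parameters is
# applied independently to each of the d_2 input channels, and the per-channel
# outputs are then stacked to form the processed vector sequence.
import numpy as np

def siso_scan(q, A_bar, B_bar, C):
    """q: (T,) scalar input sequence -> (T,) scalar output sequence."""
    z = np.zeros(A_bar.shape[0])
    out = []
    for q_t in q:
        z = A_bar @ z + B_bar * q_t      # recurrent state update
        out.append(C @ z)                # scalar readout
    return np.array(out)

T, d1, d2 = 16, 8, 4
A_bar, B_bar, C = 0.9 * np.eye(d1), np.ones(d1), np.ones(d1) / d1
Q = np.random.randn(T, d2)               # d_2 > 1: a vector-valued input sequence

# Same parameters for every channel (SISO), then stack the channel outputs.
R = np.stack([siso_scan(Q[:, j], A_bar, B_bar, C) for j in range(d2)], axis=1)
```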

Weakness 4.3

Thank you for spotting this! We will fix the typo.

[1] Behrouz, Ali, and Farnoosh Hashemi. "Graph mamba: Towards learning on graphs with state space models." Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 2024.

[2] Yu, Le, et al. "Towards better dynamic graph learning: New architecture and unified library." Advances in Neural Information Processing Systems 36 (2023): 67686-67700.

[3] Tian, Y., Qi, Y., & Guo, F. (2023). FreeDyG: Frequency Enhanced Continuous-Time Dynamic Graph Model for Link Prediction. In The Twelfth International Conference on Learning Representations.

[4] Gu, Albert, and Tri Dao. "Mamba: Linear-time sequence modeling with selective state spaces." arXiv preprint arXiv:2312.00752 (2023).


Official Comment by Reviewer 1QFu

Official Comment by Reviewer 1QFu, 22 Nov 2024, 14:07
Comment:

Thank you for your response. I will adjust my scores accordingly.


Official Review of Submission5146 by Reviewer Gzn2

Official Review by Reviewer Gzn2, 28 Oct 2024, 17:08 (modified: 12 Nov 2024, 16:18)
Summary:

This paper presents DyGMamba, a modeling framework designed for continuous-time dynamic graphs. Built on a state space model (SSM), DyGMamba captures and leverages hidden temporal patterns from historical graph data. These temporal patterns guide the selection of critical information from a node’s temporal neighbors, enhancing the model's ability to predict dynamic links. DyGMamba achieves state-of-the-art results across seven datasets in dynamic link prediction tasks. Furthermore, extensive experiments have been conducted to demonstrate the framework's efficiency.

Soundness: 2: fair
Presentation: 3: good
Contribution: 1: poor
Strengths:
  1. The efficiency improvement introduced by Mamba is sound compared to the Transformer.
  2. This paper gives a clear and detailed description of the algorithm.
  3. Both link prediction and node classification tasks are discussed.
  4. The extensive experiments demonstrate that the Mamba structure significantly improves computational and memory efficiency compared to the Transformer.
Weaknesses:
  1. Low Novelty:
  • Based on my understanding of the algorithm, Section 3.1 is identical to DyGFormer, except that it replaces the Transformer with Mamba. In Section 3.2, the mean pooling mechanism in DyGFormer is modified into a "Dynamic Information Selection" strategy. It appears, then, that the primary contribution lies in the design of the pooling strategy, which is more of a minor technical improvement and is entirely unrelated to the Mamba structure. This is somewhat misleading, given that both the name DyGMamba and the abstract emphasize the Mamba component as a key feature.
  • Furthermore, in the ablation study, Variant A uses mean pooling for the output in Equation 5. This modification makes Variant A essentially identical to DyGFormer, with the sole difference being the replacement of the Transformer with Mamba. However, the performance of Variant A drops compared to DyGFormer, suggesting that incorporating the Mamba structure into DyGFormer negatively impacts performance. This observation implies that the performance improvements of DyGMamba stem from the "Dynamic Information Selection" mechanism, while the efficiency gains are due to the Mamba structure and the patching techniques utilized within DyGFormer. For this reason, DyGMamba’s design offers limited novelty, as its efficiency gains are entirely derived from prior studies—specifically, the Mamba structure and the patching techniques originally developed for DyGFormer.
  • In addition, from Equation 6, the recent interaction number k is selected as a very small value, such as 10 or 30. Given this, is the Mamba block necessary in this context? Since Mamba primarily enhances efficiency, its advantages may not be realized with such short sequences. I argue that the Mamba structure is unnecessary in Section 3.2, and this contradicts the motivation of the paper, suggesting that the Mamba structure may not be necessary for your design.
  2. Reference issues:
  • In line 176, from the description of the time encoding function, it is not the one from TGAT [1], but from GraphMixer [2].
  • In line 180, the Node Interaction Frequency Encoding is said to follow FreeDyG [3], but it should be attributed to Yu et al. [4], as it closely aligns with Yu's neighbor co-occurrence scheme. Specifically, the only difference between your description and DyGFormer’s neighbor co-occurrence scheme is the appending of the self-node to the last position in the sequence, which DyGFormer also implements.
  3. Some key content illustrations are vague and hard to follow.
  • In Section 3.2, lines 250–254, the part regarding "enabling batch processing" is difficult to follow. Specifically, how was the number $10^{10}$ derived? Is it accurate that using only 10 neighbors could result in such a large figure?
  • The overall explanation of "Dynamic Information Selection" is also unclear. It’s not evident why this particular design of Eq.7 was chosen or what advantages it offers. Additionally, it’s unclear how this design effectively achieves selection.
  • In the "Ablation Study," the design of Variant B is unclear, which is described as removing the Mamba SSM layers in Equation 4. Does this mean completely removing the layers, or does it involve replacing Mamba with Transformer? In my opinion, simply removing these layers lacks purpose, and replacing Mamba with Transformer would be more meaningful.
  4. Lacking Experiments:
  • From W1, it appears that the performance gains are entirely due to the "Dynamic Information Selection" mechanism. To thoroughly evaluate the impact of this design and the effectiveness of the time-difference encoding, a more detailed ablation study focused on this component would be beneficial. However, only the entirety of Section 3.2 is ablated as Variant A, which presents a rough experiment. To me, the only observation from Variant A is that the Mamba structure negatively affects DyGFormer’s performance.
  • The paper lacks tuning key hyperparameters such as $\alpha$ and $\beta$ in time encoding, the $\gamma$ in Time-Level SSM Block. These hyperparameters are crucial for model performance, and optimal settings may vary across different datasets. Additionally, the reason behind the introduction of $\gamma$ is not discussed.
  • The model was only evaluated on seven of the 13 datasets used in DyGFormer. The statistics of data sparsity could be discussed more thoroughly, along with the reason why only seven datasets are used. This would make the experimental section more persuasive.
  • Recently, a very related study, FreeDyG [3], has been proposed and might need to be compared in your experiments.
  5. There is no discussion of the limitations of this paper or possible solutions.

Conclusion:
This paper has a good motivating example in the Introduction and is well organized; however, the description of the key component in Section 3.2 is vague and difficult to follow. Additionally, the paper presents extensive experiments on efficiency comparisons, which are mainly attributable to the Mamba structure and the patching techniques from prior works. However, the effectiveness of the key component, "Dynamic Information Selection," is inadequately evaluated through experiments.
Most importantly, the novelty of the work is low, and the design of the algorithm raises concerns, particularly regarding the use of Mamba: for example, the results from Variant A indicate that incorporating Mamba can negatively impact performance, and Mamba is used on very short sequences, as shown in Equation 6. Therefore, I think this paper is insufficient for ICLR.

Reference:
[1] Inductive Representation Learning on Temporal Graphs, ICLR, 2020.
[2] Do We Really Need Complicated Model Architectures For Temporal Networks?, ICLR, 2023.
[3] FreeDyG: Frequency Enhanced Continuous-Time Dynamic Graph Model for Link Prediction, ICLR, 2024.
[4] Towards Better Dynamic Graph Learning: New Architecture and Unified Library, NIPS, 2023.

Questions:
  1. In the experiment illustrated in Fig. 4, which evaluates the scalability with varying numbers of neighbors, the analysis is conducted using the Enron dataset. However, the Enron dataset has an average degree of 676, meaning that only a small number of nodes have more than 676 neighbors. Have you considered the implications of this situation, where only a few nodes exceed this average degree? Could you discuss how this may influence the results presented in Fig. 4?

  2. In Figure 3a, why does the performance of DyGFormer decrease as the number of patches decreases, while the performance of DyGMamba increases? Could you provide some reasoning or discussion to explain this phenomenon?

  3. DyGFormer is a general framework designed to address the Continuous-Time Dynamic Graph (CTDG) problem, employing mean pooling to merge final node representations. In contrast, DyGMamba utilizes a short time-difference sequence to parameterize a weighted pooling mechanism instead of mean pooling. Based on the results from Variant A, I believe that Section 3.2 contributes to the observed performance gains. However, I have the following questions:

  • Is the Mamba structure necessary within the Time-Level SSM Block? From my perspective, both Mamba and Transformer can capture long-sequence dependencies, and Mamba has been shown to operate as a linear attention mechanism [2]. Therefore, the performance upper bound might be that of the Transformer itself or a Transformer with causal attention. Could you explain why Mamba is used for such short sequences in Section 3.2? Additionally, please provide a comparison with the Transformer structure.
  • Is time-difference encoding essential for the pooling strategy? After experimenting with your code and removing Equations (6a–7c) by setting $\beta=Softmax(Linear(H))$ in Eq. (7d) [1], I observed that the performance remains strong. This raises the question of whether time-difference encoding is necessary in your design. It seems likely that the performance gain results from weighted pooling versus mean pooling, rather than from time-difference encoded weighted pooling.
  • To justify the inclusion of Mamba in Section 3.2, you should consider applying "Dynamic Information Selection" to DyGFormer. Possible comparisons could include: 1. DyGFormer with dynamic information selection; 2. DyGFormer with dynamic information selection using the Transformer structure instead;
  4. As analyzed in W1, the introduction of Mamba appears to harm the performance of the DyGFormer structure, so further analysis of the reasons behind this would be helpful. More importantly, beyond the efficiency gains from the Mamba structure and patching techniques, could you please clarify the contributions of DyGMamba?

  5. In Section 3.2, lines 250–254, the part regarding "enabling batch processing" is difficult to follow. It’s unclear how using only 10 neighbors could lead to such a large value. Could you provide details on the batch processing—specifically, whether it follows the same batching process as DyGFormer? Additionally, how was the figure $10^{10}$ derived?

  6. The explanation of "Dynamic Information Selection" is generally unclear. There is no evidence about why the specific design in Equation 7 was chosen or what advantages it offers. Additionally, it’s not clear how this design effectively achieves selection. Could you provide a more thorough explanation of the intuition behind the Dynamic Information Selection mechanism and clarify its benefits compared to other weighted pooling mechanisms?

  7. Furthermore, other references to state space models for dynamic graph learning should be added, such as [3], and applications of state space models on graphs should be discussed, such as [4].

  8. Finally, if the Mamba structure is not a key component of your design and the performance gains stem primarily from the pooling mechanism, demonstrating its effectiveness across other CTDG methods would strengthen the paper. Integrating "Dynamic Information Selection" with other methods, such as FreeDyG, GraphMixer, and CTAN, to show performance improvements could clarify the value of this component.

References:
[1] Language Modeling with Gated Convolutional Networks, ICML, 2017.
[2] Mamba: Linear-Time Sequence Modeling with Selective State Spaces, arxiv, 2023.
[3] STG-Mamba: Spatial-Temporal Graph Learning via Selective State Space Model, arxiv, 2024.
[4] Graph Mamba: Towards Learning on Graphs with State Space Models, KDD, 2024.

Flag For Ethics Review: No ethics review needed.
Rating: 3: reject, not good enough
Confidence: 5: You are absolutely certain about your assessment. You are very familiar with the related work and checked the math/other details carefully.
Code Of Conduct: Yes

Response to Reviewer Gzn2: Part A

Official Comment by Authors, 18 Nov 2024, 18:13
Comment:

We appreciate your effort in reviewing our paper. We have clarified the misunderstandings in our response. Please refer to the details below.

Before presenting our detailed response, we want to respectfully highlight our concern that you may be unintentionally misleading the Area Chair and other reviewers in evaluating our contributions. Here are some concrete points:

  1. You reduce our contribution to merely "the design of the pooling strategy" (e.g., in Weakness 1 and Question 8), overlooking our achievement as the first to apply Mamba to the domain of dynamic graph representation learning.

  2. You repeatedly question our contribution by assuming that Mamba is unnecessary because it cannot bring better performance. We believe this stems from a bias overly focused on performance while neglecting model efficiency. This perspective is misleading, since the inclusion of Mamba in our work is motivated by achieving high efficiency, which has been thoroughly discussed in our paper. Overlooking the efficiency benefits provided by Mamba and criticizing our work solely based on performance is unreasonable.

  3. You hold a review confidence of 5, but your review includes various incorrect claims. For example: (1) both points mentioned in Weakness 2 have already been correctly discussed in our paper; (2) in Weakness 4, your request for additional experiments on other datasets is clearly beyond the scope of our paper (see our response to point 4 of Weakness 4); (3) in Weakness 4, the requested additional baseline has already been discussed in the paper (point 5 of Weakness 4). We believe the review should reflect greater responsibility, especially given the stated confidence of 5, and we feel that such a high-confidence review should avoid these errors.

We hope our response can help you better evaluate our work and we hope our work is fairly judged by the reviewers and the Area Chair. We welcome follow-up discussion.


Response to Reviewer Gzn2: Part B

Official Comment by Authors, 18 Nov 2024, 18:31 (modified: 18 Nov 2024, 20:56)
Comment:

Weakness 1

We appreciate and respect your opinion. Here is our response:

It appears, then, that the primary contribution lies in the design of the pooling strategy, which is more like a minor technical improvement and totally unrelated to the Mamba structure.

This is a serious misunderstanding. We have clearly outlined our contribution in line 70-77, which is the proposal of an efficient and effective continuous-time dynamic graph (CTDG) model based on SSM that can process long-range temporal information. The efficiency of our model benefits from Mamba so it is definitely not unrelated to our contribution.

This is somewhat misleading, given that both the name DyGMamba and the abstract emphasize the Mamba component as a key feature.

This is an unfounded accusation. We have not misled anyone: we rely on Mamba to achieve strong efficiency and it is critical to our work, so it is our choice to name our model DyGMamba. We also believe that applying an existing structure, e.g., Mamba or the Transformer, to another domain is not easy and is very meaningful; see, for example, Vision Mamba [1], ViT [2], and VideoMamba [3].

For this reason, DyGMamba’s design offers limited novelty, as its efficiency gains are entirely derived from prior studies—specifically, the Mamba structure and the patching techniques originally developed for DyGFormer.

We strongly disagree with your claim. As for efficiency, we are the first to employ Mamba in dynamic graph reasoning, which is novel. As mentioned in the last point, applying an existing structure to a new domain is meaningful. For example, ViT [2] leverages Transformer blocks to process image patches; we think it would be unfair to claim that "ViT's effectiveness solely comes from the Transformer, so it is not novel". The same principle applies in our case. Besides, our dynamic information selection module is novel: it is well motivated (in the Introduction) and improves model performance. To summarize, we believe that employing Mamba and proposing dynamic information selection to facilitate efficient and effective long-term temporal reasoning on CTDGs is novel and constitutes a sufficient contribution.

Given this, is the Mamba block necessary in this context? Since Mamba primarily enhances efficiency, its advantages may not be realized with such short sequences. I argue that the Mamba structure is unnecessary in Section 3.2.

The Mamba block in the time-level SSM is just a design choice. It is a natural choice because (1) we already use Mamba in the node-level SSM and (2) it will maintain good efficiency if we increase the recent interaction number $k$ in the future, when we deploy DyGMamba on larger datasets that really require much longer interaction histories for modeling.

And it contradicts the motivation of the paper, suggesting that the Mamba structure may not necessary to your design.

This is also a serious misunderstanding and misleading. The motivation for using Mamba is to enable efficient modeling of a large number of historical node interactions in the node-level SSM. This necessity exists regardless of the design choice for the time-level SSM used in dynamic information selection.

Weakness 2

In line 176, from the description of the time encoding function, it is not the one from TGAT, but from GraphMixer.

We are afraid that you are wrong here. The time-encoding function is from TGAT and has learnable parameters, while the function in GraphMixer does not. So our citation is correct.

In line 180, the Node Interaction Frequency Encoding is followed FreeDyG, which should be Yu et al., as it closely aligns with Yu's neighbor co-occurrence scheme. Specifically, the only difference between your description and DyGFormer’s neighbor co-occurrence scheme is the appending of the self-node to the last position in the sequence, which DyGFormer also implements.

Please take a closer look at FreeDyG's section "Node Interaction Frequency (NIF) Encoding". FreeDyG borrows DyGFormer's neighbor co-occurrence scheme and appends the self-node to the last position in the sequence. So our reference is correct.


Response to Reviewer Gzn2: Part C

Official Comment by Authors, 18 Nov 2024, 18:46
Comment:

Weakness 3

Specifically, how was the number $10^{10}$ derived?

If we look at the time duration of the considered datasets, we find that the duration is much smaller than $10^{10}$. For example, on MOOC, the time duration of the complete dataset is $176443$. This means that the time interval between any two of the recent interactions involving $u$ and $v$ is much smaller than $10^{10}$. As explained in lines 252-254, we use this large number $10^{10}$ to indicate that the pair of nodes has not had an interaction for an extremely long time, which is equivalent to having no historical interaction. We choose $10^{10}$ because, after observing the time durations of the datasets, we consider this number large enough.

Is it accurate that using only 10 neighbors could result in such a large figure?

Note that we use this large number $10^{10}$ only when we cannot find $k$ recent interactions between $u$ and $v$. So it can be considered as "padding" the sequence until its length reaches $k$; $10^{10}$ has nothing to do with $k$. If we have misunderstood your question, please correct us.

It’s not evident why this particular design of Eq.7 was chosen or what advantages it offers. Additionally, it’s unclear how this design effectively achieves selection.

Please refer to point 2 of our response to Weakness 3 proposed by Reviewer TJxG for a detailed explanation. In short, "critical temporal information" means the temporal neighbors assigned greater weights in $\beta_\theta$. Each number in $\beta_\theta$ (Eq. 7d) denotes the importance of one of $\theta$'s temporal neighbors, and this number is decided solely by the opposite node as well as the encoded temporal pattern between the pair of nodes. The design of Eq. 7 is chosen to build a connection between each pair of nodes, and it considers the edge-specific temporal patterns, which brings benefits.
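
As an illustrative simplification (our own sketch, not the paper's exact Eq. 7), the selection can be thought of as weighted pooling whose weights are conditioned on the pair's encoded temporal pattern:

```python
# Illustrative simplification of selection by weighted pooling (NOT the exact
# Eq. 7): neighbor weights are produced from the encoded temporal pattern of
# the node pair, so "critical" temporal neighbors receive larger weights.
import torch

num_neighbors, d = 32, 64
H_theta = torch.randn(num_neighbors, d)   # encoded temporal neighbors of node theta
p_uv = torch.randn(d)                     # assumed temporal-pattern embedding of the pair (u, v)

scorer = torch.nn.Linear(d, 1)
# Condition each neighbor representation on the pair's temporal pattern before scoring.
beta_theta = torch.softmax(scorer(H_theta * p_uv).squeeze(-1), dim=0)   # (num_neighbors,)

# Weighted pooling instead of mean pooling.
h_theta = (beta_theta.unsqueeze(-1) * H_theta).sum(dim=0)               # (d,)
```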

Does this mean completely removing the layers, or does it involve replacing Mamba with Transformer? In my opinion, simply removing these layers lacks purpose, and replacing Mamba with Transformer would be more meaningful.

We have stated in our paper that it means removing the layers. As discussed in lines 403-405, the aim of ablation B is to demonstrate the importance of encoding the one-hop temporal neighbors with SSM layers for capturing temporal graph dynamics, which is purposeful. Note that since we focus on developing an efficient model, switching Mamba to a Transformer for modeling long historical node interaction sequences directly goes against our motivation, so we do not think it would be more meaningful.


Response to Reviewer Gzn2: Part D

Official Comment by Authors, 18 Nov 2024, 19:01
Comment:

Weakness 4

To me, the only observation from Variant A is that the Mamba structure negatively affects DyGFormer’s performance.

This observation is biased and we think it is unfair to judge our work based on it. We have to emphasize again that our motivation is to propose a method that can effectively and efficiently model long-range temporal information. We do not only focus on outperforming previous methods, including DyGFormer. Mamba is introduced to improve scalability and efficiency, rather than beating Transformer in performance.

To thoroughly evaluate the impact of this design and the effectiveness of the time-difference encoding, a more detailed ablation study focused on this component would be beneficial. However, only the entirety of Section 3.2 is ablated as Variant A, which presents a rough experiment.

Our ablation A has clearly shown that our dynamic information selection module leads to better performance by distinguishing important temporal information from redundant parts, demonstrating the impact of its design. Besides, in lines 61-69, we explain that the evolution of time differences (time intervals) delivers hints for prediction, which is why we choose such a design for capturing these patterns. Please also pay attention to our experiments in "A Closer Look into Temporal Pattern Modeling" in lines 407-431. The synthetic datasets in these experiments require models to understand the evolution of time differences, and our dynamic information selection module is proven critical (Table 5). To address your concern, we will add a new ablation study comparing our design with weighted pooling strategies (please refer to point 2 of our response to Question 3 below for a detailed explanation).

The paper lacks tuning key hyperparameters such as $\alpha$ and $\beta$ in time encoding, the $\gamma$ in Time-Level SSM Block. These hyperparameters are crucial for model performance, and optimal settings may vary across different datasets. Additionally, the reason behind the introduction of $\gamma$ is not discussed.

$\alpha$ and $\beta$ are intermediate variables derived from embeddings and learnable parameters (we use them for simplicity in writing), rather than hyperparameters. $\gamma$ is set to $0.5$ for all datasets, as pointed out in our experiments. We did search $\gamma$'s value in {$0.1, 0.5, 0.7, 1$} and found that $0.5$ offers a nice balance between efficiency and performance across all datasets. The aim of introducing $\gamma$ is to further improve efficiency: a smaller $\gamma$ lowers the computational resource consumption of the time-level SSM, potentially at the cost of performance, while raising $\gamma$ does not necessarily lead to better performance but will definitely lower efficiency. We will put this discussion in our revision.

The model was only evaluated on seven datasets out of 13 datasets used in DyGFormer. Could be discussed more thoroughly about the statistics of data sparsity and give the reason why only seven datasets are used. This would improve the persuasion of the experiments part.

Please note that only 7 datasets from DyGLib are CTDGs, and we have considered all of them. CTDGs are completely different from discrete-time dynamic graphs (DTDGs). We have explained their differences in line 35-42 and have explicitly specified that we ONLY consider CTDG modeling in our title, abstract and everywhere else.

Recently, a very related study, FreeDyG, has been proposed and might need to be compared in your experiments.

We have explicitly stated in line 310-312 that we have tried to reproduce FreeDyG's results with their official code but we found that the reported results are not reproducible. For example, the loss cannot converge on LastFM, which is one of the most important datasets that require long-term temporal reasoning.

Weakness 5

We will discuss the limitations and possible solutions in our revision.

Question 1

Please note that both DyGMamba and DyGFormer pad sequences before sending them into the SSM and the Transformer, respectively. The number of sampled temporal neighbors is pre-fixed (e.g., 256 or 512) in both models and is directly decided by the hyperparameters $\rho$ and $p$, as discussed in "Implementation Details and Evaluation Settings" as well as Appendix C.1. As a result, regardless of the average node degree of the dataset, the same trend will appear as in Fig. 4. Imagine we had a much larger and denser long-range temporal-dependent dataset with a much larger average degree: we would naturally increase $\rho$ to incorporate more temporal information, and in that case the scalability of our model becomes very important.


Response to Reviewer Gzn2: Part E

Official Comment by Authors, 18 Nov 2024, 20:21 (modified: 18 Nov 2024, 20:25)
Comment:

Question 2

We have explained the reason for this result in lines 479-504; it is due to the use of our dynamic information selection module. To supplement, we give a more detailed discussion here. First, please note that patching decreases the sequence length by applying a projection matrix to a patch of node embeddings, giving the model many more parameters to tune (as indicated in Fig. 3b). A larger patch size mixes more sampled neighbors in each patch, introducing more trainable parameters while losing the nuanced temporal details carried by the temporal order of the neighbors within each patch.

For DyGFormer, the negative influence of mixing neighbors within patches is smaller than the positive influence of more trainable parameters, so its performance steadily increases with a growing patch size in Fig. 3a. By contrast, for DyGMamba, the negative influence of mixing neighbors is much greater than the positive influence of more trainable parameters, so it performs better when the patch size is smaller. For Variant A, given an increasing patch size, it follows the trend of DyGFormer when the patch size is below a threshold and shows degrading performance after that. This means there is a trade-off between the lost temporal details and the additional parameters when the patch size is modified.

Also, by comparing Variant A and DyGMamba, we can tell that the difference in performance trend stems from the dynamic information selection module. A smaller patch size leads to longer sequences with more temporal details. The strong capability of the dynamic information selection module in long-range temporal reasoning lets DyGMamba benefit from these nuanced temporal details, which is more influential than increasing the number of trainable parameters. We will include this discussion in our revision.

Question 3

Could you explain why Mamba is used for such short sequences in Section 3.2? Additionally, please provide a comparison with the Transformer structure.

Please refer to point 4 of our response to Weakness 1 for the explanation of why we use Mamba to capture temporal patterns in Section 3.2. Please note that using a Transformer for temporal pattern modeling goes against our motivation of proposing a model scalable to long-range temporal information. As we have mentioned, if in the future we deploy DyGMamba on larger datasets that really require much longer interaction histories for modeling, the value of $k$ will increase accordingly and the time interval sequence will no longer be short. In that case, the Transformer would cause efficiency problems due to its poor scalability.

whether time-difference encoding is necessary in your design

We will add two further ablation studies, C and D, that switch our design to other weighted pooling strategies, including the one you proposed here ($\beta = \text{Softmax}(\text{Linear}(\mathbf{H}))$). The results will be provided in the revision.

To justify the inclusion of Mamba in Section 3.2, you should consider applying "Dynamic Information Selection" to DyGFormer. Possible comparisons could include: 1. DyGFormer with dynamic information selection; 2. DyGFormer with dynamic information selection using the Transformer structure instead;

Applying dynamic information selection to DyGFormer cannot help justify the inclusion of Mamba in Section 3.2. We have highlighted throughout our paper as well as our rebuttal that we wish to build an efficient model, and this strong motivation justifies the introduction of Mamba in Section 3.2 (as discussed in the first point of our response to Question 3). From our understanding, your concern stems from performance, and you suspect that using a Transformer rather than Mamba in dynamic information selection could lead to better model performance. However, the inclusion of Mamba in Section 3.2 is motivated by improving efficiency, rather than performance. Therefore, we think the experiments you have mentioned are not indispensable for supporting the claims and design in our paper. We hope you can understand.


Response to Reviewer Gzn2: Part F

Official Comment by Authors, 18 Nov 2024, 20:39 (modified: 19 Nov 2024, 12:40)
Comment:

Question 4

As analyzed in W1, the introduction of Mamba appears to harm the performance of the DyGFormer structure, so further analysis of the reasons behind this would be helpful

This is a repeated question. We have to point out again that your analysis is seriously biased. As mentioned in the first point of our response to Weakness 4, we propose to use Mamba to efficiently model long-range temporal information, rather than beating Transformer in performance.

beyond the efficiency gains from the Mamba structure and patching techniques, could you please clarify the contributions of DyGMamba?

We believe it is unfair to undervalue or overlook our contribution of introducing Mamba into CTDG modeling. We have given a reasonable motivation and demonstrated the advantages of Mamba for dynamic graph reasoning with extensive experiments. Mamba helps to efficiently capture long-range temporal information in CTDG modeling. Besides, our dynamic information selection module can effectively model long-term temporal dependencies in CTDGs, and it brings more benefits when more temporal information is considered.

Question 5

We have provided an example in lines 251-254. To supplement, let us give a more detailed explanation. Assume we have a batch of potential links {$(u, v, t)$} during evaluation, where each of them has a different number of historical interactions between its nodes. Given a fixed $k$, we cannot guarantee that $k$ such interactions can be found for every potential link. This makes the lengths of the time-difference sequences different across potential links. To enable batch processing, we want all time-difference sequences to have length $k$, so we repeatedly pad the sequences shorter than $k$ with a very large number (in our case $10^{10}$) until their lengths reach $k$. This large number indicates that the pair of nodes has not had an interaction for an extremely long time, similar to having no historical interaction. The batching process in the time-level SSM differs from that of DyGFormer because the pad values are different. Please refer to points 1 and 2 of our response to Weakness 3 for the explanation of why we choose $10^{10}$ for padding.
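
A minimal sketch of this padding step (illustrative only, with assumed tensor shapes; the position of the pad values within the sequence is our assumption):

```python
# Sketch of padding time-difference sequences to a fixed length k so that they
# can be batched for the time-level SSM.
import torch

PAD_VALUE = 1e10   # "no interaction for an extremely long time"
k = 5

# Time-difference sequences of recent u-v interactions, one list per potential link.
batch = [[12.0, 30.0, 7.0], [4.0], [1.0, 2.0, 3.0, 4.0, 5.0]]

padded = torch.full((len(batch), k), PAD_VALUE)
for i, seq in enumerate(batch):
    seq = seq[-k:]                              # keep at most the k most recent intervals
    padded[i, : len(seq)] = torch.tensor(seq)   # remaining slots keep the huge pad value

# `padded` is now a (batch_size, k) tensor that the time-level SSM can process.
```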

Question 6

This is also a repeated question. We kindly ask you to refer to (1) our response to Weakness 3 proposed by Reviewer TJxG and (2) point 3 of our response to your Weakness 3. In short, this design promotes (1) information selection within a large amount of temporal information by assigning larger weights to important temporal neighbors, and (2) inter-node information fusion between the nodes in each pair, conditioned on their historical temporal patterns. To supplement, we are adding two ablation studies, C and D, to compare our design with two different weighted pooling strategies, as mentioned above.

Question 7

Thank you for your suggestion. We will add them in the revision. Please note, however, that spatial-temporal graphs and dynamic graphs are different types of data: research on dynamic graphs focuses on the temporal evolution of the graph structure, while research on spatial-temporal graphs focuses on evolving spatial and temporal features. Also, static graph representation learning methods such as Graph Mamba do not consider the dynamic evolution of the graph structure. Therefore, discussing these related works does not diminish our contribution.

Question 8

Your assumption "if the Mamba structure is not a key component of your design" is incorrect and misleading. We have justified why Mamba is the key component throughout our rebuttal. We look forward to your further response as well as follow-up questions. We respect your opinion, but we hope you won’t be influenced by your own subjective assumptions.

[1] Zhu, Lianghui, et al. "Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model." Forty-first International Conference on Machine Learning.

[2] Dosovitskiy, Alexey. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).

[3] Li, Kunchang, et al. "Videomamba: State space model for efficient video understanding." European Conference on Computer Vision. Springer, Cham, 2025.


Response to Author's rebuttal

Official Comment by Reviewer Gzn2, 24 Nov 2024, 09:50 (modified: 24 Nov 2024, 10:15)
Comment:
  • First and foremost, I do acknowledge applying Mamba as a contribution. However, the primary issue lies in the observations from Variant A, where applying Mamba leads to a performance drop of DyGFormer—a critical point that the authors fail to address, even in the revised version.
    As you mentioned, Mamba is applied to improve efficiency and the pooling strategy is modified to improve effectiveness. Mamba, being widely recognized as more efficient than transformers, offers efficiency gains. However, the performance degradation caused by Mamba raises a significant research question. Rather than investigating why Mamba leads to this performance degradation, the authors propose a completely unrelated pooling strategy (unrelated to mamba) to offset the performance degradation.
    In my view, the key research focus should be on understanding why Mamba negatively impacts performance and addressing that issue, rather than introducing an unrelated pooling structure to obscure the performance degradation. This would provide a more meaningful and scientifically grounded contribution.
    Why do I say the pooling structure is unrelated to Mamba? Even though the proposed pooling strategy incorporates Mamba, the sequence length $k$ in the pooling strategy is significantly smaller than the number of neighbors $|Nei|$. On such short sequences, the efficiency advantage of Mamba over Transformers is negligible. For example, on the SocialEvo dataset, with an average neighbor number of 27993 and a maximum neighbor number of 94599, it only uses $k=5$. From the perspective of effectiveness, this pooling module’s contribution appears unrelated to Mamba. If Mamba were replaced with other structures, such as the Transformer, GLA [1], or RWKV [2] (both more efficient than the Transformer), would the performance gain from this module change?

  • Secondly, while you emphasize the efficiency contribution of using Mamba, it is undeniable that Mamba is more efficient than transformers. However, this does not mean that replacing transformers with Mamba in DyGFormer constitutes a significant contribution. Simply substituting Mamba for transformers without investigating the reasons behind the performance degradation cannot be considered a major research achievement.

  • Thirdly, regarding the origins of the Time Encoding, I think it is derived from GraphMixer rather than TGAT. This is because the time encoding in TGAT is defined as $[\cos(w_1 t), \sin(w_1 t), \ldots, \cos(w_n t), \sin(w_n t)]$, while GraphMixer's time encoding follows the format $[\cos(w_1 (t-t')), \cos(w_2 (t-t')), \ldots, \cos(w_n (t-t'))]$. DyGFormer adopts the latter format with trainable $\boldsymbol{w}$, aligning more closely with GraphMixer’s approach.
    As for the Node Interaction Frequency Encoding, why do I attribute it to DyGFormer rather than FreeDyG? The only notable difference between DyGFormer and FreeDyG lies in the position where the self-node is appended: DyGFormer appends the self-node to the first position, while FreeDyG appends it to the last. Given this distinction, the Node Interaction Frequency Encoding originates from DyGFormer. That said, if you append the self-node to the last position, it aligns with FreeDyG—this was my misunderstanding earlier, and I acknowledge the correction.
    Additionally, my request for further experiments on other datasets is not unreasonably beyond the scope of your work. Since your research builds upon the DyGFormer framework, which has been evaluated on a comprehensive range of datasets, it is reasonable to ask why your experiments are limited to datasets with strictly continuous timestamps. Why can’t your method handle datasets with larger time intervals? Which specific module in your method restricts the framework to only handle datasets with continuous timestamps?

At last, the author has emphasized the issue of "repeated questions". I want to clarify that there are no repeated questions in my review. Weaknesses identify an inherent flaw in your work, while questions highlight aspects that you could address to improve and enhance your study. While they may overlap in viewpoints, their purposes are distinct: weaknesses point out existing shortcomings, whereas questions offer opportunities for further refinement.

Overall, while the author has addressed some issues, the core design problem remains unresolved—specifically, the lack of investigation into the performance degradation caused by applying Mamba. Instead of focusing on understanding and mitigating this degradation to improve effectiveness, the work modifies DyGFormer’s pooling strategy, which is unrelated to the Mamba structure, to enhance effectiveness. This gives the impression that the pooling strategy modification serves to obscure the performance degradation introduced by applying Mamba.

For these reasons, I keep my score unchanged.

[1] Gated Linear Attention Transformers with Hardware-Efficient Training
[2] RWKV: Reinventing RNNs for the Transformer Era


Further Response: Part A

Official Comment by Authors, 24 Nov 2024, 18:52 (modified: 25 Nov 2024, 08:30)
Comment:

Thank you for acknowledging our contribution of introducing Mamba into dynamic graph representation learning. And also we appreciate the re-clarification of your concerns. Here is our further response.

the primary issue lies in the observations from Variant A, where applying Mamba leads to a performance drop of DyGFormer—a critical point that the authors fail to address, even in the revised version.

Variant A is the ablation study showing the effectiveness of dynamic information selection. We do not focus on developing a new SSM structure to surpass Transformer in performance. We employ the dynamic selection module to overcome Mamba's performance limitations, with a design specifically tailored for CTDG modeling. After all, we are not a paper trying to improve Mamba's general performance in sequence modeling. We just focus on how to apply it in dynamic graph representation learning to enable efficient and effective reasoning. So we really do not think that we have anything to address.

Rather than investigating why Mamba leads to this performance degradation, the authors propose a completely unrelated pooling strategy (unrelated to mamba) to offset the performance degradation.

Why would we have to propose a method that offsets the performance degradation by changing Mamba itself? Our paper addresses problems in the field of dynamic graph representation learning (please note that the track of our submission is "learning on graphs and other geometries & topologies"). Of course we think developing a new SSM structure to improve performance is very interesting and meaningful, but it is not a must when applying Mamba to a new domain. Dynamic information selection improves performance from the graph learning side, which perfectly aligns with the research field of dynamic graphs. We believe this contribution has been well discussed and its effectiveness proven within our research scope. As a paper focused on studying graphs, investigating why Mamba leads to performance degradation compared to the Transformer is not necessarily within the scope of our work. We hope you can understand our perspective. Even if your opinion remains unchanged, we respect your viewpoint; however, we believe we have clearly articulated our considerations and do not find the point you raised to be a valid reason to criticize our paper.

Why do I say the pooling structure is unrelated to mamba?...From the perspective of effectiveness, this pooling module’s contribution to effectiveness appears unrelated to Mamba.

Please note that the inclusion of Mamba in dynamic information selection is just a design choice; we have explained this choice in our response to the second-to-last point of your Weakness 1. To supplement, let us start with your example on SocialEvo. On SocialEvo, we set $k=5$ because the optimal value of the number of sampled one-hop neighbors $|Nei_\theta^t|$ is 32. We mentioned in lines 256-257 of the revision that $k$ is set smaller than $|Nei_\theta^t|$, so it has no direct relation to the average or maximum neighbor number. If you look at Enron, you will find that we set $|Nei_\theta^t| = 256$ and $k=30$, so the optimal value of $k$ is related to $|Nei_\theta^t|$. Imagine that in the future we have very large new datasets that actually require a very big $|Nei_\theta^t|$; our introduction of Mamba in the time-level SSM would then demonstrate its advantage.

Considering replacing Mamba with other structures, such as Transformer, GLA [1], or RWKV [2] (both effecient than transformer), would the performance gain from this module change?

We believe the performance gain will change if we change Mamba to other sequence modeling structures. But after all, it is just a design choice -- just like some works choose to use LSTM to model sequences while some other works achieve that with GRU. Our choice of Mamba is intuitive and with the consideration of potential generalization power on future datasets. We really appreciate your suggestion, but we still believe this point is not a valid reason to criticize our paper.

Replying to Response to Author's rebuttal

Further Response: Part B

Official Comment by Authors, 24 Nov 2024, 19:38
Comment:

Secondly, while you emphasize the efficiency contribution of using Mamba, it is undeniable that Mamba is more efficient than transformers. However, this does not mean that replacing transformers with Mamba in DyGFormer constitutes a significant contribution. Simply substituting Mamba for transformers without investigating the reasons behind the performance degradation cannot be considered a major research achievement.

Still, as we have mentioned in Further Response: Part A, as a paper focused on studying graphs, investigating why Mamba leads to performance degradation compared to Transformer is not necessarily within the scope of our work. Our proposed Dynamic information selection is also a great contribution beyond introducing Mamba into dynamic graph representation learning. This information selection module tries to improve performance from the side of graph learning, which perfectly aligns with the research field of dynamic graph. We believe that designing an efficient and effective CTDG model that can handle long-term temporal information is a major research achievement.

regarding the origins of the Time Encoding...

Sorry, you are still wrong. GraphMixer's $\omega$ is not learnable; try searching for the sentence "Notice that ω is fixed and will not be updated during training." on its page 4. By contrast, TGAT's parameters are learnable, which is the same as ours. We think your misunderstanding comes from TGAT's Eq. 5; please also pay attention to its Eq. 6, where TGAT feeds the time difference into the function of its Eq. 5, which is again the same as ours. In lines 183-186 of our revision, we clearly state that we use TGAT's function to encode time differences. So our citation is correct.
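
For reference, a minimal sketch of a learnable time-difference encoding in this TGAT-style spirit (our simplification, not the exact function in the paper):

```python
# Sketch of a learnable time-difference encoder: trainable frequencies and
# phases applied to the time difference t - t'.
import torch

class LearnableTimeEncoder(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.w = torch.nn.Parameter(torch.randn(dim))   # learnable frequencies
        self.b = torch.nn.Parameter(torch.zeros(dim))   # learnable phases

    def forward(self, delta_t):
        # delta_t: tensor of time differences t - t', any shape.
        return torch.cos(delta_t.unsqueeze(-1) * self.w + self.b)

enc = LearnableTimeEncoder(dim=16)
phi = enc(torch.tensor([0.0, 3.5, 12.0]))   # shape (3, 16)
```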

it is reasonable to ask why your experiments are limited to datasets with strictly continuous timestamps. Why can’t your method handle datasets with larger time intervals? Which specific module in your method restricts the framework to only handle datasets with continuous timestamps?

We have actually discussed this in our revision, in Appendix K. The issue is "not suitable", rather than "we can't". As discussed in our revision: "DTDGs are represented as a sequence of graph snapshots, where all the edges in a snapshot are taken as existing simultaneously. This poses a challenge to DyGMamba because it can only encode edges sequentially, which are not suitable for modeling concurrent edges." We think DyGFormer is also not suitable for DTDG modeling, even though it provided experiments on them: DyGFormer randomly arranges the order of temporal neighbors occurring at the same timestamp in DTDGs, which also makes it unsuitable for modeling DTDGs. Neither DyGMamba nor DyGFormer can optimally capture the concurrent links of each graph snapshot. A possible solution is to equip them with graph neural networks, but that goes beyond our scope. That is why we emphasize again and again throughout our paper and rebuttal that we only focus on CTDG modeling. Of course we could run DyGMamba on DTDGs, but we do not wish to overstate our method's applicability to graph types for which we believe it is not perfectly suitable. We believe that CTDG modeling is both challenging and meaningful, and we have clearly defined the scope of our research. Therefore, we do not consider this a valid basis for criticizing our paper.

