MGMA: Mesh Graph Masked Autoencoders for Self-supervised Learning on 3D Shape

ICLR 2023 Conference Withdrawn Submission

22 Sept 2022 (modified: 13 Feb 2023) · Readers: Everyone
Keywords: mesh graph, self-supervised learning, masked autoencoder, attention
TL;DR: We introduce a self-supervised learning model to extract face-node and global graph embeddings on meshes.
Abstract: We introduce a self-supervised learning model to extract face-node and global graph embeddings on meshes. We define a graph masking on a mesh graph composed of faces. We evaluate our model on shape classification and segmentation benchmarks. The results suggest that our model outperforms prior state-of-the-art mesh encoders: on the ModelNet40 classification task it achieves an accuracy of 89.8%, and on the ShapeNet segmentation task it achieves a mean Intersection-over-Union (mIoU) of 78.5. Further, we explore and explain the correlation between test and training masking ratios in MGMA, and we find that the best performance is obtained when mesh graph masked autoencoders are trained and evaluated under different masking ratios. Our work may open up new opportunities to address label scarcity and improve learning power in geometric deep learning research.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Supplementary Material: zip
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning

Submission Withdrawn by the Authors

ICLR 2023 Conference Paper776 Authors
16 Jan 2023, 06:24 · Readers: Everyone
Withdrawal Confirmation: I have read and agree with the venue's withdrawal policy on behalf of myself and my co-authors.

Official Review of Paper776 by Reviewer RF7b

ICLR 2023 Conference Paper776 Reviewer RF7b
29 Oct 2022, 01:50 · Readers: Everyone
Summary Of The Paper:

The paper proposes a novel attention-based encoder architecture for meshes: each node corresponds to a face on the mesh, the attention module aggregates information across the nodes, and the face-node encodings are then pooled into a global feature.

The paper also proposes a self-supervised task to train the encoder: some of the node features are randomly masked out, and the model then tries to reconstruct the shape of the mesh.
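
For concreteness, here is a minimal PyTorch sketch of the pipeline as described in this summary; the feature dimension, the single attention layer, and the linear point decoder are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskedFaceAutoencoder(nn.Module):
    """Minimal sketch: mask face-node features, encode with attention,
    reconstruct a point set from the pooled global embedding."""

    def __init__(self, feat_dim=9, embed_dim=128, n_points=1024):
        super().__init__()
        self.embed = nn.Linear(feat_dim, embed_dim)
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))  # assumed learned mask token
        self.attn = nn.MultiheadAttention(embed_dim, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(embed_dim, n_points * 3)       # global code -> xyz points
        self.n_points = n_points

    def forward(self, face_feats, mask_ratio=0.5):
        # face_feats: (B, F, feat_dim), one feature vector per mesh face
        x = self.embed(face_feats)
        B, F, D = x.shape
        # Randomly mask out a subset of face-node features
        masked = torch.rand(B, F, device=x.device) < mask_ratio
        x = torch.where(masked.unsqueeze(-1), self.mask_token.expand(B, F, D), x)
        x, _ = self.attn(x, x, x)           # aggregate information across face nodes
        global_code = x.max(dim=1).values   # order-invariant pooling to a global feature
        return self.decoder(global_code).view(B, self.n_points, 3)

# Toy usage: two meshes, 500 faces each, 9-dim per-face features
recon = MaskedFaceAutoencoder()(torch.randn(2, 500, 9))   # -> (2, 1024, 3)
```

The order-invariant max-pooling is what lets the same global code feed both the reconstruction decoder and downstream task heads.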

Strength And Weaknesses:

Strengths:

The idea and implementation are very simple, yet effective.

Competitive with SOTA.

Weaknesses:

The contribution is incremental in my opinion:

  • The encoder is a twist on point-cloud-based transformers: instead of using the vertices, the paper uses the midpoints of faces.
  • The self-supervised task is a natural extension of [He et al. 2022] to mesh data.

Sometimes there is a lack of clarity; more precise writing would be helpful for the reader.

It is weird that point-cloud-only methods perform better than mesh-based ones (Tab. 3), especially as a point cloud can easily be extracted from a mesh. It is as if using the neighbourhood info hurts. So why not just use, e.g., Point Transformer [https://arxiv.org/abs/2012.09164]? It has the highest baseline for supervised classification, and the self-supervised representation learning task would work exactly the same way. Why is the mesh advantageous (for the tasks it is applied to)?

Clarity, Quality, Novelty And Reproducibility:

Clarity:

There are a lot of details missing; the reader can only guess them from the context:

  • What are the face-node features? Are they the 3D coordinates of the face midpoints?
  • In Eq. 2, again, what are the input features to the encoder? Face midpoints?
  • In Section 4.1, it is not explicitly shown with a loss equation that the output of the encoder is used to predict the classes and that cross-entropy is used (?), while the decoder and the self-supervised loss do not play any part here.

Quality:

The results are competitive with SOTA.

Novelty:

Incremental, but novel. I have not seen this encoder before, nor the extension of [He et al. 2022] to mesh data.

Reproducibility:

From the paper, a competent practitioner could reimplement it.

Summary Of The Review:

The paper is not bad or wrong, but it does not advance the field significantly enough. It does not break new ground in performance, provide surprising new insights, or offer a new theory.

Correctness: 4: All of the claims and statements are well-supported and correct.
Technical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Empirical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Flag For Ethics Review: NO.
Recommendation: 5: marginally below the acceptance threshold
Confidence: 3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked.

Official Review of Paper776 by Reviewer tuMB

ICLR 2023 Conference Paper776 Reviewer tuMB
25 Oct 2022, 21:25 · Readers: Everyone
Summary Of The Paper:

The paper presents masked autoencoders for meshes. The idea is motivated by masked autoencoders in the case of images. The work treats meshes as graphs, with mesh faces as nodes and edges to capture mesh topology/connectivity. Masking is done by randomly removing nodes/faces along with their associated edges, similar to the idea of masked autoencoders. The actual method is a pretty direct application of masked AE to the graph network setting by adopting a face attention mechanism. Comparison is provided on SHREC11 and ModelNet40 (both datasets are close to being saturated), with very marginal improvement, if any.
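
A hypothetical sketch of this masking step over an edge-index representation (the paper's exact procedure may differ):

```python
import torch

def mask_face_graph(edge_index, num_faces, mask_ratio=0.5):
    """Drop a random subset of face nodes and remove every edge
    incident to a dropped node (assumed implementation)."""
    keep = torch.rand(num_faces) >= mask_ratio   # (F,) per-face visibility
    src, dst = edge_index                        # (2, E) undirected face adjacency
    edge_keep = keep[src] & keep[dst]            # an edge survives only if both ends do
    return keep, edge_index[:, edge_keep]

# Toy graph: 6 face nodes in a chain
edges = torch.tensor([[0, 1, 2, 3, 4],
                      [1, 2, 3, 4, 5]])
visible, visible_edges = mask_face_graph(edges, num_faces=6)
```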

The paper has limited novelty and results in very marginal improvements.

Strength And Weaknesses:
  • Uses masked AE in the context of meshes
  • A simple adaptation and application to graph neural networks
  • Very limited novelty
  • Performance improvements are marginal at best
  • Scores low w.r.t. novelty and performance. Would have liked to see tests on more challenging datasets or real/scan data.
Clarity, Quality, Novelty And Reproducibility:

Section 3 is simple and concise. It is appropriate given the (limited) contribution of the work.

Novelty is low. A pretty direct adaptation of masked AE for images.

Should be reproducible.

Summary Of The Review:

Limited novelty and very marginal performance enhancement prompt me to give a low score.

Correctness: 3: Some of the paper’s claims have minor issues. A few statements are not well-supported, or require small changes to be made correct.
Technical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Empirical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Flag For Ethics Review: NO.
Recommendation: 3: reject, not good enough
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Official Review of Paper776 by Reviewer hqAC

ICLR 2023 Conference Paper776 Reviewer hqAC
24 Oct 2022, 13:49 · Readers: Everyone
Summary Of The Paper:

In this paper, the authors propose a self-supervised method to learn mesh representations. The proposed method uses graph attention layers to build the model and a graph masking strategy to train it. The learned global feature is trained with the task of reconstructing point clouds. The trained model is evaluated on shape classification and segmentation tasks on the SHREC11 and ModelNet40 datasets.
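
The reconstruction objective referred to here is a Chamfer Distance, per the comment on Section 4.2 below; a minimal sketch of the standard symmetric form (whether the paper uses exactly this variant is an assumption):

```python
import torch

def chamfer_distance(pred, target):
    """Standard symmetric Chamfer distance between point sets.
    pred: (B, N, 3) reconstructed points, target: (B, M, 3) ground truth."""
    d = torch.cdist(pred, target)   # (B, N, M) pairwise Euclidean distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

loss = chamfer_distance(torch.randn(2, 1024, 3), torch.randn(2, 2048, 3))
```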

Strength And Weaknesses:

Pros

-This paper is well written and easy to follow.

-The proposed representation learning is self-supervised, without extra annotation effort. It also takes advantage of the graph attention layer and graph masking to learn effective representations.

-Experimental results show the proposed method achieves comparable performance to existing methods on the shape classification and segmentation tasks. The analysis of the effect of masking ratios at the training and evaluation stages reveals that different ratios lead to performance variation.

Cons

-The technical contribution of this work is limited, given the similar work on graph attention layers for point cloud/mesh processing and masking strategies in image processing. The study of the effect of the training and evaluation ratios is more an empirical study than a contribution, as no guidance is provided on how to select the optimal values without trying all possible values.

  • In Section 4.2, the best model is picked according to the lowest Chamfer Distance on the provided test data. Although the authors argue there is no label leakage, the involvement of test data in the selection of models is not acceptable. What is the average performance of a randomly picked model?

-The authors argue that one reason their method outperforms others in Table 3 is the attention mechanism. However, evidence of the learned attention is expected to be provided to support the statement.

-In the text, the classification task is put before the segmentation task, which differs from the order in Figures 3, 9, and 10. The authors are advised to make the order consistent.

-In Table 4, it is unclear if the Multi-Task model is the one from the reference (Hassani & Haley, 2019). Furthermore, more baselines on this task are expected to be added to Table 4.

-In Figure 4, both (b) and (c) are about the train and test masking ratios.

-It is unclear what the authors intend to state at the end of Section 5.1: 'Second, having the same masking ratio could make the model rely on finding masking information from the mesh'. It is unclear how finding masking information can help the task of classification or segmentation.

-The networks for classification and segmentation are not shown in either the main text or the appendix. It is weird to me that the authors set a 1-ring neighborhood for the neighbor lookup in the first two mesh graph attention blocks and 2-ring neighborhoods for the last layer (see the k-ring sketch after this list). The authors may explain the reason for this setting.

-In Section 4.1, 'other night methods' should be 'other nine methods'. The format of the references in Table 3 is incorrect.
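
To make the 1-ring/2-ring lookup questioned above concrete, here is a small hypothetical helper (not from the paper) that gathers the k-ring of a face by breadth-first search on the face-adjacency graph:

```python
from collections import deque

def k_ring(adjacency, seed, k):
    """All faces within k hops of `seed` on the face graph, excluding `seed`.
    `adjacency` maps a face index to its directly adjacent faces."""
    seen, frontier = {seed}, deque([(seed, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nbr in adjacency[node]:
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, depth + 1))
    seen.discard(seed)
    return seen

# Chain of faces 0-1-2-3: the 1-ring of face 1 is {0, 2}; the 2-ring adds face 3.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(k_ring(adj, 1, 1), k_ring(adj, 1, 2))   # {0, 2} {0, 2, 3}
```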

Clarity, Quality, Novelty And Reproducibility:

The paper is easy to follow. It combines existing techniques and conducts some further experimental analysis on the setting of masking ratios. However, the contribution is limited.

Summary Of The Review:

Overall, this paper lacks technical contribution and has unclear details, as stated in the weaknesses section.

Correctness: 3: Some of the paper’s claims have minor issues. A few statements are not well-supported, or require small changes to be made correct.
Technical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Empirical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Flag For Ethics Review: NO.
Recommendation: 3: reject, not good enough
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.

Official Review of Paper776 by Reviewer Z7zm

ICLR 2023 Conference Paper776 Reviewer Z7zm
19 Oct 2022, 15:39 · Readers: Everyone
Summary Of The Paper:

The paper proposes a masked autoencoder architecture for 3D meshes. The input mesh is processed so that the faces are treated as nodes of a connected, undirected graph. Then, some nodes of the resulting graph are masked and the graph is passed to attention layers; a final max-pooling produces the graph embedding, which can be used for different applications. The method is tested on the classification and part segmentation tasks, with some qualitative results on shape reconstruction, showing results on par with the SoTA.
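
A sketch of this face-graph construction, under the common assumption that two faces are connected whenever they share a mesh edge (not necessarily the paper's exact preprocessing):

```python
from collections import defaultdict

def face_adjacency(faces):
    """Treat each triangle as a node and connect two faces whenever
    they share a mesh edge, yielding an undirected face graph."""
    edge_to_faces = defaultdict(list)
    for f_idx, (a, b, c) in enumerate(faces):
        for u, v in ((a, b), (b, c), (c, a)):
            edge_to_faces[(min(u, v), max(u, v))].append(f_idx)
    pairs = set()
    for shared in edge_to_faces.values():
        for i in shared:
            for j in shared:
                if i < j:
                    pairs.add((i, j))
    return sorted(pairs)

# Two triangles sharing edge (1, 2) -> one undirected face-graph edge (0, 1)
print(face_adjacency([(0, 1, 2), (1, 3, 2)]))   # [(0, 1)]
```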

Strength And Weaknesses:

STRENGTH

S1) RESEARCH DIRECTION: Adapting masking and attention mechanisms to mesh data is a compelling and quite active research field in the geometric deep learning community.

S2) MASK DISCUSSION: The paper discusses the performance variation at different masking ratios. Such analysis is useful for future work, pointing out that there is a margin to find more sophisticated masking policies.

WEAKNESSES

W1) NOVELTY: The proposed method does not introduce any particular methodological novelty. Attention mechanisms have already been applied to 3D shapes (e.g., "Shape registration in the time of transformers", Trappolini et al., 2021), as have masking techniques on meshes (Liang et al., 2022). On graphs, some attempts can be mentioned as well (e.g., "MGAE: Masked Autoencoders for Self-Supervised Learning on Graphs", Tan et al., 2022; "Graph Masked Autoencoders with Transformers", Zhang et al., 2022). Further references can be found in a recent survey ("A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond", Zhang et al., 2022). Similarly, several references can be found in "Transformers in 3D Point Clouds: A Survey", Lu et al., 2022. I am particularly concerned by (Liang et al., 2022), since it is mentioned in the related work without any positioning w.r.t. the paper's contribution. I think a proper discussion of the methodological novelty should be provided.

W2) RELATED WORK AND COMPARISONS: I think there are significant works that are not discussed and against which a comparison would be relevant. DiffusionNet (Sharp et al., 2020) and DeltaConv (Wiersma et al., 2022) are both SoTA networks for meshes and point clouds that show results on the same datasets used in this paper (also, DeltaConv has the best results on ModelNet40). I also suggest modifying both the "Self-Supervised Learning" and "Transformer Applications" paragraphs to be more focused on 3D data and, in particular, meshes.

W3) EXPERIMENTS: The architecture is quantitatively tested on only two datasets, one of which previous methods already saturate. The applicability of the method to more challenging and advanced tasks (e.g., protein segmentation, shape matching, non-rigid object segmentation/classification) is unclear. I think this significantly limits the impact of the work and does not clarify the real applicative scenarios of the method. Finally, while the results are promising, in general they do not fully reach the state of the art.

Clarity, Quality, Novelty And Reproducibility:

The novelty of the method is among my main concerns. Similar ideas have already been proposed in the same domain. I have no particular doubts about the method's reproducibility. The paper is overall clear, while the introduction contains vague statements which may convey a wrong message:

  1. "mesh, an irregular data format" -> "irregular" is unclear; since they enjoy more structure than a point cloud.
  2. "Traditional studies do not consider unlabeled data" -> what does "traditional" mean exactly?

MINOR FIXES: I have also found some typos:

  1. "neighing nodes' -> Neighbourhood nodes?
  2. "Equation1to" -> Missing a space
  3. "we first verify" -> Missing capital letter
Summary Of The Review:

My principal concerns are about the novelty of the method and the chosen experimental setting. In its present state, I do not think the work is ready for publication. I am looking forward to the rebuttal for a discussion of the work's original contribution and its positioning w.r.t. the missing methods.

Correctness: 4: All of the claims and statements are well-supported and correct.
Technical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Empirical Novelty And Significance: 2: The contributions are only marginally significant or novel.
Flag For Ethics Review: NO.
Recommendation: 3: reject, not good enough
Confidence: 4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.