MGMA: Mesh Graph Masked Autoencoders for Self-supervised Learning on 3D Shape (Withdrawn Submission)
Official Review of Paper776 by Reviewer RF7b
The paper proposes a novel attention-based encoder architecture for meshes: each node corresponds to a face on the mesh, the attention module aggregates information across the nodes, and the face node encodings are then pooled into a global feature.
The paper also proposes a self-supervised task to train the encoder: some of the node features are randomly masked out, and the model then tries to reconstruct the shape of the mesh.
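To make this summary concrete, here is a minimal sketch of the pipeline as I understand it; the layer choices, names, and dimensions below are my own stand-ins (a generic attention layer instead of the authors' mesh graph attention), not their actual implementation:

```python
import torch
import torch.nn as nn

class FaceGraphMAE(nn.Module):
    def __init__(self, in_dim=3, hid_dim=256, n_points=1024, mask_ratio=0.5):
        super().__init__()
        self.n_points = n_points
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(in_dim, hid_dim)
        # generic stand-in for the paper's mesh graph attention blocks
        self.attn = nn.MultiheadAttention(hid_dim, num_heads=4, batch_first=True)
        self.decoder = nn.Linear(hid_dim, n_points * 3)

    def forward(self, face_feats):  # (B, F, in_dim): one node per mesh face
        x = self.embed(face_feats)
        # randomly zero out ("mask") a subset of face nodes
        keep = (torch.rand(x.shape[:2], device=x.device) > self.mask_ratio).float()
        x = x * keep.unsqueeze(-1)
        x, _ = self.attn(x, x, x)   # attention aggregates across face nodes
        g = x.max(dim=1).values     # max-pool into a global shape feature
        return self.decoder(g).view(-1, self.n_points, 3)  # reconstructed point cloud
```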
Strengths:
The idea and implementation are very simple, yet effective.
Competitive with SOTA.
Weaknesses:
The contribution is incremental in my opinion:
- The encoder is a twist on point-cloud-based transformers: instead of using the vertices, the paper uses the midpoints of faces (a small illustration follows this list).
- The self-supervised task is a natural extension of [He et al. 2022] to mesh data.
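For illustration, the face-midpoint node features mentioned above amount to something like the following (a hypothetical helper, not the authors' code):

```python
import numpy as np

def face_midpoints(vertices, faces):
    """vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices."""
    return vertices[faces].mean(axis=1)  # (F, 3): one 3D centroid per triangle
```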
Sometimes there is a lack of clarity; more precise writing would be helpful for the reader.
It is strange that point-cloud-only methods perform better than mesh-based ones (Table 3), especially since a point cloud can easily be extracted from a mesh. It is as if using the neighbourhood information hurts. So why not just use, e.g., Point Transformer [https://arxiv.org/abs/2012.09164]? It is the strongest baseline for supervised classification, and the self-supervised representation learning task would work exactly the same way. Why is the mesh advantageous for the tasks it is applied to?
Clarity:
There are a lot of details missing; the reader can only guess them from the context:
- What are the face node features? Are they the 3D coordinates of the face midpoints?
- In Eq. 2, again, what are the input features to the encoder? Face midpoints?
- In Section 4.1, it is not shown explicitly, with a loss equation, that the output of the encoder is used to predict the classes and that cross-entropy is used (?), while the decoder and the self-supervised loss play no part here (a plausible form is written out below).
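For concreteness, the classification loss I assume is meant, though the paper never writes it down (the class weights $w_c$, biases $b_c$, and pooled global feature $g$ are my own notation), would be the standard cross-entropy over the class logits:

$$\mathcal{L}_{\text{cls}} \;=\; -\sum_{c=1}^{C} y_c \,\log \frac{\exp(w_c^\top g + b_c)}{\sum_{c'=1}^{C} \exp(w_{c'}^\top g + b_{c'})}$$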
Quality:
The results are competitive with SOTA
Novelty:
Incremental, but novel. I have not seen this encoder before, nor the extension of [He et al. 2022] to mesh data.
Reproducibility:
From the paper, a competent practitioner could reimplement it.
The paper is not bad or wrong, but it does not advance the field significantly enough. It does not break new ground in performance, provide surprising new insights, or offer a new theory.
Official Review of Paper776 by Reviewer tuMB
The paper presents masked autoencoders for meshes. The idea is motivated by masked autoencoders in the image setting. The work treats meshes as graphs, with mesh faces as nodes and edges capturing mesh topology/connectivity. Masking is done by randomly removing nodes/faces along with their associated edges, similar to the idea of masked autoencoders. The actual method is a fairly direct application of masked AEs to the graph network setting, adopting a face attention mechanism. Comparisons are provided on SHREC11 and ModelNet40 (both datasets are close to saturation), with very marginal improvement, if any.
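For concreteness, the node/edge masking described above would look roughly like this; the uniform sampling scheme is my assumption, not necessarily the paper's exact procedure:

```python
import numpy as np

def mask_graph(num_nodes, edges, mask_ratio=0.5, rng=None):
    """edges: (E, 2) int array of face-adjacency pairs."""
    rng = rng or np.random.default_rng()
    masked = rng.random(num_nodes) < mask_ratio               # True = face node removed
    keep_edge = ~(masked[edges[:, 0]] | masked[edges[:, 1]])  # drop edges touching masked nodes
    return ~masked, edges[keep_edge]                          # surviving nodes and edges
```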
The paper has limited novelty and results in very marginal improvements.
- Uses masked AE in the context of meshes
- A simple adaptation and application to graph neural networks
- Very limited novelty
- Performance improvements are marginal at best
- Scores low wrt novelty and performance. Would have liked to see tests on more challenging datasets or real/scan data.
Section 3 is simple and concise. It is appropriate given the (limited) contribution of the work.
Novelty is low. A fairly direct adaptation of masked AEs from images.
Should be reproducible.
Limited novelty and very marginal performance enhancement prompt me to give a low score.
Official Review of Paper776 by Reviewer hqAC
In this paper, the authors propose a self-supervised method for learning mesh representations. The proposed method uses graph attention layers to build the model and a graph masking strategy to train it. The learned global feature is trained with the task of reconstructing point clouds. The trained model is evaluated on shape classification and segmentation tasks on the SHREC11 and ModelNet40 datasets.
Pros
-This paper is well written and easy to follow.
-The proposed representation learning is self-supervised, requiring no extra annotation effort. It also takes advantage of graph attention layers and graph masking to learn effective representations.
-Experimental results show the proposed method achieves performance comparable to existing methods on the shape classification and segmentation tasks. The analysis of the effect of masking ratios at the training and evaluation stages reveals that different ratios lead to performance variation.
Cons
-The technical contribution of this work is limited, given similar work on graph attention layers for point cloud/mesh processing and masking strategies in image processing. The study of the effect of the training and evaluation masking ratios is more an empirical study than a contribution, as no guidance is provided on how to select the optimal values without trying all possible ones.
- In Section 4.2, the best model is picked according to the lowest Chamfer Distance on the provided test data. Although the authors argue there is no label leakage, the involvement of test data in model selection is not acceptable. What is the average performance of a randomly picked model? (The standard form of the Chamfer Distance I have in mind is sketched after this list.)
-The authors argue that one reason their method outperforms others in Table 3 is the attention mechanism. However, evidence of the learned attention is expected to be provided to support this statement.
-In the text, the classification task is placed before the segmentation task, which differs from the order in Figures 3, 9, and 10. The authors are advised to use a consistent order.
-In Table 4, it is unclear whether the Multi-Task model is the one from the reference Hassani & Haley, 2019. Furthermore, more baselines on this task are expected to be added to Table 4.
-In Figure 4, both (b) and (c) are about the train and test masking ratios.
-It is unclear what the authors intend to state at the end of Section 5.1: 'Second, having the same masking ratio could make the model rely on finding masking information from the mesh'. It is unclear how finding masking information can help the tasks of classification or segmentation.
-The networks for classification and segmentation are shown in neither the main text nor the appendix. It is strange to me that the authors use 1-ring neighbors for the neighborhood lookup in the first two mesh graph attention blocks but 2-ring neighbors for the last layer. The authors may explain the reason for this setting.
-In Section 4.1, 'other night methods' should be 'other nine methods'. The format of the references in Table 3 is incorrect.
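For reference, the Chamfer Distance behind the Section 4.2 model-selection point above is presumably the standard symmetric form; a minimal sketch, assuming squared Euclidean distances (the exact variant the authors compute may differ):

```python
import numpy as np

def chamfer_distance(P, Q):
    """P: (N, 3), Q: (M, 3) point clouds."""
    d = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)  # (N, M) pairwise squared distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()  # symmetric nearest-neighbour average
```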
The paper is easy to follow. It combines existing techniques and conducts some further experimental analysis of the masking-ratio settings. However, the contribution is limited.
Overall, this paper lacks technical contribution and has unclear details as stated in the weaknesses part.
Official Review of Paper776 by Reviewer Z7zm
The paper proposes a masked autoencoder architecture for 3D meshes. The input mesh is processed so that its faces become the nodes of a connected, undirected graph. Some nodes of the resulting graph are then masked, and the graph is passed to attention layers; a final max-pool produces the graph embedding, which can be used for different applications. The method is tested on classification and part segmentation tasks, with some qualitative results on shape reconstruction, showing results on par with SoTA.
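To make the graph construction concrete: one node per face, with an edge whenever two faces share a mesh edge. A hypothetical illustration (my own code, not the authors'):

```python
from collections import defaultdict

def face_adjacency(faces):
    """faces: iterable of (i, j, k) vertex-index triples; returns face-pair edges."""
    edge_to_faces = defaultdict(list)
    for f, (i, j, k) in enumerate(faces):
        for a, b in ((i, j), (j, k), (k, i)):
            edge_to_faces[frozenset((a, b))].append(f)  # mesh edge -> incident faces
    # on a manifold mesh each edge is shared by at most two faces
    return [(fs[0], fs[1]) for fs in edge_to_faces.values() if len(fs) == 2]
```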
STRENGTH
S1) RESEARCH DIRECTION: Adapting masking and attention mechanisms to mesh data is a compelling and quite active research direction in the Geometric Deep Learning community.
S2) MASK DISCUSSION: The paper discusses the performance variation across different masking ratios. Such analysis is useful for future work, pointing out that there is a margin for finding more sophisticated masking policies.
WEAKNESSES
W1) NOVELTY: The proposed method does not introduce any particular methodological novelty. Attention mechanisms have already been applied to 3D shapes (e.g., "Shape registration in the time of transformers", Trappolini et al., 2021), as have masking techniques on meshes (Liang et al. 2022). On graphs, some attempts can be mentioned as well (e.g., "MGAE: Masked Autoencoders for Self-Supervised Learning on Graphs", Tan et al., 2022; "Graph Masked Autoencoders with Transformers", Zhang et al., 2022). Further references can be found in a recent survey ("A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond", Zhang et al., 2022). Similarly, several references can be found in "Transformers in 3D Point Clouds: A Survey", Lu et al., 2022. I am particularly concerned by (Liang et al. 2022), since it is mentioned in the related work without any positioning with respect to the paper's contribution. I think a proper discussion of the methodological novelty should be provided.
W2) RELATED WORKS AND COMPARISONS: I think there are significant works that are not discussed and with which a comparison would be relevant. DiffusionNet (Sharp et al., 2020) and DeltaConv (Wiersma et al., 2022) are both SoTA networks for meshes and point clouds that show results on the same datasets used in this paper (also, DeltaConv has the best results on ModelNet40). I also suggest modifying both the "Self-Supervised Learning" and "Transformer Applications" paragraphs to focus more on 3D data and, in particular, meshes.
W3) EXPERIMENTS: The architecture is quantitatively tested on only two datasets, one of which previous methods already saturate. The applicability of the method to more challenging and advanced tasks (e.g., protein segmentation, shape matching, non-rigid object segmentation/classification) is not clear. I think this significantly limits the impact of the work and does not clarify the real applicative scenarios of the method. Finally, while the results are promising, in general they do not fully reach the state of the art.
The novelty of the method is among my main concerns. Similar ideas have already been proposed in the same domain. I have no particular doubts about the method's reproducibility. The paper is overall clear, while the introduction contains vague statements which may convey a wrong message:
- "mesh, an irregular data format" -> "irregular" is unclear; since they enjoy more structure than a point cloud.
- "Traditional studies do not consider unlabeled data" -> what does "traditional" mean exactly?
MINOR FIXES: I have also found some typos:
- "neighing nodes" -> Neighbourhood nodes?
- "Equation1to" -> Missing a space
- "we first verify" -> Missing capital letter
My principal concerns are about the novelty of the method and the chosen experimental setting. In its present state, I do not think the work is ready for publication. I am looking forward to the rebuttal for a discussion of the work's original contribution and its positioning with respect to the missing methods.
Submission Withdrawn by the Authors