[RE] A Reproducibility Study on Scene-Graph Generation from 3D Point Clouds: Hybrid Approach with Clip, 2D Image Semantics, and 3D Geometry

TMLR Paper2237 Authors

16 Feb 2024 (modified: 12 Oct 2024) | Rejected by TMLR | Readers: Everyone | License: CC BY 4.0

Abstract: Reproducibility Summary

Scope of Reproducibility: This paper examines the reproducibility of VL-SAT and of multimodal learning systems for 3D semantic scene graph prediction. Using visual (ViT, CLIP) and linguistic semantics, our study replicates the top-k accuracy results and explores models such as SGFN and SGGPoint. We assess the impact of the CLIP adapter and of 2D image semantics, and we conduct hyperparameter tuning. In addition, an ablation study investigates node and edge collaboration and the influence of the multi-head self-attention network within the VL-SAT architecture. Our code can be accessed at https://github.com/dnabanita7/CVPR2023-VLSAT-reproducibility/.

Methodology: We use the open-source code released by the authors to generate datasets, create point cloud data, and train and validate VL-SAT. Our implementation covers 150 of the original 1553 3D reconstructed indoor scenes and keeps the 160 object classes and 26 predicate types described in the paper. We also collaborated with the authors to integrate code for the SGFN and SGGPoint models into our code-base. Beyond that, we implemented the provided specifications and filled in the gaps needed to obtain a complete pipeline supporting all experiments. Our experiments ran on an NVIDIA GeForce RTX 3090 GPU and used a total of 100 GPU hours for training. We obtained access to GPU compute resources through collaboration with the ML Collective team.

Results: The authors' code could not be run as provided; it required substantial modifications and additions, including a number of new files. After making these changes, we conducted reproducibility tests, ablation studies, and hyperparameter tuning. Our results largely support the main claims of the paper within a significant subset of experiments. However, many of the values we obtained differ notably from those reported. We therefore conclude that the paper's findings are largely replicable, but that reproducing the exact results requires additional effort because of the extensive changes and additions needed in the provided code.

What was easy: The primary claims of the paper and the corresponding experimental evidence were easy to identify. The authors' open-source implementation also made it easier to train the model, conduct ablation studies, and tune hyperparameters.

What was difficult: Configuring the datasets was challenging, primarily because dependencies were not pinned, and the lack of code for generating the 3D datasets delayed our experiments. Identifying the sources of the discrepancies in our findings was also difficult, a problem compounded by the unavailability of training curves and of model weights or checkpoints. These limitations hindered our ability to replicate the reported results precisely and required additional effort in troubleshooting and refining our implementation.

Communication with original authors: At the start of the project we were in regular email contact with the authors, who provided valuable insights and resources that increased the depth and scope of our study. After the code for the models under investigation had been integrated, however, our further attempts to correspond with the authors received no reply.
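As an illustration of the top-k accuracy metric that the summary refers to, the following is a minimal sketch in plain Python. It assumes the generic definition (a prediction counts as correct when the true class is among the k highest-scoring classes); the function name `top_k_accuracy` and the toy scores and labels are hypothetical and are not taken from the VL-SAT code-base or from the reported results.

```python
from typing import Sequence


def top_k_accuracy(
    scores: Sequence[Sequence[float]],
    labels: Sequence[int],
    k: int,
) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    Generic definition, assumed for illustration; the evaluation protocol of
    the original paper may differ in its details.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if k < 1:
        raise ValueError("k must be at least 1")
    if not labels:
        return 0.0

    hits = 0
    for row, label in zip(scores, labels):
        # Rank class indices by descending score and keep the first k.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical toy data: 4 samples, 3 classes.
toy_scores = [
    [0.1, 0.7, 0.2],  # true class 1: ranked first
    [0.5, 0.3, 0.2],  # true class 1: ranked second
    [0.2, 0.3, 0.5],  # true class 0: ranked third
    [0.6, 0.1, 0.3],  # true class 0: ranked first
]
toy_labels = [1, 1, 0, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
top3 = top_k_accuracy(toy_scores, toy_labels, k=3)
```

On this toy data the accuracy rises as k grows, since a larger k admits lower-ranked correct predictions. That is why values reported at different k are not directly comparable.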

Submission Length: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Antoni_B._Chan1

Submission Number: 2237
