This document is formatted as follows:
a) reviewer comment 
b) answer (to reviewer)
c) action taken (for coauthors)



REVIEWER 1 (bsbi):
________________________



Weaknesses:

a)
The claim that the 3D model is worse than the 2D model may be caused by differences in how the images were prepared and processed and how the input of the model is designed, rather than just the use of 3D data itself. Specifically, the 3D images went through intermediate steps, and these steps may have altered the image details, as the authors mentioned in the paper. Additionally, the 3D method relied on an automatic tool to find and crop the liver that was not perfect (about 80% accuracy), which likely introduced errors that the 2D method avoided.

b) 

Bounding Box Accuracy: We acknowledge that the automated bounding box detection has an IoU of ~80%. However, our preprocessing pipeline incorporates a padding strategy to include a larger volumetric domain around the detected box. This ensures that the liver is fully captured within the input volume, even if the initial detection is imperfect, effectively mitigating the risk of cropping out relevant liver regions. We have clarified this procedure in the last paragraph of Appendix A, right after reporting predicted bounding box errors. We also note that processing full-scale 3D MRI volumes without cropping is currently infeasible due to GPU memory constraints (OOM errors).
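For coauthors, the padding idea described above can be sketched as follows. This is only a minimal illustration, assuming NumPy volumes and a box given as inclusive-minimum and exclusive-maximum voxel indices; the function name `crop_with_padding` and the margin of 8 voxels are hypothetical and are not the values used in our pipeline.

```python
import numpy as np

def crop_with_padding(volume, box_min, box_max, pad):
    """Crop `volume` to a bounding box enlarged by `pad` voxels on every side.

    box_min is inclusive and box_max is exclusive. The enlarged box is clamped
    to the volume bounds, so the crop never indexes outside the image.
    """
    shape = np.array(volume.shape)
    lo = np.maximum(np.array(box_min) - pad, 0)      # extend towards the origin
    hi = np.minimum(np.array(box_max) + pad, shape)  # extend towards the far corner
    slices = tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))
    return volume[slices]

# Hypothetical example: a 64^3 volume and an imperfect detected box.
volume = np.zeros((64, 64, 64), dtype=np.float32)
crop = crop_with_padding(volume, box_min=(10, 20, 0), box_max=(30, 40, 60), pad=8)
print(crop.shape)  # (36, 36, 64)
```

The margin trades memory for safety: a larger `pad` tolerates larger detection errors but increases the size of the input volume.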

Preprocessing Differences: We agree that the graph-cut stitching used for volumetric UKBB data introduces minor intensity shifts and registration artefacts that are not present in 2D slices. These effects are inherent to the provided data and unavoidable, which motivates our use of aggressive data augmentation (intensity and contrast shifts) in both the 2D and 3D pipelines to improve robustness against such variations. We have made this explicit in the 2D vs 3D comparison part of the Discussion and Conclusion section of the paper.
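As an illustration for coauthors, an intensity and contrast shift of the kind mentioned above could look like the sketch below. It assumes NumPy arrays with intensities roughly normalised to [0, 1]; the function name and the ranges `max_shift` and `max_contrast` are hypothetical placeholders, not the settings reported in the paper.

```python
import numpy as np

def random_intensity_contrast(image, rng, max_shift=0.1, max_contrast=0.2):
    """Apply a random global intensity shift and contrast scaling.

    Works for 2D slices and 3D volumes alike, since it only uses the global
    mean of the array. Ranges are illustrative assumptions.
    """
    shift = rng.uniform(-max_shift, max_shift)               # additive intensity offset
    gain = rng.uniform(1.0 - max_contrast, 1.0 + max_contrast)  # contrast factor
    mean = image.mean()
    # Scale deviations from the mean (contrast), then add the offset (intensity).
    return (image - mean) * gain + mean + shift

rng = np.random.default_rng(0)
slice_2d = rng.random((32, 32))
volume_3d = rng.random((16, 32, 32))
aug_2d = random_intensity_contrast(slice_2d, rng)
aug_3d = random_intensity_contrast(volume_3d, rng)
```

Because the contrast scaling is centred on the mean, the mean of the output differs from the mean of the input only by the sampled shift.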

c) Action Taken: Updated Appendix A to explicitly describe the padding strategy used to counteract imperfect bounding box detection: "To counteract the effect of imperfections in our liver [detection]... within the volumetric inputs..."


Detailed comments:

a) You may want to define the meaning of 3D masking in Table 1.

b) We thank the reviewer for pointing out this ambiguity. We have updated the notation in Table 1 from "3D" to "BB" (Bounding Box) to clearly distinguish it from segmentation masking in 2D. Furthermore, we have expanded the first paragraph of the Results section to explicitly state that 3D inputs were cropped using the bounding box method described in Appendix A, as well as added a column to clarify the dimensionality of each input (2D/3D).

c) Changed the Masking entry in Table 1 from "3D" to "3D BB", and expanded the first paragraph of Results with a more detailed description of what 3D BB means in this context, with a reference to Appendix A.



Detailed comments:

a) To enhance reproducibility, please provide the code for the paper or specify the details of the network structures in the Appendix. A reference to an '18-layer ResNet architecture' is too general.


b) We appreciate the reviewer’s emphasis on reproducibility. While strict data privacy regulations regarding the UK Biobank dataset prevent us from sharing the full pipeline code, we have significantly expanded the model details by adding Appendix B. This appendix now includes the specific hyperparameters used, detailed architecture specifications, and a link to the open-source GitHub repository that served as the backbone for our ResNet3D implementation. This allows other researchers to reproduce the model architecture and training conditions on their own data.


c) Updated Appendix B to include full architectural details, hyperparameters, and a reference/link to the base model repository. Added a citation to this Appendix in the main text.

Questions to address in rebuttal:

a) On Table 1, for lines 6, 9 and 12:

Data        Masking   Healthy       MASLD         MASH
PDFF        LRI       0.86 ± 0.00   0.94 ± 0.01   0.72 ± 0.01
T1 + PDFF   LRI       0.88 ± 0.01   0.95 ± 0.01   0.77 ± 0.02   0.88 ± 0.01
PDFF        3D        0.85 ± 0.01   0.89 ± 0.01   0.73 ± 0.01

b) We thank the reviewer for their careful examination of our results. We have corrected Table 1 to accurately bold the top-performing models across all health classes by highlighting the T1 MASH results.

c) Corrected Table 1 bold formatting to highlight the top-performing results for T1-(OI, AM, LRI)-MASH consistently

Questions to address in rebuttal:
a) Is it possible that the performance advantage of the 2D model is driven by the availability of T1 inputs?

b) This is an excellent point. We agree that the unavailability of volumetric (3D) T1 sequences is a major factor contributing to the performance gap, particularly for MASH diagnosis, where T1 signals are critical. While 2D PDFF performed comparably to 3D PDFF, the overall superiority of the 2D approach in our study is indeed driven by the synergy between T1 and PDFF, a combination that is currently only possible in 2D due to data availability. We have expanded the Discussion (Section: 2D vs 3D comparison) to explicitly acknowledge that the lack of 3D T1 data prevents a fully balanced comparison, and that future availability of volumetric T1 sequences could alter these conclusions.



