This document is formatted as follows:
a) reviewer comment 
b) answer (to reviewer)
c) action taken (for coauthors)



REVIEWER 2
____________
Weakness 1 & Rebuttal Question 1

a) Key training and compute details are missing. Please report model capacity (number of parameters), batch size for training, optimizer, training time, and GPU memory usage for both the 2D and 3D settings, along with the hardware/software stack (e.g., GPU model).



b) We thank the reviewer for highlighting the need for greater transparency regarding computational resources and model specifications. We have added a comprehensive breakdown in a new Appendix B (Implementation Details). This appendix explicitly reports the hardware specifications (NVIDIA L4 GPU), optimization scheme (Adam optimizer, batch size), model capacity (approx. 11M parameters for 2D vs. 33M for 3D), and comparative training times.

c) Updated Appendix B to include a detailed report on model parameters, batch size, optimizer settings, training duration, and hardware specifications.
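
Note for coauthors: the roughly threefold gap between the 2D and 3D parameter counts follows from the 3x3 kernels becoming 3x3x3. The sketch below is a sanity check only, not our training code. It assumes the standard ResNet-18 layout (bias-free convolutions, batch norm with two parameters per channel, a 7x7 or 7x7x7 stem, 1x1 downsample shortcuts); the function names, `in_channels=1` and `num_classes=3` are illustrative placeholders, not values taken from the manuscript.

```python
def conv_params(c_in, c_out, kernel, dims):
    """Weights of a bias-free convolution with a square (2D) or cubic (3D) kernel."""
    return c_in * c_out * kernel ** dims


def bn_params(channels):
    """Learnable scale and shift of a batch-norm layer."""
    return 2 * channels


def resnet18_param_count(dims, in_channels, num_classes):
    """Analytic parameter count for a standard ResNet-18 in `dims` spatial dimensions."""
    # Stem: one 7x7 (or 7x7x7) convolution followed by batch norm.
    total = conv_params(in_channels, 64, 7, dims) + bn_params(64)
    c_in = 64
    for c_out in (64, 128, 256, 512):
        for _ in range(2):  # two basic blocks per stage
            total += conv_params(c_in, c_out, 3, dims) + bn_params(c_out)
            total += conv_params(c_out, c_out, 3, dims) + bn_params(c_out)
            if c_in != c_out:  # 1x1 downsample shortcut in the first block of stages 2-4
                total += conv_params(c_in, c_out, 1, dims) + bn_params(c_out)
            c_in = c_out
    # Final fully connected classifier (weights + biases).
    total += 512 * num_classes + num_classes
    return total


if __name__ == "__main__":
    # in_channels and num_classes are assumed values for illustration.
    for dims in (2, 3):
        n = resnet18_param_count(dims, in_channels=1, num_classes=3)
        print(f"{dims}D ResNet-18: {n / 1e6:.2f}M parameters")
```

Under these assumptions the counts land close to the approx. 11M and 33M figures quoted above. The exact numbers reported in Appendix B should still come from the trained models themselves.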



Weakness 2 & Rebuttal Question 2


a) The conclusion that “3D provides no benefit over 2D” would be stronger if stated more cautiously and supported with clearer justification. Because 2D vs 3D comparisons can be confounded by differences in acquisition/protocol, preprocessing/labeling... it is difficult to interpret performance differences as purely attributable to dimensionality. It would help to (i) describe more clearly what is held constant vs different... and (ii) cite prior studies that report similar findings.

b) We appreciate this suggestion and agree that a comparison stated without nuance could be misleading. We have revised the Discussion (Section: 2D vs 3D comparison) to clarify the specific boundary conditions of our results. We now explicitly state that while the backbone architecture and label sets were held constant, the input preprocessing differed significantly due to computational constraints (segmentation + Three-Point-Dixon for 2D vs. bounding box detection + graph-cuts for 3D). Furthermore, we acknowledge that the unavailability of volumetric $T_1$ sequences in the UK Biobank is a critical confounder, as $T_1$ is essential for MASH diagnosis. Consequently, we have softened our conclusion to reflect that 2D is the superior "practical" choice given current data availability and computational constraints, rather than claiming inherent theoretical superiority.

c) Revised the "2D vs 3D comparison" subsection in the Discussion to explicitly list controlled variables vs. confounding differences (preprocessing pipelines, $T_1$ availability) and softened the conclusion to emphasize practical utility over theoretical superiority.


Weakness 3 & Rebuttal Question 3

a) (Weakness 3): While masking does not appear to improve overall classification accuracy, it does shift model attention (as reflected by the saliency metrics in Table 3 and Fig. 3). It would strengthen the paper to clarify what practical benefit this provides, for example, whether masking can reduce training cost (fewer epochs, faster convergence) or improve data efficiency (achieving similar performance with less training data). Otherwise, given sufficient data/epochs, a ResNet-18 may learn similar cues from the original unmasked data alone, which makes the added value of masking less clear. An ablation study varying the number of training epochs and/or training-set size would help support this claim. This is especially useful for computationally expensive 3D training.

Rebuttal Question 3: Explain advantage of masking in detail.

b) We thank the reviewer for this inquiry on the potential advantages of masking. We have expanded the discussion on the "Impact of liver masking and interpretability" in Section 5. We clarify that the primary benefit of masking in our context is not performance gain or training speed, but rather safety and anatomical validity. As demonstrated by our saliency metrics (Table 3), masking significantly improves the localisation of decision-relevant features ($P$ score) within the liver. This prevents "shortcut learning" from extrahepatic features (e.g., subcutaneous fat), and ensures that the model relies on clinically relevant regions while AUC remains similar. This robustness is a critical aspect for clinical trust, independent of training efficiency.

c) Expanded the "Impact of liver masking and interpretability" section in the Discussion to clearly articulate that the value of masking lies in preventing shortcut learning and ensuring anatomical validity, rather than solely improving accuracy or convergence speed.
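
Note for coauthors: to make the localisation argument concrete, the sketch below computes the share of saliency mass that falls inside a binary liver mask. This is one plausible way to quantify localisation and is an assumption for illustration; it is not necessarily the exact definition of the $P$ score in Table 3. The function name and the toy arrays are hypothetical.

```python
def saliency_fraction_inside(saliency, mask):
    """Share of total absolute saliency that falls on masked pixels (between 0 and 1).

    `saliency` and `mask` are 2D lists of equal shape; mask entries are
    truthy inside the organ and falsy outside.
    """
    inside = 0.0
    total = 0.0
    for sal_row, mask_row in zip(saliency, mask):
        for s, m in zip(sal_row, mask_row):
            total += abs(s)
            if m:
                inside += abs(s)
    # An all-zero saliency map carries no localisation information.
    return inside / total if total > 0 else 0.0


if __name__ == "__main__":
    # Toy 2x3 saliency map and liver mask (made-up values).
    saliency = [[0.0, 0.2, 0.0],
                [0.1, 0.5, 0.2]]
    mask = [[0, 1, 0],
            [0, 1, 1]]
    print(f"fraction inside mask: {saliency_fraction_inside(saliency, mask):.2f}")
```

A value near 1 means the model's attention is concentrated inside the liver, while a low value at a similar AUC would point to reliance on extrahepatic cues.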



Detailed Comment 1:

a) In Table 1, please also highlight the MASH results for the T1-only setting (first three rows), not only the T1+PDFF configuration.

b) We thank the reviewer for their careful examination of our results. We have corrected Table 1 to accurately bold the top-performing models (and those statistically equivalent) across all health classes, including the $T_1$ MASH results, to provide a fair representation of single-modality performance.

c) Updated Table 1 formatting to bold the top-performing results for $T_1$-only inputs regarding MASH classification.




Detailed Comment 2:


a) If you used complex-valued T1 MRI images, please clarify the representation used in the network (e.g., real/imaginary as two channels).

b) We appreciate the opportunity to clarify our data usage. The $T_1$ maps provided by the UK Biobank are quantitative maps (shMOLLI sequence), where each pixel represents a scalar relaxation time in milliseconds (ms). These are inherently real-valued, positive numbers derived from the magnitude reconstruction of the original complex-valued acquisition. Therefore, no complex-valued representation (real/imaginary channels) was required or used in our network architecture.
For a more in-depth explanation, we recommend: https://mriquestions.com/what-is-t1.html

c) No action, just reply to the reviewer.



Detailed Comment 3:

a) It would be helpful to comment on other potential applications of this framework beyond liver stratification and beyond MRI, and what adaptations would be needed for those settings.

b) We thank the reviewer for this forward-looking suggestion. We have adjusted the final paragraph of the Discussion and Conclusion to address the translational potential of our work. We now discuss how our framework, specifically the integration of disease stratification with explainability metrics, could be adapted for other pathologies where "gold standard" labels are noisy or unavailable. We outline that while the core methodology remains applicable, adaptations would be required regarding disease-specific severity labeling and the selection of modality-appropriate explainability methods.

