This document is formatted as follows:
a) reviewer comment 
b) answer (to reviewer)
c) action taken (for coauthors)

REVIEWER 3 
_______________


Weakness 1:


a) The models explored in the paper are quite limited. With only 18-layer ResNet, it is unclear if the conclusion is widely applicable. It would be better to explore more recent architectures or attention-based models commonly used in medical imaging

b) We thank the reviewer for this suggestion. During the development of our pipeline, we extensively piloted several architectures, including EfficientNet and Vision Transformers (ViT). While EfficientNet yielded similar trends, it demonstrated slower convergence and slightly poorer generalisation in our specific 3D MRI context compared with ResNet-18. Regarding ViT, we encountered significant GPU memory limitations when attempting a 3D implementation with standard hyperparameters (e.g., $16 \times 16 \times 16$ patch sizes). To ensure a fair, controlled comparison between the 2D and 3D settings, we required a shared backbone whose hyperparameters remained consistent across both. ResNet-18 was selected as it provides a robust, established baseline with high feature generalisation and computational efficiency, ensuring that the performance differences we observed were attributable to data dimensionality rather than architecture-specific tuning.

c) None.

Weakness 2:


a) The explainability method is also limited. The paper primarily relies on back propagation based explainability model, which may not fully capture the model behavior, given known limitations of gradient-based saliency methods.

b) We appreciate the reviewer’s thoughtful critique of gradient-based methods. We are well aware of the established limitations in this field, such as gradient saturation and shattering. However, in our study, we monitored for saturation (where gradients collapse to zero), and found that our methodology—which ranks relative pixel importance—successfully maintained informative attribution maps across the cohort. Regarding "shattered" or noisy gradients, we found that our aggregation approach (averaging saliency across the test set as shown in Fig. 2) naturally serves as a denoising mechanism. By focusing on the distribution of activations rather than individual noisy samples, we mitigate the stochasticity inherent in single-map backpropagation. We believe this provides a faithful representation of the model’s global logic, which is the primary focus of this study.
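As a minimal illustration of the aggregation argument (not the pipeline used in the paper), the sketch below rank-normalises each synthetic saliency map and then averages across samples. The function name `aggregate_saliency`, the map size, the number of samples, and the noise level are all assumptions made for this example.

```python
import numpy as np

def aggregate_saliency(maps):
    """Average per-sample saliency maps after rank-normalising each one.

    maps: array of shape (n_samples, *spatial_dims).
    Returns an array of shape spatial_dims with values in [0, 1].
    """
    maps = np.asarray(maps, dtype=float)
    n_samples = maps.shape[0]
    flat = maps.reshape(n_samples, -1)
    # Double argsort converts raw values into within-map ranks
    # (0 = least important pixel), then scale the ranks to [0, 1].
    ranks = flat.argsort(axis=1).argsort(axis=1) / (flat.shape[1] - 1)
    return ranks.mean(axis=0).reshape(maps.shape[1:])

# Synthetic demonstration: a weak square "signal" buried in per-sample noise.
rng = np.random.default_rng(0)
signal = np.zeros((16, 16))
signal[4:8, 4:8] = 1.0
maps = signal + rng.normal(scale=1.0, size=(200, 16, 16))

mean_map = aggregate_saliency(maps)
inside = mean_map[4:8, 4:8].mean()
outside = (mean_map.sum() - mean_map[4:8, 4:8].sum()) / (mean_map.size - 16)
print(f"mean rank inside signal: {inside:.2f}, outside: {outside:.2f}")
```

Individual maps in this toy setting are dominated by noise, whereas the averaged rank map separates the signal region from the background; this is the denoising effect the answer refers to, under the stated assumptions.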

c) None.



Weakness 3:


a) Although the authors use some augmentations, the 3D setup is constrained by voxel anisotropy and limited modality choices, which may underrepresent the potential of volumetric modeling. This may explain why the 3D models does not bring about additional benefits compared to 2D methods.



b) The reviewer correctly identifies voxel anisotropy as a significant challenge in MRI-based deep learning. In our study, the slice thickness is significantly larger than the in-plane resolution, creating a "non-cubic" voxel that complicates 3D spatial learning.

    We explored several strategies to mitigate this:

        1. Isotropic Resampling (Downscaling): We attempted to downscale the $x$-$y$ resolution to match the $z$ resolution. However, this resulted in a substantial loss of critical fine-texture features necessary for distinguishing MASLD/MASH.

        2. Upsampling (Interpolation): We attempted to upsample the $z$-axis to match the in-plane resolution, but this introduced unphysical "staircase" artifacts and interpolation noise that hindered the model's ability to learn real tissue characteristics.

        3. Augmentation Constraints: Given that patient positioning in the UKBB protocol is highly standardized, we found that 3D rotations across anisotropic axes did not yield meaningful anatomical variations and risked further data degradation, while also requiring expensive on-the-fly resampling operations. Consequently, we used the MONAI framework to implement the most robust 3D augmentations compatible with the physical constraints of the data, such as intensity shifts and axial-plane-constrained rotations.

    We therefore believe our results reflect the performance of volumetric modeling in real-world, anisotropic MRI cohorts. We would, however, welcome any specific suggestions the reviewer may have regarding alternative volumetric modeling approaches that we could explore in future research.
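The trade-off between the two resampling strategies listed above can be sketched as follows. This is an illustration only: the voxel spacing, the volume shape, and the use of `scipy.ndimage.zoom` with linear interpolation are assumptions for the example, not the UKBB acquisition parameters or the resampling code used in the study.

```python
import numpy as np
from scipy.ndimage import zoom

# Hypothetical anisotropic volume: fine in-plane spacing, thick slices.
spacing = (1.0, 1.0, 4.0)  # (x, y, z) in mm, illustrative values only
volume = np.random.default_rng(0).random((64, 64, 16))

# Strategy 1: downscale x-y so that in-plane spacing matches z spacing.
down_factors = (spacing[0] / spacing[2], spacing[1] / spacing[2], 1.0)
downscaled = zoom(volume, down_factors, order=1)

# Strategy 2: upsample z so that slice spacing matches in-plane spacing.
up_factors = (1.0, 1.0, spacing[2] / spacing[0])
upsampled = zoom(volume, up_factors, order=1)

print("original  :", volume.shape)
print("downscaled:", downscaled.shape,
      f"(keeps 1/{volume.size // downscaled.size} of the voxels)")
print("upsampled :", upsampled.shape,
      f"({upsampled.size // volume.size}x voxels, most of them interpolated)")
```

With these illustrative numbers, downscaling discards most in-plane samples, while upsampling produces a volume in which the majority of slices are interpolated rather than measured.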

c) None.



Detailed Comments:


a)  Thresholds such as the top 1% activation criterion and specific augmentation probabilities could be better motivated or supported by ablation studies.

b) We thank the reviewer for the comment regarding our hyperparameter selection. The top 1% activation threshold was selected following a qualitative pilot study in which we evaluated a range of thresholds from 0.5% to 5% on a small sample of test images. We observed that higher thresholds (e.g., 5%) included several disconnected regions. Conversely, lower thresholds (e.g., 0.5%) often selected only fragments of regions and failed to capture an entire connected component among the top-ranked pixels of the saliency maps.

We found that the 1% criterion consistently provided the optimal balance: it effectively suppressed stochastic background noise while preserving the structural integrity and "scannability" of the liver attribution regions. This level of sparsity ensures that the saliency maps are interpretable for medical practitioners, a primary objective of our transparency-focused approach. Regarding augmentation probabilities, these were tuned to maximize training stability and prevent overfitting on the anisotropic 3D volumes. While we agree that exhaustive ablation studies for every parameter are valuable, we believe our current settings represent a robust and anatomically valid configuration for this cohort.
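The kind of threshold comparison described in this answer can be sketched as below. The helper names `top_fraction_mask` and `count_components`, the synthetic two-bump saliency map, and the use of `scipy.ndimage.label` to count connected regions are assumptions for illustration; they are not the code or data used in the pilot study.

```python
import numpy as np
from scipy import ndimage

def top_fraction_mask(saliency, fraction):
    """Boolean mask of the pixels in the top `fraction` of saliency values."""
    threshold = np.quantile(saliency, 1.0 - fraction)
    return saliency >= threshold

def count_components(mask):
    """Number of connected regions in a boolean mask (default connectivity)."""
    _, n_components = ndimage.label(mask)
    return n_components

# Synthetic saliency map: one strong and one weaker bump plus faint noise.
rng = np.random.default_rng(0)
yy, xx = np.mgrid[0:64, 0:64]

def bump(cy, cx, height, sigma=4.0):
    return height * np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * sigma ** 2))

saliency = bump(20, 20, 1.0) + bump(45, 45, 0.6) + 0.02 * rng.random((64, 64))

for fraction in (0.005, 0.01, 0.05):
    mask = top_fraction_mask(saliency, fraction)
    print(f"top {fraction:.1%}: {int(mask.sum())} pixels, "
          f"{count_components(mask)} connected region(s)")
```

Sweeping the fraction in this way makes the qualitative criteria (region fragmentation at low thresholds, disconnected regions at high thresholds) countable per image.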

c) None.




Questions to Address in the Rebuttal:


a)  The study acknowledges label noise due to the temporal gap between MRI acquisition and biomarkers used for MASLD/MASH stratification. Can the authors quantify how performance varies as a function of the time gap, and clarify to what extent the noise issue is mitigated?

b) We thank the reviewer for this question, as it allows us to clarify the experimental design of our label noise analysis. This specific quantification is presented in Section 3.2 and Table 4, supported by the statistical analysis in the final paragraph of the Results section.

    To isolate the effect of temporal misalignment, we maintained a fixed, high-quality test set (comprising participants with the shortest time gaps) while training separate model instances on three distinct subsets partitioned by time gap: Short (6.0–9.1 years), Medium (9.1–10.3 years), and Long (10.3–12.9 years). As shown in Table 4, classification metrics remained remarkably stable despite the significant temporal discrepancies in the training labels:

        1. Healthy Group: AUC remained identical at 0.89 across all subsets ($H=0.0, p=1.0$; the $H$-statistic of 0 and $p$-value of 1.0 indicate no detectable difference in the distributions of prediction scores across time gaps).

        2. MASLD Group: Showed only minor, non-significant variation ($H=2.44, p=0.295$; the $p$-value well above 0.05 indicates that we cannot reject the null hypothesis that performance is the same across all time-gap groups).

        3. MASH Group: While this group showed the most variance (expected given the complexity of fibrosis), the results remained statistically non-significant ($H=4.79, p=0.091$; even at the longest time gap of 12.9 years, the performance does not deviate significantly from the short-gap baseline).

    These results provide a direct quantification of performance as a function of the time gap. We believe the noise is effectively mitigated by the model’s inherent robustness and the large-scale nature of the UKBB cohort ($N=18,073$). By training on thousands of examples, the network learns to extract the consistent hepatic features associated with each disease stage, effectively "averaging out" the stochastic label noise caused by individual longitudinal disease progression or regression over the ~10-year window.
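For readers less familiar with the $H$-statistic, the comparison across time-gap subsets can be sketched with a Kruskal-Wallis test via `scipy.stats.kruskal`. We assume here that the reported $H$ and $p$ values come from such a test; the scores below are synthetic placeholders drawn from one distribution, not the study's predictions, and the group sizes are arbitrary.

```python
import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(0)

# Synthetic placeholder prediction scores for models trained on each
# time-gap subset and evaluated on the same fixed test set.
# These are NOT the study's data; all three groups share one distribution.
scores = {
    "short (6.0-9.1 y)": rng.beta(5, 2, size=300),
    "medium (9.1-10.3 y)": rng.beta(5, 2, size=300),
    "long (10.3-12.9 y)": rng.beta(5, 2, size=300),
}

# Kruskal-Wallis H-test: the null hypothesis is that all groups are drawn
# from the same distribution (it compares rank sums across groups).
h_stat, p_value = kruskal(*scores.values())

alpha = 0.05
print(f"H = {h_stat:.2f}, p = {p_value:.3f}")
print("reject null" if p_value < alpha else "cannot reject null at alpha = 0.05")
```

A $p$-value above the chosen significance level means the test found no evidence of a difference between time-gap groups; it does not by itself prove that the groups are equivalent.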

c) None.

