\section{Experiments}
% Results}
\label{sec:sec4}
\subsection{Comparison and Ablation Studies}
To assess our modulation method, we conducted experiments across all considered applications using two setups: \textbf{(i)} conventional convolutions (No MOD), and \textbf{(ii)} modulated convolutions as described in \Sec{sec3.1}, in three configurations: MOD S, MOD M, and MOD L, corresponding to input/output feature sizes of (32, 8), (32, 16), and (32, 32), respectively.

In addition, for MRI experiments, where instance normalization is used within the U-Net convolutions, we evaluate an adaptive instance normalization variant in which all instance normalization layers are replaced by adaptive instance normalization conditioned via MLPs with 32 and 16 hidden features (AdaIn M). To assess our method further, we also apply the proposed modulation only at the input of the network (MOD M - inp-only): the very first convolutional block in the U-Net encoder is modulated, while all subsequent encoder, bottleneck, decoder, and output layers remain unmodulated. This substantially reduces the number of additional parameters introduced by modulation.

\subsection{Quantitative Analysis}
For our quantitative comparison, we used established image-quality metrics: the Structural Similarity Index Measure (SSIM) and Peak Signal-to-Noise Ratio (pSNR) for all tasks, Normalized Mean Squared Error (NMSE) for accelerated MRI reconstruction, and Mean Absolute Error (MAE/L1) for CT and CBCT reconstruction. The mathematical formulations of these metrics can be found in \Appendix{appendix3-metrics}.

\subsection{Accelerated MRI Reconstruction}
\label{sec:sec4.3}
\subsubsection{Datasets}

We evaluate our method using two distinct datasets for 2D reconstruction. Specifically, we utilized the prostate \cite{tibrewala2023fastmri} and knee \cite{zbontar2019fastmri} fastMRI datasets which comprise raw fully-sampled  $k$-space data. The prostate data contain T2-weighted scans with 10--30 coils; the knee data comprise coronal proton-density--weighted scans acquired with 16 coils. For the prostate dataset, we used 218 subjects (6,647 slices) for training, 48 (1,462 slices) for validation, and 46 (1,399 slices) for testing. For the knee dataset, we used 973 volumes (34,742 slices) for training, 99 (3,573 slices) for validation, and 100 (3,562 slices) for testing. Our training involved retrospective undersampling of the data, while utilizing the fully-sampled  measurements for loss calculation.

In addition, we used the Cardiac MRI Reconstruction 2025 \cite{b6xs-gv29-25} training dataset for 2D dynamic reconstruction experiments. This dataset contains multi-field-strength data acquired at both 1.5T and 3T with multiple acquisition sequences, all using 10 coils. We split the data into 444 4D scans (1,970 2D+time series) for training, 74 scans (331 2D+time series) for validation, and 76 scans (275 2D+time series) for testing, ensuring a balanced distribution of 1.5T and 3T acquisitions across splits.


\subsubsection{Undersampling}
To simulate various acceleration factors ($R$), we retrospectively undersampled the fully-sampled data, preserving a specific fraction ($r_\text{acs}$) of the data in the autocalibration region for each factor. Importantly, the two parameters $R$ and $r_\text{acs}$ served as auxiliary variables for model modulation. For training, we randomly selected $R$ within $[4, 16]$, favoring higher acceleration factors (e.g., factors near 16 were chosen four times as often as those near 4) through a continuous triangular distribution (see \Appendix{appendix3-triang-dist}). The $r_\text{acs}$ values were drawn from a uniform distribution over $[0.02, 0.08]$. In the testing phase, the models were evaluated at predefined $R$ values (4, 6, 8, 10, 12, 14, and 16) and their corresponding $r_\text{acs}$ values (0.08, 0.06, 0.04, 0.035, 0.03, 0.025, and 0.02, respectively). We adopted an equispaced undersampling approach, as it aligns with common practice in DL-based MRI reconstruction and is straightforward to implement in clinical applications.
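
As an illustration of this sampling scheme, and assuming that the density is linear in $R$ (the exact form is given in \Appendix{appendix3-triang-dist} and may differ), the requirement that $R=16$ be four times as likely as $R=4$ gives
\begin{equation}
    p(R) = \frac{R}{120}, \quad R \in [4, 16], \qquad \int_4^{16} \frac{R}{120} \, \mathrm{d}R = \frac{16^2 - 4^2}{240} = 1,
\end{equation}
so that $p(16)/p(4) = 4$. Under the same assumption, inverting the cumulative distribution $(R^2 - 16)/240$ yields the sampler $R = \sqrt{16 + 240\,u}$ with $u$ drawn uniformly from $[0, 1]$.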

\subsubsection{Modulation Auxiliary Variables}
To modulate our convolutional networks in the Accelerated MRI Reconstruction setup, we utilize as auxiliary variables the acceleration factor ($R$) of each sample, as well as the ACS fraction ($r_\text{acs}$) as defined in \Appendix{appendix1-mri-subsampling}. Where available and varied in the datasets, we include the field strength ($F$) of the sample as an auxiliary variable. More precisely, we use:
% 
\begin{equation}
    \vec{z} = \log([R,  \, 100\cdot r_\text{acs}]) \in \mathbb{R}^2 \quad \text{or} \quad \vec{z} = \log([R,  \, 100\cdot r_\text{acs}, \, F]) \in \mathbb{R}^3.     
\end{equation}
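
For illustration, and assuming that the natural logarithm is applied element-wise, the two extreme test configurations $(R, r_\text{acs}) = (4, 0.08)$ and $(16, 0.02)$ would map to
\begin{equation}
    \vec{z} = \log([4, \, 8]) \approx [1.39, \, 2.08] \quad \text{and} \quad \vec{z} = \log([16, \, 2]) \approx [2.77, \, 0.69],
\end{equation}
so both components lie in a similar numerical range.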
% 

\subsubsection{Training and Optimization Strategy}
Models were implemented in PyTorch \cite{paszke2017automatic} and optimized using Adam \cite{kingma2017adam} with $(\beta_1, \beta_2) = (0.9, 0.999)$ and $\epsilon = 1\mathrm{e}{-8}$. Training was performed on NVIDIA A100 or H100 GPUs, using batch sizes of 2 or 1 for static and dynamic reconstruction, respectively. Static models were trained for 150{,}000 iterations, while dynamic models were trained for 80{,}000 iterations. The learning rate schedule linearly increased from $6.7\mathrm{e}{-4}$ to $2\mathrm{e}{-3}$ over the first 1{,}000 iterations, followed by a $20\%$ decay every 30{,}000 iterations. Across all experiments, random data augmentations were applied during training, including cropping, flipping, and rotation, to improve robustness and learning efficacy.
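
One way to write this schedule explicitly, assuming that the $20\%$ decay is multiplicative and that decay steps are counted from iteration $0$ (neither detail is stated above), is
\begin{equation}
    \eta(t) =
    \begin{cases}
        6.7\mathrm{e}{-4} + \left(2\mathrm{e}{-3} - 6.7\mathrm{e}{-4}\right) \dfrac{t}{1000}, & t < 1000, \\[4pt]
        2\mathrm{e}{-3} \cdot 0.8^{\lfloor t / 30000 \rfloor}, & t \geq 1000,
    \end{cases}
\end{equation}
where $t$ is the iteration index. Under these assumptions, the final stage of static training ($120{,}000 \leq t < 150{,}000$) would use $\eta = 2\mathrm{e}{-3} \cdot 0.8^{4} \approx 8.2\mathrm{e}{-4}$.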

The vSHARP reconstruction models followed the architectural and loss design choices of \cite{yiasemis2023vsharp}, using multi-scale U-Nets for denoising and sensitivity estimation, with task-specific configurations for static and dynamic reconstruction. A dual-domain loss combining image- and $k$-space-based terms was employed. Full architectural details, augmentation specifications, and hyperparameter choices are provided in \Appendix{training_details}.



\subsubsection{Results}
\input{tabs/mri_results}
\input{tabs/mri_results_cardiac}
Metrics are calculated between the magnitude of the ground truth image and the magnitude of the predicted image. Note that for both datasets we compute the quantitative results on the central $320 \times 320$ region of the reconstructed image. The average SSIM and pSNR results for 2D reconstruction are detailed in \Tab{mri_results_ssim_psnr} (for $R=4$--$10$) and \Tab{mri_results_ssim_psnr_10_16} (for $R=12$--$16$), and the NMSE results in \Tab{mri_results_nmse}. The results for 2D dynamic reconstruction are provided in \Tab{mri_ssim_psnr_results_cardiac}. Overall, our findings reveal a consistent trend: models equipped with modulated convolutions outperform their non-modulated counterparts in both prostate and knee reconstruction.

\input{figs/mri_sample}
\input{figs/mri_sample_cardiac}

For the knee dataset, MOD M is the top performer on average, and all modulation variants (MOD S, M, and L) outperform the baseline models, with only three exceptions at specific acceleration factors ($R=6$ for pSNR and $R=8$ for SSIM with MOD S, and $R=4$ with MOD M - inp-only). In the prostate dataset, by contrast, the non-modulated models performed slightly better than MOD S and MOD M in certain cases ($R=8, 10, 12$), whereas MOD L consistently surpassed the non-modulated models.

To better contextualize the results, we additionally report the parameter count and inference time of all evaluated models in \Tab{2d_mri_params_time}. This analysis reveals that while larger modulation variants such as MOD M and MOD L substantially increase the number of learnable parameters, their inference times remain comparable to that of the non-modulated baseline.

A key insight from our evaluation of accelerated MRI reconstruction is that the improvement in reconstruction metrics offered by modulated over non-modulated methods is more pronounced at higher acceleration factors. Another noteworthy observation is that, while MOD M achieves the strongest overall performance, MOD M with input-only modulation also consistently outperforms the non-modulated baseline at a substantially lower parameter count (see \Tab{2d_mri_params_time}).
% This finding underscores the substantial advantage of incorporating modulation in challenging reconstruction scenarios.



For qualitative assessment, Figures \ref{fig:mri_recon} and \ref{fig:mri_recon_cardiac} depict example reconstructions from the knee and cardiac test datasets at $R=16$, $r_{\text{acs}}=0.02$ and at $R=4$, $r_{\text{acs}}=0.08$, $F=3$, respectively, demonstrating the advantage of modulation.

% , providing visibly improved reconstructions over the non-modulated model.

% In Supplementary Material \textcolor{red}{B.3} we investigate further experimental settings for different types of the generalized modulation as presented in Supplementary Material \textcolor{red}{A.1}. The results confirm our original method's superiority in most scenarios. Interestingly, `full modulation' setting where all components of the kernel are learnable, similar to the continuous kernel convolutions from \cite{Romero2022}, did not result in an improvement over simpler modulated convolutions. See Supplementary Material \textcolor{red}{B.3} for further details.





\subsection{Computed Tomography}
\label{sec:sec4.4}
\subsubsection{Datasets}
For the Cone-beam CT experiments, we used an internal dataset of $424$ diagnostic thorax CT scans with isotropic spacing of $1$ mm that were downsampled to $2$ mm resolution. The dataset was split into a training set of 260 scans, a validation set of 22 scans, and a test set of 142 scans. We simulated a clinical acquisition geometry for a Linac-integrated CBCT scanner from Elekta AB, Stockholm, Sweden~\cite{Letourneau2005} with a medium field-of-view setting, an offset detector, a full $2 \pi$ scanning trajectory, and either (a) $64$ projections, to simulate the low projection count observed in, e.g., phase-resolved 4D CBCT reconstruction, or (b) a variable projection count $\mathrm{N_{proj}} \in [237,720]$, to simulate the variability observed in, e.g., pelvic CBCT acquisitions. The source-isocenter distance was set to $1000$ mm and the isocenter-detector plane distance to $536$ mm. The detector was offset by $115$ mm to the side \cite{SSharma2014} to increase the field of view. A square detector panel with a side length of $409.6$ mm and a $256 \times 256$ pixel matrix was used.
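
As a quick consistency check on this geometry, and treating the detector as flat and perpendicular to the central ray (an assumption made here for illustration), the magnification $M$ and the detector pixel pitch $\Delta_\text{det}$ follow as
\begin{equation}
    M = \frac{1000 + 536}{1000} = 1.536, \qquad \Delta_\text{det} = \frac{409.6 \text{ mm}}{256} = 1.6 \text{ mm}, \qquad \frac{\Delta_\text{det}}{M} \approx 1.04 \text{ mm},
\end{equation}
that is, one detector pixel corresponds to roughly $1$ mm at the isocenter, which is finer than the $2$ mm voxel spacing of the reconstructed volumes.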

For the Fan-beam CT experiments, we used a subset of the Mayo Clinic dataset for the AAPM Low Dose CT Grand Challenge \cite{McCollough2016}, which was split into training (2961 slices), validation (358 slices), and test (1618 slices) sets. All slices belonging to a given subject were assigned to exactly one of the train/validation/test folds. To simulate Fan-beam CT acquisitions, we implemented a fan-beam geometry with source-isocenter and isocenter-detector distances both set to $500$ mm. The detector size was set to $720$ mm with $1000$ pixels.
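
For orientation, a simple calculation for this fan-beam setup, assuming a flat detector centered on the central ray (not stated above), gives
\begin{equation}
    M = \frac{500 + 500}{500} = 2, \qquad \Delta_\text{det} = \frac{720 \text{ mm}}{1000} = 0.72 \text{ mm}, \qquad \gamma = 2 \arctan\left(\frac{360}{1000}\right) \approx 39.6^\circ,
\end{equation}
where $M$ is the magnification, $\Delta_\text{det}$ the detector pixel pitch, and $\gamma$ the full fan angle. Under this assumption, one detector pixel corresponds to $0.36$ mm at the isocenter.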

\input{figs/cbct_samples}

\subsubsection{Training and Optimization Details}
\paragraph{Model Optimization} Models were developed in PyTorch, using Adam with parameters $(\beta_1, \beta_2) = (0.9, 0.999)$ and $\epsilon = 1\mathrm{e}{-8}$. Experiments were carried out on NVIDIA Quadro 8000 GPUs, with a batch size of 8 for Cone-beam CT (using gradient accumulation) and a batch size of 16 for Fan-beam CT. For the CBCT experiments, we used a plateau learning rate scheduler with linear warm-up during the first 130 iterations and evaluation after every 130 iterations; the learning rate was reduced by a factor of two if no improvement was observed after 5 evaluations. For Fan-beam CT, the warm-up period was 1k iterations and evaluation took place every 10k iterations. Training was terminated once the learning rate became smaller than $10^{-5}$, which resulted in an iteration count between 33k and 34k for the CBCT models and between 700k and 850k for the Fan-beam CT models. The model with the best MAE evaluation metric was tested.

\paragraph{Random Augmentations} For the CBCT experiments, we randomly augmented the volumes by flipping left/right and top/bottom sides of the patient. For the Fan-beam CT experiment, we randomly augmented the slices by flipping the left/right side of the patient. 

\paragraph{Reconstruction Model Hyperparameter and Loss Function Choice}
Our implementation of $\partial$U-Net relies on the open-source implementation \cite{HauptmannCode} from the authors, where the base filter count was increased from 12 to 32 to increase expressive power while still fitting within the memory budget. We replaced batch normalization layers with instance normalization layers, since batch normalization resulted in unstable convergence. Our implementation of the Learned Primal-Dual (LPD) method replicates the original implementation and consists of 10 primal and 10 dual cells, each primal/dual cell being a stack of 3 convolutional layers with 32 channels in the first and second convolutional layers. To train both LPD and $\partial$U-Net, we used Mean Absolute Error as the loss function.



\subsubsection{Modulation Auxiliary Variables}
To modulate our convolutional networks in the Fan-beam CT setup, we utilize photon count $I_0$ as an auxiliary variable of each sample. More precisely, we let
% 
\begin{equation}
    \vec{z} := \log([I_0]) \in \mathbb R,    
\end{equation}
% 
where $I_0$ was sampled from a triangular distribution (see \Appendix{appendix3-triang-dist}) supported on the photon count range $[2.5k, 40k]$, with $4$ times higher density at $2.5k$ than at $40k$. For the Cone-beam CT experiment with variable photon count, we use a triangular distribution supported on $[10k, 50k]$, with $4$ times higher density at $10k$ than at $50k$. For the Cone-beam CT experiment with variable projection count, we let
% 
\begin{equation}
    \vec{z} := \log([\mathrm{N_{proj}}]) \in \mathbb R,    
\end{equation}
% 
where the projection count $\mathrm{N_{proj}}$ is sampled from $[237, 720]$ uniformly at random and the photon count is kept constant at $30k$.

\subsubsection{Results}
MAE is calculated between attenuation arrays converted to HU, while pSNR is computed directly on attenuation values. Results of the Cone-beam CT experiment for variable photon count are provided in Tab. \ref{tab:cbct-table} and for variable projection count in Tab. \ref{tab:cbct-table2}. Results of the Fan-beam CT experiments are presented in \Tab{ct-table}. Overall, we observe a consistent improvement of the $\partial$U-Net model equipped with modulated convolutions over its non-modulated counterpart, even though we use the most compact version of the modulator in the CBCT experiment. We present example axial slices from the test set with photon count $I_0=10k$ and 64 projections in Fig. \ref{fig:axial-large}, showing that the modulated network resolves soft tissue details better. In the Fan-beam CT experiment, the modulated versions of LPD also generally outperform the non-modulated baseline, although the degree of improvement is small. We conjecture that this is a consequence of LPD being able to `learn' the amount of noise from the noisy projections, since, unlike $\partial$U-Net, the dual blocks of LPD have direct access to the projection data.

\input{tabs/cbct_results}
