\documentclass{midl}
\usepackage{amsmath}

% The following packages will be automatically loaded:
% jmlr, amsmath, amssymb, natbib, graphicx, url, algorithm2e
% ifoddpage, relsize and probably more
% make sure they are installed with your latex distribution

\usepackage{mwe} % to get dummy images
\usepackage{booktabs}
\usepackage{multirow}
% \usepackage{subcaption}

\jmlrvolume{-- Under Review}
\jmlryear{2026}
\jmlrworkshop{Full Paper -- MIDL 2026 submission}
\editors{Under Review for MIDL 2026}

\title[Sparse Subspace Diffusion Model]{Sparse Subspace Diffusion Model for Physically Consistent Accelerated MRI Reconstruction}

 % Use \Name{Author Name} to specify the name.
 % If the surname contains spaces, enclose the surname
 % in braces, e.g. \Name{John {Smith Jones}} similarly
 % if the name has a "von" part, e.g \Name{Jane {de Winter}}.
 % If the first letter in the forenames is a diacritic
 % enclose the diacritic in braces, e.g. \Name{{\'E}louise Smith}

 % Two authors with the same address
 % \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\and
 %  \Name{Author Name2} \Email{xyz@sample.edu}\\
 %  \addr Address}

 % Three or more authors with the same address:
 % \midlauthor{\Name{Author Name1} \Email{an1@sample.edu}\\
 %  \Name{Author Name2} \Email{an2@sample.edu}\\
 %  \Name{Author Name3} \Email{an3@sample.edu}\\
 %  \addr Address}


% Authors with different addresses:
% \midlauthor{\Name{Author Name1} \Email{abc@sample.edu}\\
% \addr Address 1
% \AND
% \Name{Author Name2} \Email{xyz@sample.edu}\\
% \addr Address 2
% }

%\footnotetext[1]{Contributed equally}

% More complicate cases, e.g. with dual affiliations and joint authorship
\midlauthor{
\Name{Xiangyao Deng\nametag{$^{1}$}} \Email{xiangyao.deng@monash.edu}\\
\Name{Zhiqiang Shen\nametag{$^{1}$}} \Email{xxszqyy@gmail.com}\\
\Name{Sanuwani Dayarathna\nametag{$^{1}$}} \Email{sanuwani.hewamunasinghe@monash.edu}\\
\Name{Juan P. Meneses\nametag{$^{2}$}} \Email{juanpablo.menesescasanova@monash.edu}\\
\Name{Sergio Uribe\nametag{$^{2}$}} \Email{sergio.uribe@monash.edu}\\
\Name{Zhaolin Chen\nametag{$^{1,3}$}} \Email{zhaolin.chen@monash.edu}\\
\addr $^{1}$ Department of Data Science and AI, Monash University, Melbourne, Australia \AND
\addr $^{2}$ Department of Medical Imaging and Radiation Sciences, Monash University, Melbourne, Australia \AND
\addr $^{3}$ Monash Biomedical Imaging, Monash University, Melbourne, Australia
}

\begin{document}

\maketitle

\begin{abstract}
Magnetic resonance imaging (MRI) provides excellent soft-tissue contrast but suffers from long acquisition times. Accelerated MRI alleviates this issue by undersampling k-space, but this approach introduces aliasing artifacts and information loss. Traditional compressed sensing methods exploit handcrafted sparse priors, whereas deep learning approaches learn data-driven priors, but both often struggle at high acceleration rates due to severe information degradation. 
This study introduces a diffusion-based reconstruction framework, termed the Sparse Subspace Diffusion Model (SSDM), that performs MRI reconstruction within an adaptive sparse space. The proposed approach integrates coupled convolutional dictionary learning with diffusion-based generative modeling to decompose MR images into multiple orthogonal sparse subspaces and reconstruct them under measurement-consistency constraints. This formulation enables diffusion modeling in a physically meaningful latent space, bridging the gap between data-driven learning and physics-guided reconstruction. 
Experimental results on the fastMRI dataset demonstrate that the proposed method achieves higher reconstruction quality than existing diffusion- and sparsity-based approaches, with better preservation of fine details and suppression of artifacts across various acceleration factors.
\end{abstract}

\begin{keywords}
Magnetic Resonance Imaging (MRI), fastMRI, Diffusion model, Convolutional Dictionary Learning, Physics-informed learning.
\end{keywords}

\section{Introduction}
Magnetic resonance imaging (MRI) provides excellent soft-tissue contrast but is limited by lengthy acquisition times. Accelerating acquisition by undersampling k-space causes aliasing artifacts and information loss. Compressed sensing (CS)~\cite{lustig2007sparse} constrains image reconstruction with sparse priors to recover high-quality images from incomplete data, but its reliance on handcrafted transforms limits adaptability across different scanning scenarios. In recent years, deep learning-based MRI reconstruction methods have significantly improved reconstruction performance by learning data-driven priors~\cite{Schlemper_2017_DeepCascadeConvolutionalNeuralNetworksDynamicMRImageReconstruction}. Generative models have also been applied to MRI reconstruction, achieving high-fidelity inversion by learning image priors and incorporating measurement-consistency constraints~\cite{chung2022score, chung2024decomposed, chung2022diffusion, wang2022zero}. Among these, measurement-consistent unconditional diffusion models require no paired training data and can generalize across multiple sampling patterns, which makes them particularly attractive for accelerated MRI. 

In accelerated MRI, aliasing artifacts arise from severe signal mixing caused by undersampling, posing significant challenges for both discriminative and generative reconstruction models. Under such conditions, energy leakage from the true signal leads to strong entanglement between aliasing components and real anatomical structures, rendering reliable signal recovery highly ill-posed. Meanwhile, MR images lie in a complex representation space comprising multiple features or frequency components. Conventional deep learning methods~\cite{esser2021taming} typically rely on convolutional operations to map images into highly entangled latent representations, where different feature components lack sufficient separability. As a result, these models often produce perceptually plausible but feature-inconsistent reconstructions, without guaranteeing faithful recovery of the underlying signal components~\cite{antun2020instabilities}. Motivated by these observations, we adopt a divide-and-conquer strategy that explicitly decomposes the complex image representation into multiple sparse subspaces and performs reconstruction independently within each subspace. 
In prior work, Sub-DM~\cite{guan2024sub} places different frequency components in the channel dimension via wavelet decomposition and jointly models them within a single diffusion trajectory, primarily to reshape the data geometry and stabilize the score field. Generative subspace diffusion~\cite{jing2022subspace} defines subspaces as noise-scale-dependent nested supports that progressively expand during sampling. In contrast, we treat sparse subspaces as independent learning instances, reducing the modeling burden on any single instance. Each image is decomposed into multiple sparse subspaces that are organized as independent samples along the batch dimension and modeled in parallel by a shared diffusion backbone, while their evolution is explicitly coupled through cross-batch attention.
Specifically, we construct independent sparse subspaces via coupled convolutional dictionary learning~\cite{yang2012coupled}, transforming the original high-complexity global modeling problem into a set of low-complexity subspace reconstruction tasks. To maintain global structural consistency, we further introduce a lightweight cross-subspace attention mechanism into our backbone to facilitate information exchange among subspaces during reconstruction.

Based on this design, we introduce the Sparse Subspace Diffusion Model (SSDM), a two-stage MRI reconstruction framework that integrates coupled convolutional dictionary learning with diffusion-based generative modeling in an adaptive sparse space. Through a measurement-consistency-guided reverse diffusion process, the proposed framework achieves effective subspace-wise reconstruction and enables high-quality MRI reconstruction under severe undersampling. Experimental results on the fastMRI dataset demonstrate that, compared to existing diffusion-based reconstruction methods, the proposed framework improves both reconstruction fidelity and artifact suppression. 

\begin{figure}[!t]
  \centering
  \includegraphics[width=\textwidth]{Figures/Overview.pdf}
  \caption{\textbf{Overview of the proposed sparse subspace diffusion model.}
 (a) The sparse transformation of the undersampled image provides physics-based guidance to the diffusion process through the measurement consistency constraint; (b) The unconditional diffusion model is trained on fully sampled subspaces in the forward process and, during the reverse process, generates subspaces guided by physics-informed priors from the undersampled image.}
  \label{fig:overview}
\end{figure}

\section{Methods}
The forward model of MRI reconstruction is written as follows,
\begin{equation}\label{eq:1}
    y = Ax + \epsilon.
\end{equation}
% The measurement matrix $A \in \mathbb{C}^{m \times n}$, where $m \ll n$, models undersampled acquisition,  $A = M\mathcal{F}$ where $\mathcal{F}$ represents the discrete Fourier transform and ${M}$ is a binary sampling mask in k-space. 
The measurement matrix $A = M\odot\mathcal{F} \in \mathbb{C}^{m \times n}$ models undersampled acquisition, where $m \ll n$, $\mathcal{F}$ represents the discrete Fourier transform, and ${M}$ is a binary sampling mask in k-space. 
Here, $x$ is the desired image, $\epsilon$ is complex Gaussian noise, and $y$ is the measured signal in k-space.  
To address the complexity of MRI image representation, we propose a sparse subspace diffusion model, which includes 1) coupled convolutional dictionary learning to decompose each image into multiple sparse and independent subspaces [Fig.~\ref{fig:overview}(a)] and 2) diffusion modeling in subspace to train an unconditional diffusion model [Fig.~\ref{fig:overview}(b)]. 

\paragraph{Convolutional dictionary learning:} We employ convolutional dictionary learning (CDL) as an adaptive sparse transform that maps MR images acquired at arbitrary acceleration factors into a unified multi-channel sparse domain. We unfold the dictionary learning process through the Learned Iterative Shrinkage-Thresholding Algorithm (LISTA)~\cite{gregor2010learning}. To enhance the model's adaptability, we replace the original learnable soft-thresholding function with a spatial attention module that adaptively determines the threshold for each pixel based on the input signal. This design allows the dictionary learning model to perform sparse image decomposition under arbitrary acceleration factors.
Each sparse channel can then be transformed back to the image space through the corresponding single convolutional kernel in the reconstruction module $\mathcal{R}$, forming a set of sparse subspaces that capture distinct structural features. We define the $i$-th subspace $s^{(i)}\in\mathbb{C}^{C\times{H}\times{W}}$ as
\begin{equation}\label{eq:2}
    s^{(i)} = \mathcal{R}^{(i)}\big(\mathcal{D}(x)\big), \forall i \in [1,L],
\end{equation}
where $\mathcal{D}$ denotes the CDL-based adaptive sparse transform~\cite{Deng_2020_DeepcoupledISTAnetworkmulti-modalimagesuper-resolution, yang2025improving} that transforms an image into the sparse domain. $\mathcal{R}$ represents the reconstruction module that includes a transposed convolution operation to transform the sparse representation back to the image domain, and $\mathcal{R}^{(i)}$ corresponds to the transposed convolutional operation with the $i$-th convolutional kernel. 

To establish a coupling relationship between the undersampled and fully sampled subspaces in k-space based on the measurement-observed locations, we introduce a channel-wise coupling constraint. This coupling is governed by the sampling mask, which enforces alignment between the two subspaces at observed locations while actively suppressing the signal generation at unobserved positions. Furthermore, an orthogonality constraint is imposed among the sparse subspaces to enforce subspace decoupling and eliminate redundant information in the sparse representation. The overall CDL algorithm is given as,
\begin{equation}\label{eq:3}
    \min_{\mathcal{R}, \mathcal{D}} \big\|\mathcal{R}(\mathcal{D}(x))-x\big\|_{2}^{2} + \big\|\mathcal{D}(x)\big\|_{1} + \mathcal{L}_{\mathcal{C}}+\mathcal{L}_{\perp}.
\end{equation}

The orthogonality constraint $\mathcal{L}_{\perp}$ penalizes the pairwise cosine similarity among the subspaces $s^{(i)}$, treated as vectors. The coupling constraint $\mathcal{L}_{\mathcal{C}}$ is written as,
\begin{equation}
\mathcal{L}_{\mathcal{C}}
= \mathcal{L}_{\Omega} + \mathcal{L}_{u},
\label{eq:cpl_total}
\end{equation}
which includes two terms. The first term, $\mathcal{L}_{\Omega}$, couples the undersampled and fully sampled images in k-space at the observed locations, channel by channel, and can be written as, 

\begin{equation}\label{eq:5}
    \mathcal{L}_{\Omega} = \big\| M \odot \big(\mathcal{F}({s}_u) -\mathcal{F}(s_f)\big) \big\|_2^2
\end{equation}
 where $s_u$ and $s_f$ denote the subspaces of the undersampled and fully sampled images, respectively. To further constrain the representation in each sparse subspace, we introduce $\mathcal{L}_{u}$, which combines a measurement-consistency term on the summed subspaces with a penalty on spurious energy in the unobserved k-space regions:
\begin{equation}\label{eq:6}
    \mathcal{L}_{u}=\Big\| A\Big(\sum_{i=1}^{L}s^{(i)}\Big) - y \Big\|_{2}^{2}+\lambda_u \sum_{i=1}^{L} \big\| (1 - M) \odot \mathcal{F}(s^{(i)}) \big\|_2^2,
\end{equation}
where $\lambda_{u}$ controls the strength of energy suppression in the unobserved k-space regions for each subspace. The first-stage CDL is fully trained before the second-stage diffusion modeling.
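
For concreteness, one possible instantiation of the orthogonality constraint $\mathcal{L}_{\perp}$ is sketched here. It is given only as an illustration, under the assumption that each subspace is vectorized and compared pairwise; the exact form and normalization are not specified above.
\begin{equation*}
    \mathcal{L}_{\perp} = \frac{2}{L(L-1)} \sum_{i=1}^{L}\sum_{j=i+1}^{L}
    \frac{\big|\langle s^{(i)}, s^{(j)}\rangle\big|}{\big\|s^{(i)}\big\|_{2}\,\big\|s^{(j)}\big\|_{2}}
\end{equation*}
Under this assumed form, the loss averages the absolute cosine similarity over all $L(L-1)/2$ subspace pairs and vanishes exactly when every pair of (nonzero) subspaces is mutually orthogonal.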




\paragraph{Diffusion modeling in subspace:} In the second stage, we model the $i$-th subspace $s^{(i)}$ using a diffusion model,
% \begin{equation}
%     q(s_{1:T}|s_{0})=\prod_{i=1}^{L}\prod_{t=1}^{T}q(s_{t}^{(i)}|s_{t-1}^{(i)})
% \end{equation}
\begin{equation}
p_\theta(s^{(i)}_{0})
=\int{p_{\theta}(s^{(i)}_{T})}\prod _{t=1}^{T}p_{\theta, t}\!\left(s^{(i)}_{t-1}\mid s^{(i)}_{t}\right)
    \, d s^{(i)}_{1:T}.
\end{equation}
The Markovian forward process admits the closed-form marginal
\begin{equation}
    q(s_{t}^{(i)}\mid s_{0}^{(i)})=\mathcal{N}\big(s_{t}^{(i)};\sqrt{\bar{\alpha}_t}\,s_{0}^{(i)}, (1-\bar{\alpha}_t)I\big),\quad \bar{\alpha}_t=\prod^{t}_{\tau=1}\alpha_{\tau},
\end{equation}
where the noise schedule $\alpha_t$ is a decreasing sequence in $t$. The diffusion model is trained with the objective
\begin{equation}
    \min_{\theta}\;
\sum_{i=1}^{L}\mathbb{E}_{\epsilon^{(i)} \sim \mathcal{N}(0,{I})}
\big[\, \|\, \epsilon_{\theta}(s^{(i)}_{t}, t) - \epsilon^{(i)} \,\|_2^2 \,\big] + \lambda\mathcal{L}_{r},
\end{equation}
where $\epsilon_{\theta}(s^{(i)}_{t}, t)$ is the noise predicted by the model at time step $t$ for the $i$-th subspace, $\epsilon^{(i)}$ is the noise actually added, and $\mathcal{L}_{r}=\big\|\sum^{L}_{i=1}\hat{s}^{(i)}_{0}-x_{f}\big\|_{2}^{2}$ penalizes the deviation of the summed subspace estimates from the fully sampled image $x_f$. Here, $\hat{s}^{(i)}_0=\big(s_{t}^{(i)}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\theta}(s^{(i)}_{t}, t)\big)/\sqrt{\bar{\alpha}_{t}}$ is the clean-subspace estimate implied by the forward marginal.

In the sampling process, we seek subspaces $s$ that are consistent with the measurement signal $y$,
\begin{equation}\label{eq:10}
    \tilde{s} = \arg\min_{s}\Big\| y - A\Big(\sum^{L}_{i=1}s^{(i)}\Big)\Big\|_{2}^{2},
\end{equation}
where $\tilde{s}$ denotes the set of subspaces $\tilde s^{(1)}, \ldots, \tilde s^{(L)}$. While Eq.~(\ref{eq:10}) ensures global fidelity via the measurement-consistency constraint, applying it directly to decoupled subspaces suffers from a one-to-many mapping ambiguity, since many different decompositions sum to the same image. This ambiguity often leads to unconstrained drift in individual generation trajectories. 
 Here, we adopt a coarse-to-fine strategy to address this issue. We first establish a preliminary subspace-consistent initialization, as follows:
\begin{equation}\label{eq:11}
    \tilde{s}^{(i)} = \arg\min_{s^{(i)}}\Big(\big\|b^{(i)}-As^{(i)}\big\|_{2}^{2} + \big\|s^{(i)}-\hat{s}^{(i)}_{0}\big\|_{2}^{2}\Big),
\end{equation}
where $b^{(i)}$ is the k-space of the $i$-th subspace extracted from the undersampled image and $\hat{s}^{(i)}_{0}$ is computed from Tweedie's formula~\cite{song2020denoising}. This step anchors the optimization by initializing each subspace to its corresponding measured signal before applying the global constraint in Eq.~(\ref{eq:10}). 
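
As a brief sanity check on Eq.~(\ref{eq:11}), the minimizer can be written in closed form under some simplifying assumptions made only for this sketch: a single-coil operator $A = M\mathcal{F}$ with a unitary $\mathcal{F}$, a binary mask $M$ acting element-wise on the full k-space grid, and a zero-filled $b^{(i)}$. Setting the gradient to zero then gives the normal equations
\begin{equation*}
    (A^{H}A + I)\,\tilde{s}^{(i)} = A^{H}b^{(i)} + \hat{s}^{(i)}_{0}
    \quad\Longrightarrow\quad
    \mathcal{F}\tilde{s}^{(i)} = \frac{M\odot b^{(i)} + \mathcal{F}\hat{s}^{(i)}_{0}}{1 + M},
\end{equation*}
where the division is element-wise. Under these assumptions, observed k-space locations average the measurement with the denoised estimate, while unobserved locations keep the denoised estimate unchanged.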

In addition, to mitigate the risk of converging to suboptimal local minima, we use the stochastic resampling strategy from~\cite{song2024solving}, as expressed in the following formulation:

\begin{equation}
    p\big({s}^{(i)}_{t}\mid\tilde{s}^{(i)}_{t},\tilde{s}^{(i)}_{0}(y),y\big)
        = \mathcal{N}\Bigg(
            \frac{\sigma_t^{2}\sqrt{\bar{\alpha}_{t}}\,\tilde{s}^{(i)}_{0}
            + (1-\bar{\alpha}_{t})\, \tilde s^{(i)}_{t}}
            {\sigma_t^{2} + (1-\bar{\alpha}_{t})},
            \quad
            \frac{\sigma_t^{2}(1-\bar{\alpha}_{t})}
            {\sigma_t^{2} + (1-\bar{\alpha}_{t})} I
        \Bigg)
\end{equation}
where $\sigma_t^2= \gamma \cdot\frac{1-\alpha_{t-1}}{\alpha_{t}}\cdot\frac{1-\alpha_{t}}{\alpha_{t-1}}$, $\gamma$ is a hyperparameter that controls $\sigma_t^{2}$, and $s_{t}^{(i)}$ is the posterior sample drawn given $\tilde{s}^{(i)}_{t}$ and $\tilde{s}^{(i)}_{0}(y)$, as shown in Algorithm~\ref{alg:ours}. The physics-informed guided diffusion generates the subspace ${s}^{(i)}$, 
and the final reconstruction is obtained by summing all subspaces as 
${\hat{x}} = \sum_{i=1}^{L}{s}^{(i)}$.
\section{Experiments}
% \textbf{Experiments and Results}
% \subsection{Experiment Setting}

\textbf{Experimental Setup.} The experiments are conducted using the \textit{fastMRI} dataset~\cite{zbontar2018fastmri}, 
which provides multi-coil \textit{k}-space measurements for brain images and single-coil measurements for knee images. We transform the multi-coil brain data into coil-combined images following the procedure in~\cite{Sriram_2020_End-to-EndVariationalNetworksAcceleratedMRIReconstruction}. 
Specifically, the coil-combined image $x$ is obtained from the coil images $x_j$ and the corresponding sensitivity maps $S_j$ as
    $x
    = \sum_{j=1}^{N} S_j^{*} x_j,$
where $N$ is the number of coils and $S_j$ denotes the coil sensitivity map estimated via the \textit{ESPIRiT} algorithm~\cite{uecker2014espirit}, with an auto-calibration region (ACR) factor of 8\%. 
The brain and knee images are cropped to a spatial resolution of $320\times320$, and the complex-valued images are split into real and imaginary components, resulting in an input size of $2\times320\times320$ for the experiments. All images are normalized so that the complex magnitude lies in $[0, 1]$; the real and imaginary channels therefore lie in $[-1, 1]$. Two acceleration factors are evaluated (\(8\times\) and \(16\times\)). 
For all acceleration factors, the ACR fraction was set to 4\%. 
The corresponding sampling masks were generated following the official \textit{fastMRI} sampling protocol~\cite{zbontar2018fastmri}. Specifically, a fully sampled central k-space region was first retained as the ACR, and the remaining k-space lines outside the ACR were undersampled using equispaced or random patterns for the brain and knee images. For each module, we train the model for 100 epochs on the entire brain and knee training dataset. Testing is conducted on 200 randomly selected samples from the brain and knee validation dataset, with one slice extracted from each subject. All modules are trained and tested on a single NVIDIA A100 GPU.


\textbf{Evaluation Metrics.} 
Reconstruction quality was quantitatively evaluated using three standard metrics: 
peak signal-to-noise ratio (PSNR) in dB, structural similarity index (SSIM), 
and mean absolute error (MAE). Higher PSNR and SSIM values and lower MAE indicate better reconstruction performance. All metrics were computed on magnitude images for both the brain and knee datasets. For the brain dataset, the evaluation was restricted to the foreground brain region, whereas for the knee dataset, metrics were computed over the entire image. The final results were averaged over the test set.
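
For reference, with $N$ evaluated pixels, reference magnitude image $|x|$, and reconstructed magnitude image $|\hat{x}|$, the standard definitions of PSNR and MAE can be written as
\begin{equation*}
    \mathrm{PSNR} = 10\log_{10}\frac{P^{2}}{\frac{1}{N}\sum_{n=1}^{N}\big(|\hat{x}_n|-|x_n|\big)^{2}},
    \qquad
    \mathrm{MAE} = \frac{1}{N}\sum_{n=1}^{N}\Big|\,|\hat{x}_n|-|x_n|\,\Big|,
\end{equation*}
where $P$ denotes the peak value. Taking $P=1$ would be the natural choice for magnitudes normalized to $[0,1]$, although this value is an assumption of this sketch rather than a setting stated above.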


\begin{algorithm2e}[!t]
\caption{SSDM MRI Reconstruction}
\label{alg:ours}
\DontPrintSemicolon
\LinesNotNumbered
\SetInd{0.5em}{1.5em}

\KwIn{Measurements $y$. Subspace measurements $[b^{(1)}, \ldots, b^{(L)}]$. Sampling mask $M$. Forward operator $A(\cdot,M)$. Stochastic resampling hyperparameter $\gamma$.
Noise prediction network $\epsilon_\theta(\cdot,t)$. Noise schedule $\alpha_t$. 
DDIM parameter $\eta$. Set of resampling time steps $C$.}
\KwOut{Reconstructed image $\hat{x}_0$}

$s^{(i)}_T\sim \mathcal{N}(0,I), \quad{i=1,...,L}$\hfill{$\triangleright$ Initial subspaces}

\For{$t \leftarrow T-1$ \KwTo $0$}{
    \For{$i \leftarrow 1$ \KwTo $L$ \KwSty{in parallel}}{
        $\hat\epsilon^{(i)}_t = \epsilon_{\theta}(s^{(i)}_t, t)$\hfill{$\triangleright$ Predict the noise at time step $t$}

        $\hat s^{(i)}_0 = 
            \dfrac{s^{(i)}_{t}-\sqrt{1-\bar\alpha_{t}}\,\hat\epsilon^{(i)}_t}{\sqrt{\bar\alpha_{t}}}$\hfill{$\triangleright$ Compute $\hat s_0$ using Tweedie's formula}
    
        % $\tilde s^{(i)}_0 = 
        %     \mathrm{CG}({A}, b^{(i)}, \hat s^{(i)}_0)$\hfill{$\triangleright$ Data consistent in Subspaces}
        $\tilde{s}^{(i)}_{0} = \arg\min_{s^{(i)}}\big(\|b^{(i)}-As^{(i)}\|_{2}^{2} + \|s^{(i)}-\hat{s}^{(i)}_{0}\|_{2}^{2}\big)$\hfill{$\triangleright$ Data consistency in subspaces}
        
        $\tilde{s}^{(i)}_t = 
            \sqrt{\bar{\alpha}_t}\,\tilde s^{(i)}_{0}
            + \sqrt{1-\bar\alpha_t}\left(\eta\epsilon_t + \sqrt{\,1-\eta^2\,}\hat\epsilon_t\right)$\hfill{$\triangleright$ Unconditional DDIM step}
        }
        \If{$t \in C$}{
            \For{$i \leftarrow 1$ \KwTo $L$ \KwSty{in parallel}}{
                $\tilde s^{(i)}_0(y) = 
                \arg\min_{s^{(i)}} \big\|y - A\big(\sum_{j=1}^{L} s^{(j)},M\big)\big\|_2^2$\hfill{$\triangleright$ Solve in image domain, initialized at $\tilde{s}^{(i)}_0$}
    
                $s^{(i)}_t = 
                    \mathrm{StochasticResample}(\tilde s^{(i)}_0(y), \tilde{s}^{(i)}_t, \gamma)$\hfill{$\triangleright$ Map back to $t$}
            }
    }
    \Else{
        $s^{(i)}_t = \tilde{s}^{(i)}_t$\hfill{$\triangleright$ No resampling}
    }
}

$\hat{x}_0 \leftarrow \sum_{i=1}^{L} s^{(i)}_0$ \hfill{$\triangleright$ Output reconstruction}
\end{algorithm2e}

\textbf{Implementation Details.} 
The training of SSDM follows a two-stage process. In the first stage, we train a convolutional dictionary learning (CDL) module for self-reconstruction, following~\cite{yang2025improving}. The CDL module is implemented as a LISTA algorithm unfolded into 12 blocks. In the forward process of the CDL, the input image is transformed into an 8-channel sparse domain and then reconstructed back to the image domain via a reconstruction module $\mathcal{R}$, as shown in Eq.~(\ref{eq:3}). The $j$-th block updates the sparse code $z^{(j)}$ as
\begin{equation}
    z^{(j+1)}=\text{soft}_{\rho(j)}
\big(z^{(j)}+\mathcal{W}_{\text{dec}}x -\mathcal{W}_{\text{dec}}\mathcal{W}_{\text{conv}}z^{(j)}+\mathcal{Q}_{2}^{(j)}\phi({\mathcal{Q}_{1}^{(j)}}z^{(j)})\big),
\end{equation}
where $\mathcal{W}_{\text{dec}}$ and $\mathcal{W}_{\text{conv}}$ are convolution operations,
$\mathcal{Q}_{2}^{(j)}$ and $\mathcal{Q}_{1}^{(j)}$ are $1 \times 1$ convolution operations along the channel dimension, and $\phi$ is the \texttt{imGeLU} function. The threshold $\rho(j)$ is produced by a learnable spatial attention module that determines an adaptive threshold for each pixel based on the input signal, enabling the CDL to adaptively process images acquired with any sampling pattern. The CDL thus provides unified sparse representations for both fully sampled and undersampled images at any acceleration factor.

In the second stage, we train an unconditional DDPM on the fully sampled subspaces extracted from the frozen CDL module, adopting the standard linear noise schedule of DDPM~\cite{ho2020denoising}. Specifically, these subspaces are arranged along the batch dimension as independent training samples, with the real and imaginary parts of the complex-valued data placed in separate channels. Using a U-Net~\cite{dhariwal2021diffusion} as the backbone, we introduce a cross-batch attention mechanism inspired by~\cite{luo20253denhancer} to bridge the independent subspaces. To balance performance and computational cost, this attention module is placed only in the middle block and the final encoder block. 
During inference, the undersampled images are first decomposed into sparse subspaces using the CDL module. These subspaces of the undersampled image encode physics-related constraints and guide the pretrained diffusion model to generate the corresponding subspaces of the reconstructed image. The inference process is detailed in Algorithm~\ref{alg:ours}. In this algorithm, Eq.~(\ref{eq:11}) is solved using the conjugate gradient method, whereas Eq.~(\ref{eq:10}) is solved via standard gradient descent, with gradients computed automatically through PyTorch's autograd mechanism.

\textbf{Baseline Methods.} We compared the proposed method with four state-of-the-art diffusion-based methods: the decomposed diffusion sampler (DDS)~\cite{chung2024decomposed}, diffusion posterior sampling (DPS)~\cite{chung2022diffusion}, the denoising diffusion null-space model (DDNM)~\cite{wang2022zero}, and the score-based diffusion model for MRI (Score-MRI)~\cite{chung2022score}. We further compared with traditional compressed sensing with a total variation (TV) constraint.

\begin{figure}[!t]
  \centering
  \includegraphics[width=\linewidth, keepaspectratio]{Figures/T2Brain.pdf}
  \caption{
    Testing T2-weighted (T2W) brain reconstructions with corresponding error maps
    at acceleration factors of $8\times$ and $16\times$.
    The proposed method demonstrates improved structural preservation and reduced aliasing artifacts compared with baseline methods. All error maps are scaled to [0, 0.3] for display.
  }
  \label{fig:t2brain}
\end{figure}

\begin{figure}[!t]
  \centering
  \includegraphics[width=\linewidth, keepaspectratio]{Figures/T1Brain.pdf}
  \caption{
    Testing T1-weighted (T1W) brain reconstructions with corresponding error maps
    at acceleration factors of $8\times$ and $16\times$.
    The proposed method demonstrates improved structural preservation and reduced aliasing artifacts compared with baseline methods. All error maps are scaled to [0, 0.3] for display.
  }
  \label{fig:t1brain}
\end{figure}

\begin{figure}[!t]
  \centering
  \includegraphics[width=\linewidth, keepaspectratio]{Figures/T2Knee.pdf}
  \caption{
    Testing T2-weighted (T2W) knee image reconstructions with corresponding error maps
    at acceleration factors of $8\times$ and $16\times$.
    The proposed method demonstrates improved structural preservation and reduced aliasing artifacts compared with baseline methods. All error maps are scaled to [0, 0.3] for display.
  }
  \label{fig:t2knee}
\end{figure}

\section{Results}
 The results compare the baselines and the proposed method at acceleration factors of $8\times$ and $16\times$. Representative brain and knee examples are given in Figs.~\ref{fig:t2brain}, \ref{fig:t1brain}, and~\ref{fig:t2knee}, and quantitative evaluations using PSNR, SSIM, and MAE are reported in Tables~\ref{tab:table1} and~\ref{tab:table2}. Qualitative comparisons for brain MRI are shown in Figs.~\ref{fig:t2brain} and~\ref{fig:t1brain}. At $8\times$ acceleration, baseline diffusion reconstruction methods generate visually reasonable results but tend to attenuate fine gyral patterns and blur tissue interfaces (Fig.~\ref{fig:t2brain}). Under more aggressive $16\times$ undersampling, these limitations become more evident: DDS and Score-MRI exhibit increased smoothing of cortical structures and distinct structural artifacts. As reflected in the corresponding error maps in Fig.~\ref{fig:t2brain}, errors in these baseline methods are highly correlated with anatomical boundaries. In contrast, SSDM better preserves cortical continuity and fine-grained tissue contrast in both T2- and T1-weighted brain images. 
 
 Results for the T2-weighted knee experiment are shown in Fig.~\ref{fig:t2knee}. Conventional reconstruction methods and DPS exhibit pronounced over-smoothing and residual aliasing artifacts at both $8\times$ and $16\times$ acceleration, leading to blurred cartilage boundaries and loss of fine anatomical details. While the other diffusion-based methods (DDNM, Score-MRI, and DDS) suppress aliasing effectively, they tend to introduce structural flattening or texture over-smoothing, which becomes particularly apparent at $16\times$ acceleration along cartilage surfaces and joint contours. By comparison, SSDM maintains sharp cartilage boundaries and joint morphology across both acceleration settings, achieving closer visual correspondence with the reference images. The accompanying error maps further show that SSDM produces smaller residual errors. Unlike the baseline methods, which exhibit prominent structure-correlated errors outlining the anatomy, SSDM's residuals are visibly lower and lack distinct structural patterns.


\begin{table}[!t]
\centering
\caption{Quantitative comparison of methods for brain images (mean $\pm$ std) at $8\times$ and $16\times$ acceleration.}

\setlength{\tabcolsep}{3pt}        % reduce column spacing
\renewcommand{\arraystretch}{0.9}  % reduce row spacing
\small                              % use a smaller font for the whole table

\begin{tabular}{lcccccc}
\toprule
\multirow{2}{*}{\textbf{Method}} 
 & \multicolumn{2}{c}{\textbf{PSNR (dB)}} 
 & \multicolumn{2}{c}{\textbf{SSIM}} 
 & \multicolumn{2}{c}{\textbf{MAE ($\times10^{-3}$)}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
 & \textbf{$8\times$} & \textbf{$16\times$}
 & \textbf{$8\times$} & \textbf{$16\times$}
 & \textbf{$8\times$} & \textbf{$16\times$} \\
\midrule
TV         & 20.18 $\pm{\ {2.9}}$& 18.30 $\pm{\ {2.6}}$& 0.757 $\pm{\ {0.064}}$& 0.734 $\pm{\ {0.066}}$& 57.0 $\pm{\ {18.2}}$& 55.6 $\pm{\ {19.3}}$\\
DPS        & 26.67 $\pm{\ {1.6}}$& 21.37 $\pm{\ {1.5}}$& 0.805 $\pm{\ {0.044}}$& 0.798 $\pm{\ {0.041}}$& 54.3 $\pm{\ {16.4}}$& 55.1 $\pm{\ {16.2}}$\\
DDNM       & 31.41 $\pm{\ {1.8}}$& 23.18 $\pm{\ {1.8}}$& 0.836 $\pm{\ {0.039}}$& 0.791 $\pm{\ {0.038}}$& 49.2 $\pm{\ {6.2}}$& 54.1 $\pm{\ {14.2}}$\\
Score-MRI  & 32.08 $\pm{\ {2.6}}$& 27.78 $\pm{\ {2.7}}$& 0.813 $\pm{\ {0.052}}$& 0.773 $\pm{\ {0.049}}$& 54.5 $\pm{\ {19.7}}$& 58.9 $\pm{\ {18.7}}$\\
DDS        & 33.65 $\pm{\ {2.4}}$ & 31.61 $\pm{\ {2.3}}$& 0.923 $\pm{\ {0.025}}$ & 0.900 $\pm{\ {0.030}}$& 41.2 $\pm{\ {10.9}}$ & 49.7 $\pm{\ {13.9}}$\\
\textbf{SSDM} & \textbf{33.85} $\pm{\ \textbf{2.5}}$ & \textbf{32.02} $\pm{\ \textbf{2.6}}$& \textbf{0.927} $\pm{\ \textbf{0.025}}$ & \textbf{0.905} $\pm{\ \textbf{0.033}}$& \textbf{40.5} $\pm{\ \textbf{10.7}}$ & \textbf{48.1} $\pm{\ \textbf{12.9}}$\\
\bottomrule
\end{tabular}
\label{tab:table1}
\end{table}




\begin{table}[!t]
\centering
\caption{Quantitative comparison of methods for knee images (mean $\pm$ std) at $8\times$ and $16\times$ acceleration.}

\setlength{\tabcolsep}{3pt}      % reduce column spacing (default 6pt)
\renewcommand{\arraystretch}{0.9} % reduce row spacing (default 1.0)
\small                            % use a smaller font for the whole table

\begin{tabular}{lcccccc}
\toprule
\multirow{2}{*}{\textbf{Method}} 
 & \multicolumn{2}{c}{\textbf{PSNR (dB)}} 
 & \multicolumn{2}{c}{\textbf{SSIM}} 
 & \multicolumn{2}{c}{\textbf{MAE ($\times10^{-3}$)}} \\
\cmidrule(lr){2-3} \cmidrule(lr){4-5} \cmidrule(lr){6-7}
 & \textbf{$8\times$} & \textbf{$16\times$}
 & \textbf{$8\times$} & \textbf{$16\times$}
 & \textbf{$8\times$} & \textbf{$16\times$} \\
\midrule
TV         & 21.50 $\pm{\ {3.8}}$& 21.30 $\pm{\ {4.7}}$& 0.680 $\pm{\ {0.235}}$& 0.671 $\pm{\ {0.248}}$& 40.5 $\pm{\ {26.6}}$& 41.5 $\pm{\ {27.4}}$\\
DPS        & 24.51 $\pm{\ {2.7}}$& 21.89 $\pm{\ {3.9}}$& 0.779 $\pm{\ {0.138}}$& 0.723 $\pm{\ {0.169}}$& 30.1 $\pm{\ {25.9}}$& 28.2 $\pm{\ {25.3}}$\\
DDNM       & 29.35 $\pm{\ {3.6}}$& 27.97 $\pm{\ {3.5}}$& 0.787 $\pm{\ {0.135}}$& 0.775 $\pm{\ {0.139}}$& 26.0 $\pm{\ {14.0}}$& 26.5 $\pm{\ {11.5}}$\\
Score-MRI  & 29.12 $\pm{\ {3.1}}$& 28.37 $\pm{\ {3.1}}$& 0.786 $\pm{\ {0.134}}$& 0.766 $\pm{\ {0.136}}$& 28.7 $\pm{\ {14.6}}$& 27.4 $\pm{\ {12.3}}$\\
DDS        & 29.45 $\pm{\ {3.4}}$& 29.13 $\pm{\ {3.6}}$& 0.795 $\pm{\ {0.148}}$& 0.780 $\pm{\ {0.146}}$& 25.2 $\pm{\ {15.6}}$& 26.2 $\pm{\ {14.1}}$\\
\textbf{SSDM} & \textbf{29.69} $\pm{\ \textbf{3.2}}$& \textbf{29.21} $\pm{\ \textbf{3.5}}$& \textbf{0.804} $\pm{\ \textbf{0.129}}$& \textbf{0.785} $\pm{\ \textbf{0.134}}$& \textbf{24.6} $\pm{\ \textbf{12.4}}$& \textbf{26.0} $\pm{\ \textbf{13.8}}$\\
\bottomrule
\end{tabular}
\label{tab:table2}
\end{table}




% *** DISCUSSION / CONCLUSION ***
\section{Discussion and Conclusion}
This work introduces a two-stage diffusion-based MRI reconstruction framework that operates in a learnable sparse space.
By integrating coupled convolutional dictionary learning with generative diffusion modeling, the proposed method enables subspace-consistent reconstruction under measurement constraints, which improves fine-detail preservation and suppresses reconstruction artifacts.

Aliasing artifacts in accelerated MRI arise from energy leakage of the underlying signals. Under high undersampling, this leakage severely entangles structural information and poses a fundamental challenge for both discriminative and generative reconstruction models.
By decomposing the complex image representation into multiple sparse subspaces, our framework promotes effective disentanglement of signal components and supports faithful subspace-wise reconstruction, which mitigates the risk of generating perceptually plausible but feature-inconsistent results. We note that this framework targets a regime complementary to fully supervised regression and unrolled reconstruction methods: it focuses on unconditional generative priors under extreme undersampling rather than on paired input--output supervision.

The current implementation is limited to single-coil 2D Cartesian acquisitions.
Future work will extend the framework to multi-coil and dynamic MRI settings and investigate richer learnable physics-informed spaces that incorporate more expressive physical priors.
Overall, this study demonstrates that coupling diffusion priors with structured sparse subspace representations provides a practical way to exploit physical signal properties in the transform space and thereby improve MRI reconstruction quality.


\clearpage  % Acknowledgements, references, and appendix do not count toward the page limit (if any)
% Acknowledgments---Will not appear in anonymized version
% \midlacknowledgments{We thank a bunch of people.}

\bibliography{midl-samplebibliography}

\clearpage 
\appendix
\section{Ablation Study}
\subsection{Analysis of Stage 1: Convolutional Dictionary Learning}

We investigated the trade-offs among sparse representation capacity, computational efficiency, and feature expressiveness by analyzing the sparse channel configuration and the thresholding strategy. Examples of the resulting sparse features are shown in Figures~\ref{fig:fig5} and~\ref{fig:fig6}.

\paragraph{1. Sparse Channel Capacity and Dead Channels}
We evaluated the impact of the sparse channel dimension ($N_c$) by testing configurations with $N_c \in \{4, 8, 16\}$.
\begin{itemize}
    \item \textbf{Low Channel Count ($N_c=4$):} Reducing the number of channels restricts the model's ability to decompose complex signals into sufficiently simple components. The limited capacity forces the model to encode dense information into fewer channels, so the individual sparse feature maps remain complex.
    \item \textbf{High Channel Count ($N_c=16$):} While increasing the number of channels theoretically enhances representational capacity, we observed the \textbf{`dead channel'} phenomenon, in which a subset of filters fails to capture meaningful signals and remains persistently inactive. Furthermore, a larger $N_c$ significantly increases the computational burden of the second-stage diffusion process without yielding proportional performance gains.
    \item \textbf{Selected Setting ($N_c=8$):} Among the tested configurations, $N_c=8$ offered the best balance, maintaining good sparsity and representation quality while avoiding dead channels and high computational cost.
\end{itemize}

\paragraph{2. Sparsity-Fidelity Trade-off via Soft-Thresholding}
We analyzed the effect of the initial threshold value in the learnable soft-thresholding function on the sparsity rate.
\begin{itemize}
    \item \textbf{Increasing the Threshold:} Raising the initial threshold of the learnable soft-thresholding function suppresses more coefficients to zero, which increases the sparsity rate of the latent representation.
    \item \textbf{Impact on Reconstruction:} However, the image representation relies heavily on the expressiveness of these sparse features. Excessive suppression (via a high threshold) limits this expressiveness and degrades the quality of self-reconstruction. Therefore, the threshold is initialized at 1 for the first block and scaled by a factor of 0.8 for each subsequent block.
\end{itemize}
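
For concreteness, the initialization schedule above can be written out as follows. The notation is introduced here for illustration only and is an assumption rather than the exact form used in our implementation: $\mathcal{S}_{\tau}$ denotes the standard element-wise soft-thresholding operator applied to a real-valued coefficient $z$, and $\tau_k^{(0)}$ denotes the initial threshold of the $k$-th block, with blocks indexed from $k=1$:
\begin{equation*}
\mathcal{S}_{\tau}(z) = \operatorname{sign}(z)\,\max\bigl(|z|-\tau,\,0\bigr),
\qquad
\tau_k^{(0)} = 0.8^{\,k-1}, \quad k = 1, 2, \dots
\end{equation*}
Under this reading, the first three blocks start from thresholds of $1$, $0.8$, and $0.64$. Because the thresholds are learnable, these values only fix the starting point of training, and the learned thresholds may differ.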

% --- First figure ---
\begin{figure}[hbt!]
    \centering
    \includegraphics[width=0.6\textwidth]{Figures/Source Sparse Coefficient.pdf} % width is adjustable, e.g., 0.6 or 0.8
    \caption{The sparse features of the under-sampled image at $8\times$ acceleration.}
    \label{fig:fig5}
\end{figure}

% --- Second figure ---
\begin{figure}[hbt!]
    \centering
    \includegraphics[width=0.6\textwidth]{Figures/Target Sparse Coefficient.pdf}
    \caption{The sparse features of the fully sampled reference image.}
    \label{fig:fig6}
\end{figure}

\subsection{Analysis of Stage 2: Subspace Diffusion Model}

In the second stage, we validated the architectural design with respect to global consistency and the impact of the sampling trajectory.

\paragraph{1. Cross-Batch Attention and Manifold Consistency}
We investigated the critical role of the cross-batch attention mechanism in coordinating the generation of multiple subspaces.
Since the subspaces are decomposed parts of a single anatomical structure, their generation trajectories must be coupled so that they converge onto a unified image manifold. When the cross-batch attention was removed, the subspaces were generated independently without global guidance, which led to spatial misalignment between them. Aggregating the misaligned subspaces produced severe blurring and loss of coherence, confirming that an information-exchange pathway is necessary for global spatial consistency.

\subsection{Performance Statistics}
Table~\ref{tab:performance} summarizes the model parameter count, the runtime of a single forward pass of the backbone network, and the end-to-end GPU runtime per slice for the different reconstruction methods. Because of their intrinsic inference mechanisms, the compared methods use different numbers of sampling steps. Specifically, DPS and Score-MRI follow standard diffusion sampling with 1000 steps, whereas DDNM, DDS, and SSDM use deterministic sampling strategies that require only 100 steps. All methods are evaluated under their native inference settings rather than with a unified step count. Since DPS and DDNM use the same diffusion backbone as DDS, we do not separately report their parameter counts and per-forward runtimes and provide only their end-to-end runtimes for comparison. All experiments are conducted on a single NVIDIA A100 GPU.

Although Score-MRI employs a lightweight diffusion backbone, its end-to-end runtime remains high (498.05 s/slice) due to the large number of sampling steps. In contrast, DDS substantially reduces the total number of network function evaluations (NFE) and achieves a significant speedup (13.30 s/slice) despite using a larger backbone. SSDM further extends DDS by introducing subspace-wise multi-stage sampling, which increases both the per-forward computational cost and the total NFE.

Overall, these results indicate that SSDM trades increased computational cost for improved reconstruction performance. It decomposes a complex reconstruction problem into multiple simpler subproblems solved in parallel; its runtime is higher than that of DDS but remains within the same order of magnitude (86.02 vs.\ 13.30 s/slice), while the model capacity is comparable (367.4M vs.\ 361.4M parameters).
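
As a rough consistency check on this trade-off, the end-to-end runtimes in Table~\ref{tab:performance} can be split into a per-evaluation factor and an implied evaluation count. The estimate below assumes that the entire per-slice time is spent in backbone evaluations, which ignores data-consistency steps and other overheads; the implied counts are therefore upper bounds rather than measured NFE. Here $T$ denotes the runtime per slice and $t$ the runtime per network evaluation, both symbols being introduced only for this calculation:
\begin{align*}
\frac{T_{\mathrm{SSDM}}}{T_{\mathrm{DDS}}} &= \frac{86.02}{13.30} \approx 6.5, &
\frac{t_{\mathrm{SSDM}}}{t_{\mathrm{DDS}}} &= \frac{281.57}{95.82} \approx 2.9, \\
\frac{T_{\mathrm{DDS}}}{t_{\mathrm{DDS}}} &= \frac{13.30\ \mathrm{s}}{95.82\ \mathrm{ms}} \approx 139, &
\frac{T_{\mathrm{SSDM}}}{t_{\mathrm{SSDM}}} &= \frac{86.02\ \mathrm{s}}{281.57\ \mathrm{ms}} \approx 306.
\end{align*}
Under this assumption, the roughly $6.5\times$ runtime gap factors into about $2.9\times$ from the more expensive forward pass and about $2.2\times$ ($306/139$) from the additional evaluations.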

\begin{table}[t]
\centering
\caption{Model complexity and inference efficiency comparison.}
\label{tab:performance}
\begin{tabular}{lccc}
\toprule
\textbf{Model} & \textbf{Parameters (M)} & \textbf{ms/NFE} & \textbf{Sec/Slice}\\
\midrule
DPS         & --    & --                & 199.68 \\
DDNM        & --    & --                & 12.0 \\
Score-MRI   & 61.4  & 117.66 $\pm$ 0.54 & 498.05 \\
DDS         & 361.4 & 95.82 $\pm$ 0.19  & 13.30 \\
SSDM (Ours) & 367.4 & 281.57 $\pm$ 0.28 & 86.02 \\
\bottomrule
\end{tabular}
\end{table}

\end{document}
