

\section{Methodology}
\subsection{Background}
\textbf{Diffusion models} are a class of generative models that approximate complex data distributions by iteratively transforming random noise into structured samples. They operate in two phases: a \textit{forward diffusion process}, in which Gaussian noise is incrementally added to training data over a sequence of steps, and a \textit{reverse denoising process}, in which a network is trained to recover the original data by gradually removing the noise. The learned denoiser allows the model to sample from the target distribution by reversing the noising trajectory. The forward process is modeled as a Markov chain. Given an initial data sample \( x_0 \), the transition from step \( t-1 \) to step \( t \) is defined as
\begin{equation}
q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I\right),
\end{equation}
where \( \beta_t \) is the noise schedule controlling the amount of noise added at each step. The final state \( x_T \) approaches pure Gaussian noise.
To train the model, we minimize the conditional noise-prediction objective
\begin{equation}
L_{\text{cond}} = \mathbb{E}_{x_0, t, \epsilon, c} \left[ \| \epsilon - \epsilon_\theta(x_t, t, c) \|^2 \right],
\end{equation}
where \( \epsilon \sim \mathcal{N}(0, I) \) is the noise added to \( x_0 \) and \( \epsilon_\theta(x_t, t, c) \) is the model's prediction of that noise at time step \( t \) given the condition \( c \).
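For reference, the noised input \( x_t \) in this objective does not have to be produced by simulating the chain step by step. Under the standard DDPM formulation, composing the Gaussian transitions gives a closed-form marginal. The symbol \( \bar{\gamma}_t \) is notation introduced here for illustration only, chosen to avoid a clash with the LoRA scale \( \alpha \):
\begin{equation}
q(x_t \mid x_0) = \mathcal{N}\left(x_t; \sqrt{\bar{\gamma}_t}\, x_0, (1 - \bar{\gamma}_t) I\right), \qquad \bar{\gamma}_t = \prod_{i=1}^{t} (1 - \beta_i),
\end{equation}
so that \( x_t = \sqrt{\bar{\gamma}_t}\, x_0 + \sqrt{1 - \bar{\gamma}_t}\, \epsilon \) with \( \epsilon \sim \mathcal{N}(0, I) \).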


\input{sec/data-figure}



\subsection{Diffusion Backbone}

We build our framework on a Diffusion Transformer (DiT), leveraging its capacity for large-scale data and high-fidelity generation. We first pre-train DiT-XL/2 on approximately 2 million unlabeled chest radiograph (CXR) patches, enabling the model to capture the semantic structure of thoracic anatomy and generate realistic CXR patches. To specialize the model for localized nodule synthesis, we fine-tune it on curated nodule patches using binary masks as spatial conditioning signals. These masks define the target region for nodule placement, so that at inference the model synthesizes nodules at desired locations through inpainting while preserving the surrounding lung structures, as shown in Figure~\ref{fig:Diffusion baseline generation}. To further improve adherence to the conditioning masks, we apply classifier-free guidance (CFG) during inference.
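As a sketch of how CFG is commonly applied at sampling time, the guided noise estimate for a mask condition \( c \) can be written as below. The guidance scale \( w \) and the null condition \( \emptyset \) are notation assumed here for illustration, not settings reported in this work:
\begin{equation}
\hat{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \left( \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset) \right).
\end{equation}
Setting \( w = 1 \) recovers the purely conditional prediction, while \( w > 1 \) pushes samples toward stronger agreement with the mask.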

\noindent \textbf{Rationale for Separate LoRA Adapters:}
Using CFG to control multiple semantic labels at the same time competes with mask-based conditioning and degrades boundary fidelity: empirically, applying CFG to both masks and labels produced suboptimal results. We therefore reserve CFG for mask conditioning and use LoRA modules, which are computationally inexpensive, for characteristic control. A comparison between CFG-based label control and our separate LoRA approach is provided in Appendix~\ref{sec:cfgvslora}.




\subsection{Characteristic-Specific LoRA Adapters}
% \label{sec:Characteristic-Specific LoRA Adapters}
We design characteristic-specific LoRA adapters for clinically relevant nodule attributes. LoRA freezes the pre-trained weights and learns a compact set of low-rank updates, significantly reducing the number of trainable parameters. Given a pre-trained weight matrix \( W_0 \in \mathbb{R}^{d \times k} \), LoRA parameterizes its update as \( \Delta W = A B \), where \( A \in \mathbb{R}^{d \times r} \) and \( B \in \mathbb{R}^{r \times k} \), with \( r \ll \min(d, k) \). Each characteristic adapter is trained on samples curated for its respective characteristic, while the DiT-XL/2 backbone remains frozen. Qualitative results for individual characteristics are presented in Figure~\ref{fig:nodulecharwiselora}. For the subtlety attribute, we extend the Concept Sliders approach~\cite{gandikota-2023} to leverage the graded annotations in our dataset. Radiologists labeled each nodule on a discrete scale from 1 (most subtle) to 5 (most obvious). To incorporate this into training, we map the annotation level directly to the LoRA scale parameter $\alpha$. Nodules with lower subtlety scores are assigned smaller $\alpha$ values, yielding weaker updates and less visible nodules, whereas higher scores correspond to larger $\alpha$ values, amplifying the updates and making nodules more apparent. The effective weight is therefore
\begin{equation}
    W = W_{0} + \alpha(s) \, \Delta W
\end{equation}

\noindent where $s \in \{1,2,3,4,5\}$ is the radiologist-assigned subtlety score and $\alpha(s) = 2^{2+s}$ is the LoRA scale, a mapping that we found to work well empirically when training the subtlety slider. During inference, we vary $\alpha$ to generate nodules at different subtlety levels, as shown in Figure~\ref{fig:scale_comparison}.
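As a worked instance of this training-time mapping, substituting each score into $\alpha(s) = 2^{2+s}$ gives
\begin{equation}
\alpha(1) = 8, \quad \alpha(2) = 16, \quad \alpha(3) = 32, \quad \alpha(4) = 64, \quad \alpha(5) = 128,
\end{equation}
so each one-level increase in the subtlety score doubles the scale applied to $\Delta W = AB$. These five values describe only the scales seen during training; since $\alpha$ is set directly at inference, it can presumably take intermediate values as well.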


\begin{figure}[t]
\centering
\begin{tabular}{ccc}
{\scriptsize (a) Subtlety ($\alpha=24$)} & {\scriptsize (b) Subtlety ($\alpha=16$)} & {\scriptsize (c) Subtlety ($\alpha=12$)} \\[4pt]
\includegraphics[width=0.15\linewidth]{char_wise_nod/subt/2.png} &
\includegraphics[width=0.15\linewidth]{char_wise_nod/subt/1.png} &
\includegraphics[width=0.15\linewidth]{char_wise_nod/subt/0.png} \\[-2pt]
\end{tabular}
\caption{Subtlety slider generations on a CXR patch. From left to right: decreasing LoRA scale $\alpha$ values produce more subtle nodules.}
\label{fig:scale_comparison}
\end{figure}



\subsection{Challenges in Multi-Characteristic Synthesis}

Conventional merging strategies such as linear merging and switching~\cite{zhong2024multiloracompositionimagegeneration} assume that LoRA updates combine uniformly across layers. In practice, this leads to suboptimal synthesis: one characteristic often dominates, and artifacts frequently appear near mask boundaries. LoRA weights are also highly sparse (60--70\% near zero; $|w| < 0.01$), so a small subset of parameters drives perceptual change and naive averaging becomes brittle~\cite{ouyang2025kloraunlockingtrainingfreefusion}. Two factors contribute most: \textbf{overlapping attention regions} and \textbf{non-orthogonal updates}. Unlike natural-image settings, where attributes may be spatially distinct, all adapters here operate on the same nodule region, causing competing updates. % for example, calcification and homogeneity adapters generate overlapping attention that conflicts across timesteps.
Our Frobenius-norm orthogonality analysis~\cite{Nakayama1952Orthogonality} further shows that independently trained adapters are not well separated, leading to correlated updates. This non-orthogonality causes interference regardless of the merging strategy employed.



\subsection{Orthogonality-Constrained Adapter Merging}
% To mitigate interference across multiple adapters, we encourage orthogonality between their weight matrices during training. Let $W_1$ and $W_2$ denote the LoRA weight updates for two different characteristics. We introduce an orthogonality regularizer based on the Frobenius norm:
To reduce interference, we promote orthogonality between adapter subspaces during training. For adapters with updates $W_1$ and $W_2$, we add the regularizer
\begin{equation}
\mathcal{L}_{\text{orth}} = \| W_1 W_2^\top \|_F^2,
\end{equation}
which penalizes correlations between the parameter subspaces of different adapters. When more than two adapters are used, the term is applied pairwise; although the formulation naturally extends to merging multiple characteristics, in this work we evaluate only two-attribute combinations. We weight this term in the final loss by a coefficient $\lambda$, which trades off orthogonality against reconstruction fidelity. With this constraint, adapters compose reliably: linear averaging suffices, and each characteristic remains controllable via its scalar $\alpha$. As illustrated in Figure~\ref{fig:lora-merge-rotated}, our merging adheres to the nodule mask more closely, better preserves its structure, and expresses the target characteristics more faithfully than linear merging applied only at inference time. We also analyse orthogonality across the 28 transformer layers in Figure~\ref{fig:orthogonality-layers}, which shows that cross-trained adapters yield much lower Frobenius norms than separately trained adapters.
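
To make the pairwise extension concrete, one way to write the regularizer and the combined objective for $N$ adapters is sketched below. The symbols $\mathcal{L}_{\text{orth}}^{(N)}$ and $\mathcal{L}_{\text{total}}$ and the unweighted sum over pairs are assumptions made for illustration; only the two-adapter term and the coefficient $\lambda$ are taken from the text:
\begin{equation}
\mathcal{L}_{\text{orth}}^{(N)} = \sum_{1 \le i < j \le N} \| W_i W_j^\top \|_F^2, \qquad \mathcal{L}_{\text{total}} = L_{\text{cond}} + \lambda \, \mathcal{L}_{\text{orth}}^{(N)}.
\end{equation}
For intuition, writing $w^{(1)}_p$ and $w^{(2)}_q$ for the rows of $W_1$ and $W_2$, the two-adapter term expands as
\begin{equation}
\| W_1 W_2^\top \|_F^2 = \sum_{p, q} \left\langle w^{(1)}_p, w^{(2)}_q \right\rangle^2,
\end{equation}
which vanishes exactly when every row of $W_1$ is orthogonal to every row of $W_2$, that is, when the two row spaces are orthogonal.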






\begin{figure}[t]
    \centering
    \small
    \includegraphics[width=0.4\linewidth]{ortho_plot.png}
    \caption{Layer-wise Frobenius norm comparison of cross-trained vs.\ separately trained adapters across two characteristic pairs.}
    \label{fig:orthogonality-layers}
\end{figure}

\label{subsec:ortho}
