\documentclass{midl}

\usepackage{mwe}
\jmlrvolume{-- 132}
\jmlryear{2024}
\jmlrworkshop{Full Paper -- MIDL 2024}
\editors{Accepted for publication at MIDL 2024}

\title[LupusNet]{Lupus Nephritis Subtype Classification with only Slide Level labels}
\midlauthor{
\Name{Amit Sharma\midljointauthortext{Contributed equally}\nametag{$^{1}$}} \Email{amit.s@research.iiit.ac.in}\\
\addr $^{1}$ Center for Visual Information Technology, International Institute of Information Technology, Hyderabad, India\\
\Name{Ekansh Chauhan\midlotherjointauthor\nametag{$^{1}$}} \Email{ekansh.chauhan@research.iiit.ac.in}\\
\Name{Megha S Uppin\nametag{$^{2}$}} \Email{megha\_harke@yahoo.co.in}\\
\addr $^{2}$ Department of Pathology, Nizam’s Institute Of Medical Sciences, Hyderabad, India\\
\Name{Liza Rajasekhar\nametag{$^{3}$}} \Email{lizarajasekhar@gmail.com}\\
\addr $^{3}$ Department of Clinical Immunology and Rheumatology, Nizam’s Institute Of Medical Sciences, Hyderabad, India\\
\Name{C V Jawahar\nametag{$^{1}$}} \Email{jawahar@iiit.ac.in}\\
\Name{P K Vinod\nametag{$^{4}$}} \Email{vinod.pk@iiit.ac.in}\\
\addr $^{4}$ Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad, India
}

\begin{document}

\maketitle

\begin{abstract}
Lupus nephritis (LN) classification has historically relied on labor-intensive, meticulous glomerular-level labeling of renal structures in whole slide images (WSIs). This approach is tedious and resource-intensive, which limits its scalability and practicality in clinical settings. In response, our work introduces a novel methodology that uses only slide-level labels, eliminating the need for granular glomerular-level labeling. A comprehensive multi-stained LN digital histopathology WSI dataset was created from the Indian population, which is the largest of its kind. \textit{LupusNet}, a deep learning model based on multiple instance learning (MIL), was developed to classify LN subtypes. It achieves an AUC of 91.0\%, an F1 score of 77.3\%, and an accuracy of 81.1\% on our dataset in distinguishing the membranous and diffuse classes of LN.
\end{abstract}
\begin{keywords}
Lupus Nephritis, Weakly Supervised Learning, Whole Slide Image, Binary Classification
\end{keywords}

\section{Introduction}
\label{sec:intro}

Lupus Nephritis (LN) is one of the most severe manifestations of systemic lupus erythematosus (SLE), an autoimmune disease, due to its potential for severe renal damage and its intricate diagnostic and classification process. The complexity of the disease is compounded by substantial inter- and intra-observer variability in the assessment of histopathological renal biopsies \cite{dasari_chakraborty_truong_mohan_2019}. Because LN classes differ in aggressiveness, precise classification is crucial for assessing fatality risks, predicting long-term prognosis, and determining a practical therapeutic approach.

Deep learning has recently emerged as a powerful tool in medical AI and healthcare, revolutionizing various aspects of medicine, from diagnosis and treatment to drug discovery and patient monitoring \cite{rajkomar_oren_chen_dai_hajaj_hardt_liu_liu_marcus_sun_et_al_2018}. Digital pathology has significantly advanced due to its capacity to extract intricate patterns and features from complex medical data \cite{WU2023100184, ahmed_abouzid_kaczmarek_2022}. Improvements in image analysis have led to significant advancements in various aspects of renal pathology, including automated detection and classification of glomerular lesions \cite{sheehan_korstanje_2018, ginley2019computational}, and identification of interstitial fibrosis \cite{zheng_cassol}. Advanced imaging techniques and molecular analyses may assist, but standardization and consensus in interpretation remain ongoing challenges.

Traditional LN classification follows a two-step process: first identifying glomeruli types, then classifying LN based on these types. It therefore depends heavily on detailed glomeruli annotations~\cite{sheehan_korstanje_2018, diagnostics11111983}. Yet annotating glomeruli on large-scale WSIs is impractical in clinical settings, and the massive size of WSIs, combined with memory limitations, has led to patching and streaming solutions~\cite{Campanella2019, pinckaers2020streaming}. Previous studies mainly differentiated LN from non-LN and did not address subtype classification~\cite{wang_xu_wang_wang_leng_fu_liu_qin_huang_2023}, which is complicated by similar glomerular types across subtypes and the unequal contribution of glomeruli to classification. \cite{cicalese2020kidney} proposed an end-to-end LN subtype classification method, but it required manual segmentation of mouse biopsies and is not directly applicable to human samples due to differences in physiology and pathology.

In contrast, our work simplifies this process with an end-to-end pipeline that does not rely on glomeruli class labels at any intermediate stage. Multiple Instance Learning (MIL) has been extensively explored in other areas of digital histopathology \cite{Campanella2019}, but it remains largely unexplored in renal pathology.

While digital pathology has made strides, LN classification research faces challenges such as limited access to datasets and a lack of consensus among medical professionals regarding classification. In light of these considerations, the principal contributions of our work are as follows:
\begin{itemize}
  \item We focus on creating a valuable dataset of LN to drive research (computational and medical) in kidney diseases. This dataset, featuring multi-stained whole slide images, stands as one of the largest collections for lupus nephritis, a part of the consortium India Pathology Dataset (IPD) \footnote{\href{https://hai.iiit.ac.in/ipd/}{https://hai.iiit.ac.in/ipd/} }.
  
  \item We also introduce a novel architecture, LupusNet, an explainable MIL-based model that significantly improves LN subtype classification by integrating Gated and Multi-Head Attention, underscoring the critical requirement to learn the morphological differences between LN classes 4 \& 5.

    \item To the best of our knowledge, we present the first end-to-end pipeline for LN subtype classification by relying only on slide-level labels, eliminating two-step methods that relied on glomeruli labels, easing clinical workload and facilitating practical integration.

\end{itemize}

\section{Materials and Method}
\subsection{Data Acquisition \& Description}
In this study, biopsy specimens of 166 patients (retrospective and prospective cases) spanning LN subclasses 1 to 6 from the Nizam's Institute of Medical Sciences (NIMS) in Hyderabad, India, were digitized. A total of 540 WSIs were scanned using the Morphle Optimus 6X Scanner, with each WSI captured at a maximum magnification of 40x and stored in TIFF format. Slide-level labels indicating the LN subtype of each case were also recorded.

Within this repository of 540 WSIs, there are four distinct categories of stained images: Hematoxylin and Eosin (H\&E), Periodic Acid-Schiff (PAS), methenamine silver Periodic Acid-Schiff (mt-PAS), and silver methenamine Periodic Acid-Schiff (sm-PAS). In this dataset, LN classes 4 (diffuse proliferative) and 5 (membranous) had the highest representation, with 62 and 53 cases, respectively. Class 4 LN displays a varied glomerular appearance characterized by widespread inflammation, cellular proliferation, and diverse lesions, whereas class 5 LN demonstrates a uniform appearance due to immune complex deposition, resulting in a membranous pattern \cite{weening2004classification}. Figure \ref{fig:3} shows glomerulus samples from our dataset for each class. Consequently, our study focused primarily on classifying these two prominent LN classes using PAS-stained slides, which highlight carbohydrates, glycogen, and glycoproteins and thereby aid the identification of renal structures.

This India-specific dataset was created to support global collaboration in lupus nephritis research. It adds diversity to existing cohorts, offering insights into potential regional and ethnic variations in the disease.
\begin{figure}[ht]
\floatconts
  {fig:3}
  {\caption{Comparison of visual features between subtype samples. (a) involves proliferative changes in the glomeruli, whereas (b) shows thickening of the glomerular basement membrane}}
  {%
    \subfigure[Class 4 glomeruli]{\label{fig:image-a2}%
      \includegraphics[width=0.30\linewidth]{images/ln4_ex.png}}%
    \qquad
    \subfigure[Class 5 glomeruli]{\label{fig:image-b2}%
      \includegraphics[width=0.30\linewidth]{images/ln5_ex.png}}
  }
\end{figure}

\begin{figure*}[t!]
\floatconts
  {fig:example}
  {\caption{\textbf{LupusNet}: Proposed architecture for our lupus nephritis classifier. Gated attention identifies each glomerulus's importance, while multi-head attention (MHA) discerns their contextual relationships.}}
  {\includegraphics[width=\textwidth]{images/arch.png}}
\end{figure*}

\subsection{Methodology}
We aim to learn a function that predicts the presence or absence of a condition within a WSI based on its constituent patches. Formally, we are given a dataset of bag-label pairs $\{(X_i, Y_i)\}_{i=1}^D$. Each $X_i$ is a bag of instances (patches), and $Y_i$ is the label assigned to that bag. Each bag $X_i = \{x_1, x_2, \ldots, x_N\}$ contains a variable number $N$ of instances. These instances have labels $\{y_1, y_2, \ldots, y_N\}$ with $y_n \in \{0,1\}$; however, the instance labels are unknown during training. If any instance in a bag belongs to the positive class, the bag is considered positive. Conversely, if all the instances in a bag belong to the negative class, the bag is considered negative.

\vspace{-10pt}

\begin{align*}
Y_i =
\begin{cases}
1, & \text{if } \exists  x_{n} \in X_i \text{ such that } y_n = 1 \\
0, & \text{otherwise}
\end{cases}
\end{align*}
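As a small illustration of this rule, the bag label is simply the maximum of the instance labels. The instance labels below are hypothetical and chosen only for the example:

\begin{align*}
Y_i &= \max_{n} y_n, \\
Y_i &= 1 \quad \text{for } (y_1, y_2, y_3) = (0, 0, 1), \\
Y_i &= 0 \quad \text{for } (y_1, y_2, y_3) = (0, 0, 0).
\end{align*}

A single positive patch is therefore enough to make the whole bag positive.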

Our methodology extends this formulation to multiple positive classes for LN subtype classification. Unlike lung, brain, and breast pathology, renal pathology primarily focuses on a limited region of interest, particularly the glomerular area, allowing us to use recurrent networks. Glomeruli play a pivotal role in various renal diseases, including LN. Instead of providing MIL with all WSI patches, we exclusively use glomerular patches, enhancing precision by avoiding potential noise. Because labeling at the glomerular level is laborious, we aimed to eliminate the need for intermediate glomerular-level labels, which makes weakly supervised approaches an appropriate choice.

Our novel end-to-end MIL architecture for LN classification, LupusNet, works on raw glomerular patches extracted using a fine-tuned YOLOv4 model \cite{hemmatirad2023investigation}. It has two key components that are jointly trained: (a) a Feature Extractor ($f$) and (b) a Feature Aggregator ($g$). $f$ transforms inputs into an information-rich feature space using a ResNet-50 network pre-trained on histopathology images \cite{kang2023benchmarking}. We build on the principles of CLAM \cite{Lu2021}, which uses gated attention pooling and instance-level clustering to distinguish positive from negative samples. Gated attention, however, cannot fully exploit the uniformity of class 5 lupus nephritis glomeruli, which limits how well it captures their consistent patterns. We hypothesize that adding contextual information among all glomeruli patches will improve performance. To this end, we integrate self-attention and Bi-LSTM into the MIL framework, enhancing contextual understanding among instances (patches) in a WSI.

Suppose a WSI bag $X$ contains $N$ glomerular patches, and the Feature Extractor $f$ transforms each image $x_n \in \mathbb{R}^{224 \times 224 \times 3}$ into a feature vector $h_n \in \mathbb{R}^{1 \times d}$. For $N$ such images, we obtain a matrix $H \in \mathbb{R}^{N \times d}$ (Eq.~\ref{eq1}). Our feature aggregator is divided into three branches: (1) Gated Attention Pooling, (2) Self-Attention + LSTM, and (3) Instance-level Clustering. In Branch 1, the gated attention block assigns attention scores $A^g = \{a^g_1, a^g_2, \ldots, a^g_N\} \in \mathbb{R}^{1 \times N}$ to the instances (Eq.~\ref{eq2}), followed by instance-level clustering using $A^g$ as pseudo labels for confident instances (Branch 3).


\vspace{-5mm}

\begin{align}
    H &= f(X;\Theta) \quad \text{where }H = \{h_1, h_2, \ldots, h_N\}
    \label{eq1}
\end{align}

\vspace{-5mm}


\begin{align}
    a_k^g = \frac{W_c^T (\tanh(W_a h_{k}^T) \odot \sigma(W_b h_{k}^T))}{\sum^N_{j=1} W_c^T (\tanh(W_a h_{j}^T) \odot \sigma(W_b h_{j}^T))}
    \label{eq2}
\end{align}

\vspace{-5mm}

\begin{align}
    C^g = \sum_{k=1}^N a_k^g h_k
    \label{eq3}
\end{align}

% \vspace{-5mm}

where $W_a$, $W_b$, and $W_c$ are trainable parameters, and $a_k^g$ can be interpreted as the probability that instance $k$ is positive. $\sigma$ denotes the sigmoid function and $\odot$ denotes element-wise multiplication. $C^g$ is the output context vector of Branch 1 (Eq.~\ref{eq3}).
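
To make Eqs.~\ref{eq2} and \ref{eq3} concrete, consider a toy bag with $N = 3$ instances and $d = 2$. Here $u_k$ is shorthand, introduced only for this example, for the numerator of Eq.~\ref{eq2} taken as written. The values of $u_k$ and $h_k$ are hypothetical and are not outputs of the trained model:

\begin{align*}
(u_1, u_2, u_3) &= (2,\ 1,\ 1), \qquad \textstyle\sum_{j=1}^{3} u_j = 4, \\
(a^g_1, a^g_2, a^g_3) &= (0.5,\ 0.25,\ 0.25), \\
(h_1, h_2, h_3) &= \big((1, 0),\ (0, 1),\ (1, 1)\big), \\
C^g &= 0.5\,h_1 + 0.25\,h_2 + 0.25\,h_3 = (0.75,\ 0.5).
\end{align*}

The scores sum to one, so $C^g$ is a weighted average of the instance features in which the highest-scoring glomerulus carries the most weight.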

In Branch 2, $H$ is first passed to MHA, yielding a contextualized representation of the instances ($A^s$). Self-attention (Eq.~\ref{eq4}) considers the context between every pair of instances, and the multi-head mechanism models several such contextual relationships and dependencies among instances in parallel. The outputs of the $n_h$ heads are concatenated, and a linear transformation is applied so that the result matches the shape of the input, $\mathbb{R}^{N \times d}$ (Eq.~\ref{eq5}). To further process this contextualized information, we employ an LSTM, which uses gating mechanisms and outputs the hidden state of the last time step, a vector in $\mathbb{R}^{1 \times d}$.

\vspace{-5mm}

\begin{align}
    a^{self}_i&= \text{softmax}\left(\frac{Q_i K_i^T}{\sqrt{d_k}}\right) V_i
\label{eq4}
\end{align}

\vspace{-5mm}

\begin{align}
    A^s &= (a^{self}_1 \oplus a^{self}_2 \oplus \ldots \oplus a^{self}_{n_h}) W_o
    \label{eq5}
\end{align}

where $Q_i = HW_i^Q$, $K_i = HW_i^K$, and $V_i = HW_i^V$ for the $i^{th}$ head are derived using trainable parameters $W_i^Q, W_i^K, W_i^V$, and $W_o$ linearly transforms the concatenated multi-head outputs. The factor $\sqrt{d_k}$ scales the dot product to prevent it from becoming too large, and $C^s$ is the output context vector of Branch 2, obtained by processing $A^s$ with the Bi-LSTM.
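
The shapes involved in Eqs.~\ref{eq4} and \ref{eq5} can be tracked as follows. This is only a bookkeeping sketch: it assumes the common convention $d_k = d / n_h$ for the per-head dimension, which is not stated explicitly above, and it assumes that the projection matrices have the shapes shown.

\begin{align*}
W_i^Q,\ W_i^K,\ W_i^V &\in \mathbb{R}^{d \times d_k}, \\
Q_i,\ K_i,\ V_i &\in \mathbb{R}^{N \times d_k}, \\
\text{softmax}\left(Q_i K_i^T / \sqrt{d_k}\right) &\in \mathbb{R}^{N \times N}, \\
a^{self}_i &\in \mathbb{R}^{N \times d_k}, \\
a^{self}_1 \oplus \ldots \oplus a^{self}_{n_h} &\in \mathbb{R}^{N \times n_h d_k}, \\
W_o \in \mathbb{R}^{n_h d_k \times d} \;\Rightarrow\; A^s &\in \mathbb{R}^{N \times d}.
\end{align*}

For example, with the hypothetical values $d = 512$ and $N = 20$ glomeruli and with $n_h = 4$ heads, each head works in $d_k = 128$ dimensions, the attention map of each head is $20 \times 20$, and $A^s$ is $20 \times 512$, matching the shape of $H$.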

Furthermore, we use softmax-normalized learnable parameters $s_0$ and $s_1$ to adaptively aggregate the contributions of the two branch outputs. A learnable scaling parameter $\gamma$ fine-tunes the overall contribution of the merged output, introducing an additional degree of freedom in the weighting process (Eq.~\ref{eq7}). Inspired by attention principles, this approach provides dynamic weighting for effective information extraction from both branches. It parallels the fusion of contextual embeddings from multiple layers in ELMo for downstream tasks \cite{elmo}.

\vspace{-10pt}

\begin{align}
    \text{logits} &= \gamma \left(s_0  C^g + s_1 C^s\right)
    \label{eq7}
\end{align}
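
As a worked instance of Eq.~\ref{eq7}, let $\tilde{s}_0$ and $\tilde{s}_1$ denote the unnormalized learnable parameters. This notation is introduced only for the example, and all numerical values are hypothetical:

\begin{align*}
s_j &= \frac{e^{\tilde{s}_j}}{e^{\tilde{s}_0} + e^{\tilde{s}_1}}, \quad j \in \{0, 1\}, \\
\tilde{s}_0 = \tilde{s}_1 = 0 \;&\Rightarrow\; s_0 = s_1 = 0.5, \\
\gamma = 1,\ C^g = (0.75,\ 0.5),\ C^s = (0.25,\ 0.5) \;&\Rightarrow\; \gamma \left(s_0 C^g + s_1 C^s\right) = (0.5,\ 0.5).
\end{align*}

When the two raw parameters are equal, both branches contribute equally; if training raises one of them relative to the other, the corresponding branch receives more weight in the merged output.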


After adaptive aggregation, a binary classifier with a single neuron and a sigmoid activation function estimates the probability, $y$, of a slide being positive. Binary cross-entropy (BCE) loss is then computed at the slide level (Branches 1 and 2), while Smooth SVM loss~\cite{Lu2021} is applied for instance-level clustering (Branch 3). The Smooth SVM loss, a generalization of the traditional cross-entropy classification loss, accommodates diverse margin values and temperature scaling strategies, providing flexibility to mitigate overfitting. We chose it for its robustness to potential noise in the pseudo-labels. The total loss (Eq.~\ref{eq8}) is the weighted sum of both losses, where $H'$ and $A{^g}'$ are subsets of $H$ and $A^g$, respectively, $\hat{y}$ is the ground truth, and $\beta$ is a hyper-parameter.

\vspace{-5mm}

\begin{align}
    J &= \beta \text{ BCE}(y, \hat{y}) + (1- \beta) \text{ Smooth-SVM}(H',A{^g}')
    \label{eq8}
\end{align}
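
For intuition about the weighting in Eq.~\ref{eq8}, the following worked example uses $\beta = 0.8$ together with hypothetical loss values (they are illustrative, not measured):

\begin{align*}
\text{BCE}(y, \hat{y}) = 0.5, \quad \text{Smooth-SVM}(H', A{^g}') = 1.0
\;\Rightarrow\; J = 0.8 \times 0.5 + 0.2 \times 1.0 = 0.6 .
\end{align*}

With this setting, the slide-level term carries four times the weight of the instance-level clustering term.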

\vspace{-5mm}

\section{Experiments and Results}
\label{sec:evaluations}

\subsection{Experiment Setup}
For a robust evaluation of classification performance, we employed 10-fold cross-validation. All methods were implemented in PyTorch and trained on a single NVIDIA RTX 3080 Ti GPU. The patch size for the YOLOv4-based glomerulus detector was set to $6000 \times 6000$, and MIL training ran for 50--200 epochs with early stopping. We used $n_h = 4$, $\beta = 0.8$, a Bi-LSTM hidden dimension of 512, and the Adam optimizer with a learning rate of $10^{-4}$. The batch size was set to 1 for all models. Our code is available on GitHub\footnote{\href{https://github.com/CancerDiag/LupusNet}{Code: https://github.com/CancerDiag/LupusNet}}.

\subsection{Results}

\begin{table}[h!]
\floatconts
  {tab:results}%
  {\caption{Comparing our proposed model (LupusNet) with baselines, averaging results (in \%) over 10-fold cross-validation on test cohort. Input types include GP (Only Glomeruli Patches) and AP (All Patches).}}%
  {\begin{tabular}{ccccc}
\hline
Model & Input & Test AUC & Test F1 & Test ACC \\
\hline
ResNet-101 & GP & 52.88 ± 20.54 & 44.12 ± 23.01 & 53.23 ± 18.26 \\
Vanilla ViT & GP & 67.00 ± 19.22 & 56.96 ± 23.58 & 62.22 ± 18.14 \\

Max Pool & GP & 81.00 ± 16.12 & 72.82 ± 16.04 & 73.33 ± 15.89\\
Average Pool & GP & 85.50 ± 11.89 & 76.98 ± 17.30 & 77.78 ± 16.93\\
% \begin{tabular}[x]{@{}c@{}}CLAM \end{tabular} & 0 & 0 & 0 \\
% \begin{tabular}[x]{@{}c@{}}CLAM\\(only glomerulus images)\end{tabular} & 0 & 0 & 0 \\
CLAM-SB & AP & 57.65 ± 18.00 & 52.22 ± 15.34 & 52.43 ± 11.26 \\
CLAM-SB & GP & 86.00 ± 14.78 & 72.80 ± 12.66 & 75.55 ± 10.48 \\
DSMIL & GP & 79.50 ± 16.35 & 68.34 ± 17.98 & 71.11 ± 16.08 \\
TransMIL & GP & 54.50 ± 25.73 & 50.11 ± 24.17 & 54.44 ± 22.08 \\
\textbf{LupusNet (Ours)}& \textbf{GP} &\textbf{91.00 ± 08.91}&\textbf{77.30 ± 06.80}&\textbf{81.11 ± 05.36} \\
\hline
\end{tabular}}
\end{table}

We established baselines using a pseudo-labeling approach: lacking detailed glomerulus-level labels, we assigned the whole-slide label to every glomerulus in the slide and trained standard classifiers such as AlexNet, ResNet, and DenseNet, of which ResNet-101 performed best (Table \ref{tab:results}). These experiments underscored the challenge of label inconsistency among similar glomeruli in LN classes 4 and 5, which degrades model accuracy and emphasizes the need for alternative methods in the absence of precisely labeled datasets.

Furthermore, we evaluated an end-to-end vanilla Vision Transformer (ViT) \cite{dosovitskiy2020image} on glomeruli patches, followed by the weakly supervised methods: max pooling, average pooling, the single-branch variant of CLAM (CLAM-SB), DSMIL \cite{li2021dual}, TransMIL \cite{Shao2021TransMILTB}, and our proposed LupusNet, all on the in-house dataset. CLAM-SB results are presented for two scenarios, with either all WSI patches or only the glomeruli patches as input. Pooling methods were competitive, with max pooling achieving 81.00\% AUC, 72.82\% F1 score, and 73.33\% accuracy, and average pooling achieving 85.50\% AUC, 76.98\% F1 score, and 77.78\% accuracy. As shown in Table \ref{tab:results}, LupusNet outperforms all baseline models. Performance improves markedly when only glomeruli patches are provided (the AUC of CLAM-SB rises from 57.65\% to 86.00\%), consistent with reduced noise for the weakly supervised models. In addition, LupusNet improves the F1 score for class 5 LN over CLAM-SB (GP) from 65.17\% to 77.03\%, highlighting its efficacy in distinguishing the two classes, reducing false positives and enhancing precision.

\section{Ablation Study}

\begin{table}[h!]
\floatconts
  {tab:ablation}%
  {\caption{Ablation study with module variations. \\ \textbf{L}=LSTM; \textbf{G}=Gated Attention; \textbf{C}=Clustering (Instance level)}}%
  {\begin{tabular}{cccc}
\hline
Model & Test AUC & Test F1 & Test ACC \\
\hline
LSTM & 64.00 ± 15.77 & 56.27 ± 9.70 & 60.00 ± 10.73\\
% \begin{tabular}[x]{@{}c@{}}CLAM \end{tabular} & 0 & 0 & 0 \\
% \begin{tabular}[x]{@{}c@{}}CLAM\\(only glomerulus images)\end{tabular} & 0 & 0 & 0 \\
L+G & 81.65 ± 13.90 & 67.00 ± 16.32 & 71.11 ± 11.94\\
L+G+C & 85.00 ± 12.24 & 74.91 ± 15.56 & 77.78 ± 12.83\\
ViT + G + C & 73.00 ± 18.01 & 64.99 ± 13.76 & 66.67 ± 12.05 \\
\textbf{LupusNet (Ours)}& \textbf{91.00 ± 08.91}&\textbf{77.30 ± 06.80}&\textbf{81.11 ± 05.36} \\
\hline
\end{tabular}}
\end{table}

In our ablation study, we introduced the architectural components one at a time to evaluate their individual and combined effects on the model's performance. Starting from a basic LSTM model, we then added Gated Attention and Instance-level Clustering. Each addition improved performance, as shown in Table \ref{tab:ablation}, with our final model, LupusNet, outperforming all other configurations. This step-by-step process helped us identify the contribution of each component to the model's overall effectiveness in classifying the two LN classes. We further tuned LupusNet by adjusting the learning rate and the number of Multi-Head Attention (MHA) blocks (Figure \ref{fig:4}).

\begin{figure}[ht]
\floatconts
  {fig:4}
  {\caption{Hyperparameter tuning of LupusNet based on the optimized value of learning rate (left) and number of attention heads (right)}}
  {\includegraphics[width=0.7\textwidth]{images/zlr_f1.png}}
\end{figure}

\section{Discussion and Conclusion}
\label{sec:majhead}
Our study has demonstrated the application of MIL to LN subtype classification using only slide-level labels, eliminating the need for glomeruli-level labels. Our aim was to explore how weakly supervised methods perform in this setting and to propose a framework (LupusNet) that improves on them. Although transformer-based models seem a natural choice given their context sensitivity, their empirical efficacy was suboptimal due to the reduced regions of interest. However, we recognized the need for self-attention among glomeruli for context inclusion. Our work therefore incorporates this aspect without increasing network complexity by using LSTM and MHA. Furthermore, the attention weights can be inspected to infer the contribution of each glomerulus to the final classification, which can also help reduce inter- and intra-observer variability among pathologists. Additionally, the approach holds significance for researchers studying renal diseases beyond LN. Our work also contributes to renal pathology research by creating a digital whole slide image dataset. While LupusNet exhibits promising results, there is room for improvement. Our future work involves improving the glomeruli detection models and feature aggregators to extract better contextual information from glomeruli.

\textbf{Data Availability Statement:} The dataset generated and/or analyzed during the current study is available from the authors within the terms of the data use agreement and compliance with ethical and legal requirements (if any).

\section*{Compliance with Ethical Standards} 
Procedures in studies with human participants adhered to ethical standards set by institutional (NIMS) and/or national research committees (ICMR).
\clearpage
\midlacknowledgments{We acknowledge IHub-Data, IIIT Hyderabad (H1-002) for financial assistance. We also thank Dr. Manasa Kondamadugu for project coordination, Ms. Ramya Alugam, and Mr. Akula Rajesh Goud for data digitalization and organization.}

\bibliography{midl24_132}

\end{document}
