Abstract: Modern vision backbones for 3D medical imaging typically
process dense voxel grids through parameter-heavy encoder-decoder
structures, a design that allocates a significant portion of its parameters to
spatial reconstruction rather than feature learning. Our approach
introduces SVGFormer, a decoder-free pipeline built upon a content-aware
grouping stage that partitions the volume into a semantic graph of
supervoxels. Its hierarchical encoder learns rich node representations by
combining a patch-level Transformer with a supervoxel-level Graph Attention
Network, jointly modeling fine-grained intra-region features and broader
inter-regional dependencies. This design concentrates all learnable
capacity on feature encoding and provides inherent, dual-scale explainability
from the patch to the region level. To validate the framework’s flexibility,
we trained two specialized models on the BraTS dataset: one for
node-level classification and one for tumor proportion regression. Both models
achieved strong performance, with the classification model reaching an
F1-score of 0.875 and the regression model an MAE of 0.028, confirming
the encoder’s ability to learn discriminative and localized features. Our
results establish that a graph-based, encoder-only paradigm offers an
accurate and inherently interpretable alternative for 3D medical image
representation.
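
As a rough illustration of the graph construction and the two node-level targets described above, the following NumPy sketch builds a supervoxel graph from a label map and derives a per-node tumor proportion. It is a minimal sketch under stated assumptions, not the SVGFormer implementation: the function names are hypothetical, regular cubic blocks stand in for the content-aware grouping stage, face adjacency is assumed for the graph edges, and the rule "any tumor voxel makes a node positive" is an assumed labeling convention that the abstract does not specify.

```python
import numpy as np

def block_supervoxels(shape, block):
    """Assign each voxel a block id (a regular-grid stand-in for
    content-aware supervoxels; an assumption for illustration)."""
    nblocks = [int(np.ceil(s / block)) for s in shape]
    axes = [np.arange(s) // block for s in shape]
    z, y, x = np.meshgrid(*axes, indexing="ij")
    return (z * nblocks[1] + y) * nblocks[2] + x

def node_targets(tumor_mask, sv_labels):
    """Per-supervoxel tumor proportion (regression target) and a binary
    node label (assumed rule: positive if any tumor voxel is present)."""
    n = int(sv_labels.max()) + 1
    flat = sv_labels.ravel()
    counts = np.bincount(flat, minlength=n)
    tumor = np.bincount(flat, weights=tumor_mask.ravel().astype(float),
                        minlength=n)
    proportion = tumor / np.maximum(counts, 1)
    return proportion, (proportion > 0).astype(int)

def adjacency_edges(sv_labels):
    """Undirected edges between supervoxels that share a voxel face."""
    edges = set()
    for axis in range(sv_labels.ndim):
        moved = np.moveaxis(sv_labels, axis, 0)
        a, b = moved[:-1].ravel(), moved[1:].ravel()
        differ = a != b
        for u, v in zip(a[differ], b[differ]):
            edges.add((int(min(u, v)), int(max(u, v))))
    return sorted(edges)

# Toy 4x4x4 volume split into eight 2x2x2 blocks.
sv = block_supervoxels((4, 4, 4), 2)
mask = np.zeros((4, 4, 4), dtype=bool)
mask[:2, :2, :2] = True   # block 0 is entirely tumor
mask[2, 2, 2] = True      # one tumor voxel in block 7
proportion, labels = node_targets(mask, sv)
edges = adjacency_edges(sv)
```

In this toy setup each node would carry one proportion value in [0, 1] for the regression model and one binary label for the classification model, while the edge list defines which supervoxels a graph attention layer could attend across.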