Keywords: Vision-language models, white matter tractography, graph neural networks, medical imaging, multi-task learning
Abstract: Vision-language models have achieved strong results in 2D medical imaging, yet their use in 3D white matter tractography remains largely unexplored. A core challenge is representational: white matter consists of continuous fiber bundles with complex topology that fit poorly into standard volumetric formats. We introduce TractoGraphVLM, a unified framework for tractography-language alignment that compares graph-based and volumetric encoders across multiple vision-language tasks. Using 725 HCP-Aging subjects, we evaluate five encoder architectures (Graph Transformer, GAT, GCN, 3D CNN, and Vision Transformer) on four tasks: bundle classification, text-to-tract retrieval, geometric captioning, and visual question answering. Graph-based representations outperform volumetric ones on all four tasks. The proposed framework reaches 93.1% classification accuracy, 86.2% retrieval Recall@1, a BLEU-4 score of 21.2 for captioning, and 68.5% accuracy for visual question answering. These results indicate that preserving geometric topology through graph encoding is essential for reliable tractography understanding, and they establish TractoGraphVLM as the first strong benchmark for this domain. The source code is available at: https://bit.ly/4iUhNps.
Primary Subject Area: Application: Neuroimaging
Secondary Subject Area: Geometric Deep Learning
Registration Requirement: Yes
Reproducibility: https://bit.ly/4iUhNps
Visa & Travel: Yes
Read CFP & Author Instructions: Yes
Originality Policy: Yes
Single-blind & Not Under Review Elsewhere: Yes
LLM Policy: Yes
Submission Number: 353