SimGroupAttn: Similarity-Guided Group Attention for Vision Transformer to Incorporate Population Information in Plant Disease Detection

Published: 05 Nov 2025, Last Modified: 05 Nov 2025 · NLDL 2026 Spotlight · CC BY 4.0
Keywords: Vision Transformer (ViT), Attention Mechanisms, Masked Image Modeling (MIM), Cross-Image Attention, Representation Learning, Plant Disease Detection, Agricultural AI, Population-Level Context
TL;DR: This paper introduces SimGroupAttn, a novel Vision Transformer attention mechanism that leverages cross-image similarity to incorporate population-level context, improving representation learning and classification for plant disease detection.
Abstract: In this paper, we address a limitation of Vision Transformer (ViT) models: their attention is restricted to within a single image, which prevents them from leveraging cross-sample information. This limitation is particularly relevant for agricultural tasks such as plant disease detection, an important challenge in agriculture where early and reliable diagnosis helps protect yields and food security. Existing methods often fail to capture subtle or overlapping symptoms that only become evident when considered in a population context. Our approach, $\textit{SimGroupAttn}$, extends masked image modeling by enabling image patches to attend not only within their own image but also to similar regions across other images in the same batch. Guided by a cosine similarity score that is trained jointly with the model weights, $\textit{SimGroupAttn}$ incorporates population-level context into the learned representations, making them more robust and discriminative. Extensive experiments on the PlantPathology dataset demonstrate that our approach outperforms Simple Masked Image Modeling (SimMIM) and Masked Autoencoders (MAE) in linear probing and classification tasks. It improves top-1 accuracy by up to 6.5\% in linear probing for complex classes and by 3.5\% in classification compared with the best baseline model under the same settings.
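The core idea of the abstract can be illustrated with a minimal sketch: each patch attends over all patches in the batch, with attention weights derived from cosine similarity, so representations absorb population-level context. This is an illustrative simplification, not the paper's implementation; the function name `simgroup_attn`, the temperature `tau`, and the use of raw cosine similarity in place of the jointly trained similarity score are assumptions.

```python
import numpy as np

def simgroup_attn(x, tau=1.0):
    """Illustrative sketch of similarity-guided cross-image attention.

    x: (B, N, D) array of patch embeddings for a batch of B images,
       each with N patches of dimension D.
    Each patch attends over all B*N patches in the batch; weights come
    from a softmax over cosine similarities (a stand-in for the paper's
    learned similarity score).
    """
    B, N, D = x.shape
    flat = x.reshape(B * N, D)
    # Cosine similarity between every pair of patches across the batch.
    unit = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    sim = unit @ unit.T                       # (B*N, B*N)
    # Softmax over all batch patches -> cross-image attention weights.
    w = np.exp(sim / tau)
    w /= w.sum(axis=1, keepdims=True)
    # Population-aware patch features: a similarity-weighted mixture
    # over patches from the whole batch, not just the patch's own image.
    out = w @ flat
    return out.reshape(B, N, D)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 32))          # 4 images, 16 patches, dim 32
y = simgroup_attn(x)
print(y.shape)                                # same shape as the input
```

In the actual method, this similarity guidance is integrated into a ViT trained with masked image modeling, so the similarity score and the backbone weights are optimized jointly rather than fixed as above.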
Serve As Reviewer: ~Ribana_Roscher2
Submission Number: 30