Efficient Masked Attention Transformer for Few-Shot Classification and Segmentation

Dustin Carrión-Ojeda; Stefan Roth; Simone Schaub-Meyer

Published: 27 Aug 2025, Last Modified: 01 Oct 2025. Venue: LIMIT 2025 Poster. License: CC BY 4.0.

Keywords: Few-shot learning, Segmentation, Classification, Efficiency

TL;DR: EMAT improves few-shot classification and segmentation, especially for small objects, while using at least four times fewer trainable parameters than existing methods. It supports N-way K-shot tasks and correctly outputs empty masks when no target is present.

Abstract: Few-shot classification and segmentation (FS-CS) focuses on jointly performing multi-label classification and multi-class segmentation using few annotated examples. Although the current state of the art (SOTA) achieves high accuracy in both tasks, it struggles with small objects. To overcome this, we propose the **E**fficient **M**asked **A**ttention **T**ransformer (EMAT), which improves classification and segmentation accuracy, especially for small objects. EMAT introduces three modifications: a novel memory-efficient masked attention mechanism, a learnable downscaling strategy, and parameter-efficiency enhancements. EMAT outperforms all FS-CS methods on the PASCAL-$5^i$ and COCO-$20^i$ datasets, using at least four times fewer trainable parameters.

Submission Number: 11
