Abstract: Research highlights
• We propose MVFormer for diverse feature learning via token mixers and normalization.
• The MVN combines three types of normalization, reflecting diverse feature distributions.
• The MVTM enables stage specificity by diversifying receptive fields per stage.
• Adopting the MVN and MVTM together enhances the capacity for diverse viewpoints.
• MVFormer surpasses existing convolution-based ViTs on the ImageNet-1K benchmark.
External IDs: dblp:journals/prl/BaeKCK25