Abstract: Highlights
• Vision transformers suffer from feature collapsing in deeper layers.
• Residual attention counteracts feature collapsing.
• Vision transformers with residual attention learn better representations.
• Residual attention improves ViT performance on visual recognition tasks.
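A minimal sketch of one common form of residual attention, assuming it means adding the previous layer's pre-softmax attention scores to the current layer's (RealFormer-style); the paper's exact formulation, scaling, and placement of the residual may differ. The class name `ResidualAttention` and the `prev_scores` parameter are illustrative, not from the paper.

```python
import torch
import torch.nn as nn


class ResidualAttention(nn.Module):
    """Self-attention with a residual connection over attention scores.

    Hypothetical sketch: reusing earlier layers' attention maps is one way
    to keep token interactions diverse and counteract feature collapsing
    in deeper transformer layers.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, prev_scores: torch.Tensor | None = None):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)        # each: (B, heads, N, head_dim)
        scores = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        if prev_scores is not None:
            # Residual over attention maps: add the previous layer's
            # pre-softmax scores before normalizing.
            scores = scores + prev_scores
        attn = scores.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out), scores  # scores are passed to the next layer


# Usage: chain blocks, feeding each layer's scores into the next.
blocks = [ResidualAttention(dim=64) for _ in range(4)]
x, scores = torch.randn(2, 16, 64), None
for blk in blocks:
    x, scores = blk(x, scores)
```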
DOI: 10.1016/j.patcog.2024.110853