Abstract: Highlights
- Vision transformers suffer from feature collapsing in deeper layers.
- Residual attention counteracts feature collapsing.
- Vision transformers with residual attention learn better representations.
- Residual attention improves the ViT's performance in visual recognition tasks.