Abstract: Highlights
• EVA-02, a plain Transformer-based visual representation, demonstrates superior performance in various vision tasks.
• EVA-02 reduces model size through robust optimization, advanced activation functions, and position embedding.
• EVA-02 achieves 90.0 fine-tuning top-1 accuracy on ImageNet-1K with only 304 M parameters.
• EVA-02-CLIP outperforms the best open-sourced CLIP in zero-shot ImageNet-1K classification, using less training data.