Abstract: Highlights
• Low-dimensional representations are used to guide the original space in expanding its field of view.
• Input tokens are represented as waves and aggregated dynamically based on local maxima.
• Representations are fused by running a CNN and a Vision MLP in parallel.
• The proposed algorithm outperforms other popular methods.