Abstract: Highlights
• A novel text-based person search network is proposed that reduces modal differences while learning sufficient modal features.
• A multi-granularity feature self-optimization module is designed to optimize the multi-scale image modal features and multi-level semantic text modal features, so as to learn more discriminative features while suppressing useless and redundant information.
• A cross-instance feature alignment is proposed to construct image–text feature pairs so that category-level information participates in training.
• Extensive experiments on both the CUHK-PEDES and ICFG-PEDES datasets show that our MAPS achieves state-of-the-art performance, significantly outperforming existing methods.