These are the models we reproduced; some of their parameters are computed. All of these models take the same kind of input: features produced by a feature extractor. Reference links will be annotated in detail in the open-source release.

Among these, the setting n=1 (that is, one token per batch) is a very common choice. First, it keeps things simple, since some methods require equal-length alignment in order to compute the contrastive loss. Second, the npy array provided by APIVR is itself one-dimensional.
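As a rough illustration of the n=1 setting, the sketch below turns a one-dimensional feature vector into a sequence of exactly one token. The helper `to_single_token`, the feature dimension, and the synthetic arrays are all assumptions made for this example; they are not part of the reproduced models or of the APIVR data. In practice the arrays would come from the feature files rather than being generated in place.

```python
import numpy as np


def to_single_token(feature: np.ndarray) -> np.ndarray:
    """Reshape a 1-D feature of shape (dim,) into a (1, dim) sequence, i.e. n=1 token."""
    if feature.ndim != 1:
        raise ValueError(f"expected a 1-D feature vector, got shape {feature.shape}")
    return feature[np.newaxis, :]


# Hypothetical stand-ins for two pre-extracted 1-D features.
dim = 8
feat_a = np.arange(dim, dtype=np.float32)
feat_b = np.ones(dim, dtype=np.float32)

tokens_a = to_single_token(feat_a)  # shape (1, dim)
tokens_b = to_single_token(feat_b)  # shape (1, dim)

# With n=1 every sequence has the same length, so the features can be
# stacked directly, with no padding or truncation needed for alignment.
batch = np.stack([tokens_a, tokens_b])  # shape (2, 1, dim)
print(batch.shape)
```

Because every sequence has length one, any method that needs equal-length inputs on both sides of a loss gets them without extra alignment work.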
