Distilling Cross-Domain Knowledge for Person Re-ID by Aligning Any Pretrained Encoder with CLIP Textual Features

Pengfei Li; Li Sun; Qingli Li

27 Sept 2024 (modified: 14 Nov 2024), ICLR 2025 Conference Withdrawn Submission, CC BY 4.0

Keywords: CLIP; Person ReID; Knowledge Distillation

Abstract: Based on the alignment of image-text pairs, CLIP has demonstrated superior performance across various tasks, even in a zero-shot setting. In person ReID, CLIP-based models achieve state-of-the-art results without requiring explicit text descriptions for further fine-tuning. However, most previous models are initialized with weights from ImageNet or self-supervised pretraining and therefore lack cross-domain knowledge spanning the image and text domains. This paper introduces a novel approach that aligns a student model pretrained purely in the image domain with CLIP textual features, distilling cross-domain knowledge from an existing CLIP-ReID teacher into the online student model. To leverage CLIP’s textual features for each ID, we address the challenge of mismatched feature dimensions between the teacher and the student. A trainable adapter is inserted on the student side to match dimensions while preserving the prior knowledge within the pretrained student. When the student encoder yields features whose dimension is no higher than the teacher's, the adapter is initialized as an identity matrix, and offline PCA is applied on the teacher side for dimensionality reduction: the PCA eigenvectors are computed from all training images and then applied to the existing text features so that they match the student. When the student's output dimension exceeds the teacher's, the adapter is instead initialized with eigenvectors computed on the student side, which retains the knowledge in the pretrained student model. After dimension alignment, the text features for each ID are compared with the online image features to obtain cross-domain similarities, which are constrained to mimic the teacher's through a KL-divergence loss. Experiments with different pretrained encoder structures demonstrate the effectiveness of this approach, which is also compatible with relational knowledge distillation for further performance gains.
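The dimension alignment and distillation loss described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the random stand-in features, the temperature of 0.07, the cosine-similarity logits, and the choice to project text features without mean-centering are all assumptions made for illustration, and a real adapter would be a trainable layer rather than a fixed matrix.

```python
import numpy as np

def pca_projection(feats, out_dim):
    """Top `out_dim` covariance eigenvectors of `feats`, shape (in_dim, out_dim)."""
    centered = feats - feats.mean(axis=0)
    cov = centered.T @ centered / (len(feats) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:out_dim]     # keep the largest ones
    return eigvecs[:, order]

def init_adapter(student_feats, teacher_dim):
    """Hypothetical adapter initialisation following the two cases in the abstract."""
    student_dim = student_feats.shape[1]
    if student_dim <= teacher_dim:
        return np.eye(student_dim)                  # identity: student features unchanged
    return pca_projection(student_feats, teacher_dim)  # student-side eigenvectors

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def similarity_logits(image_feats, text_feats, temperature=0.07):
    """Cosine similarity between each image and each ID's text feature (assumed form)."""
    return l2_normalize(image_feats) @ l2_normalize(text_feats).T / temperature

def log_softmax(logits):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over images, taken across the ID axis."""
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))

# Random stand-ins for real features (assumption: 64 images, 8 IDs).
rng = np.random.default_rng(0)
num_images, num_ids, teacher_dim, student_dim = 64, 8, 16, 6
teacher_img = rng.normal(size=(num_images, teacher_dim))
teacher_txt = rng.normal(size=(num_ids, teacher_dim))
student_img = rng.normal(size=(num_images, student_dim))

# Case 1: student dimension <= teacher dimension.
adapter = init_adapter(student_img, teacher_dim)        # identity matrix
pca_basis = pca_projection(teacher_img, student_dim)    # from teacher image features
text_reduced = teacher_txt @ pca_basis                  # text features in student dim
student_adapted = student_img @ adapter

teacher_logits = similarity_logits(teacher_img, teacher_txt)
student_logits = similarity_logits(student_adapted, text_reduced)
loss = kl_distillation_loss(teacher_logits, student_logits)

# Case 2: student dimension > teacher dimension.
wide_student_img = rng.normal(size=(num_images, 32))
wide_adapter = init_adapter(wide_student_img, teacher_dim)  # shape (32, 16)
```

In the first case the identity adapter leaves the pretrained student features untouched at the start of training, while in the second the eigenvector initialisation maps the wider student features down to the teacher's dimension, so the text features need no reduction.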

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 10488
