Abstract: Sketch person re-identification aims to match hand-drawn sketches with RGB surveillance images, yet remains challenging due to severe modality discrepancies and limited labeled data. To address this, we propose KTCAA, a theoretically grounded and interpretable framework for few-shot cross-modal transfer learning. From the perspective of generalization bounds, we identify two controllable factors that are essential to minimizing target-domain error: (1) domain discrepancy, which reflects the difficulty of aligning source and target distributions in the feature space; and (2) perturbation invariance, which measures the model’s robustness to cross-modal variations. To control these factors, we design two corresponding modules: (1) Alignment Augmentation (AA), which applies localized sketch-style transformations to simulate the target modality distribution, introducing slight but meaningful feature shifts that guide gradual distribution alignment during training; and (2) Knowledge Transfer Catalyst (KTC), which enhances perturbation invariance by generating worst-case adversarial modality perturbations and enforcing output consistency under them. The two modules are jointly optimized within a meta-learning framework that transfers alignment knowledge from data-rich RGB domains to sketch scenarios. Experiments on multiple benchmarks demonstrate that KTCAA achieves state-of-the-art performance, particularly under limited-data and cross-domain transfer settings.
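To make the idea behind KTC concrete, the following is a minimal illustrative sketch, not the method's actual implementation. It assumes a toy linear embedding `W`, an L2-bounded perturbation budget `eps`, and a squared-error consistency loss; the function names `worst_case_perturbation` and `consistency_loss` are hypothetical. For a linear map, the perturbation that maximizes the output change has a closed form (the top right-singular vector of `W`), which stands in here for the adversarial search a deep network would require.

```python
import numpy as np

def worst_case_perturbation(W, eps):
    """Return the L2-bounded delta (norm eps) that maximizes ||W(x+delta) - Wx||^2.

    Assumption: the embedding is linear, so the objective reduces to
    ||W delta||^2, which is maximized by the top right-singular vector of W.
    """
    _, _, vt = np.linalg.svd(W, full_matrices=False)
    return eps * vt[0]

def consistency_loss(W, x, delta):
    """Squared distance between the clean and perturbed embeddings."""
    clean = W @ x
    perturbed = W @ (x + delta)
    return float(np.sum((perturbed - clean) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))   # toy embedding: 8-dim input, 4-dim feature
x = rng.normal(size=8)        # toy input feature vector
eps = 0.1                     # perturbation budget (hypothetical value)

# Worst-case perturbation versus a random one of the same norm.
delta_adv = worst_case_perturbation(W, eps)
rand = rng.normal(size=8)
delta_rand = eps * rand / np.linalg.norm(rand)

loss_adv = consistency_loss(W, x, delta_adv)
loss_rand = consistency_loss(W, x, delta_rand)
print(f"adversarial: {loss_adv:.6f}  random: {loss_rand:.6f}")
```

Training to reduce `loss_adv` rather than `loss_rand` targets the direction in which the embedding is most sensitive, which is the intuition behind enforcing consistency under worst-case rather than random perturbations.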