A Theory-Inspired Framework for Few-Shot Cross-Modal Sketch Person Re-Identification

Yunpeng Gong, Yongjie Hou, Jiangming Shi, KIM LONG DIEP, Min Jiang

Published: 13 Mar 2026, Last Modified: 25 Mar 2026. Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). License: CC BY-NC-ND 4.0

Abstract: Sketch person re-identification aims to match hand-drawn sketches with RGB surveillance images, yet it remains challenging due to severe modality discrepancies and limited labeled data. To address this, we propose KTCAA, a theoretically grounded and interpretable framework for few-shot cross-modal transfer learning. From the perspective of generalization bounds, we identify two controllable factors that are key to minimizing target-domain error: (1) domain discrepancy, which reflects the difficulty of aligning source and target distributions in the feature space; and (2) perturbation invariance, which measures the model’s robustness to cross-modal variations. We design one module for each factor: (1) Alignment Augmentation (AA), which applies localized sketch-style transformations to simulate target-modality distributions, introducing slight but meaningful feature shifts that guide gradual distribution alignment during training; and (2) Knowledge Transfer Catalyst (KTC), which enhances perturbation invariance by generating worst-case adversarial modality perturbations and enforcing output consistency under them. The two modules are jointly optimized within a meta-learning framework that transfers alignment knowledge from RGB-rich domains to sketch scenarios. Experiments on multiple benchmarks demonstrate that KTCAA achieves state-of-the-art performance, particularly under limited-data and cross-domain transfer settings.
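Illustrative sketch (not the authors' implementation): the abstract describes KTC as generating a worst-case perturbation and enforcing output consistency under it. The toy example below shows one common way such an objective can be written, a single signed-gradient step in the spirit of virtual adversarial training, applied to a small linear softmax classifier. The names `predict`, `consistency_loss`, and `adversarial_delta`, the probe direction, the step size `xi`, and the L-infinity budget `eps` are all assumptions made for illustration; the paper's actual perturbation space, model, and loss may differ.

```python
import math

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def predict(W, x):
    """Class probabilities of a toy linear classifier: softmax(W x)."""
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])

def kl(p, q):
    """KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def consistency_loss(W, x, delta):
    """Penalty for changing the prediction when x is shifted by delta."""
    shifted = [v + d for v, d in zip(x, delta)]
    return kl(predict(W, x), predict(W, shifted))

def adversarial_delta(W, x, eps, probe, xi=1e-3):
    """Approximate the most damaging perturbation with max |delta_i| <= eps.

    The KL gradient vanishes at delta = 0, so it is evaluated at a small
    offset xi * probe (hypothetical choice), as in virtual adversarial training.
    """
    p = predict(W, x)
    q = predict(W, [v + xi * d for v, d in zip(x, probe)])
    # For softmax(W x'), the gradient of KL(p || q) w.r.t. x' is W^T (q - p).
    diff = [qj - pj for qj, pj in zip(q, p)]
    grad = [sum(W[j][i] * diff[j] for j in range(len(W))) for i in range(len(x))]
    step = [eps * ((g > 0) - (g < 0)) for g in grad]
    flipped = [-s for s in step]
    # The sign of a near-zero gradient is ambiguous, so keep the worse of the two.
    return max(step, flipped, key=lambda d: consistency_loss(W, x, d))

# Toy usage: two classes, two input features.
W = [[1.0, -1.0], [-1.0, 1.0]]
x = [0.5, 0.2]
delta = adversarial_delta(W, x, eps=0.1, probe=[1.0, 0.0])
loss = consistency_loss(W, x, delta)
```

In a training loop, a term like `loss` would be added to the task objective so that the model is pushed toward predictions that stay stable under the perturbation; how KTCAA weights and combines such a term with AA inside its meta-learning procedure is not specified in the abstract.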