LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification

Yiding Lu; Mouxing Yang; Dezhong Peng; Peng Hu; Yijie Lin; Xi Peng

Published: 01 May 2025, Last Modified: 23 Jul 2025, Venue: ICML 2025 poster, License: CC BY 4.0

Abstract: Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.
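To make the looking-forward idea concrete, here is a minimal toy sketch, not the authors' implementation. It assumes a token-overlap retriever, a gallery of hand-written attribute sets, and candidate questions that each map to one attribute key; all names (`GALLERY`, `CANDIDATES`, `target_rank`, `look_forward_select`) are hypothetical. For each candidate question it simulates the witness's answer from the target's known attributes (available only at training time), re-ranks the gallery, and keeps the question that would place the target highest.

```python
# Toy gallery: person id -> set of "attribute:value" tokens (illustrative data only).
GALLERY = {
    "p1": {"top:plaid", "pants:black", "bag:backpack"},
    "p2": {"top:plaid", "pants:blue", "bag:handbag"},
    "p3": {"top:plaid", "pants:black", "bag:none"},
    "p4": {"top:plaid", "pants:blue", "bag:none"},
}

# Candidate questions, each tied to the attribute key it asks about (an assumption).
CANDIDATES = {
    "What color were his pants?": "pants",
    "Was he carrying a bag?": "bag",
    "What was he wearing on top?": "top",
}


def score(query, attrs):
    """Similarity stand-in: number of query tokens matched by a gallery entry."""
    return len(query & attrs)


def target_rank(query, gallery, target_id):
    """Pessimistic rank of the target: ties with other entries count against it."""
    target_score = score(query, gallery[target_id])
    rivals = sum(
        1
        for pid, attrs in gallery.items()
        if pid != target_id and score(query, attrs) >= target_score
    )
    return 1 + rivals


def look_forward_select(query, candidates, gallery, target_id):
    """Return the (question, rank) whose simulated answer ranks the target best."""
    best_question, best_rank = None, None
    for question, attribute in candidates.items():
        # Simulate the witness's answer from the target's ground-truth attributes.
        answer = {a for a in gallery[target_id] if a.startswith(attribute + ":")}
        rank = target_rank(query | answer, gallery, target_id)
        if best_rank is None or rank < best_rank:
            best_question, best_rank = question, rank
    return best_question, best_rank


initial_query = {"top:plaid"}  # the partial initial description
print(target_rank(initial_query, GALLERY, "p1"))
print(look_forward_select(initial_query, CANDIDATES, GALLERY, "p1"))
```

In this toy setup the selected question would serve as the supervision target for the questioner at that dialogue round; a real system would replace the overlap score with a learned retriever and the attribute lookup with an answer model.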

Lay Summary: Imagine you’re trying to help security staff find someone you saw earlier on a busy street or in a shopping mall. You might say, “He was tall, wearing a plaid shirt, and carrying a bag.” But such descriptions are often vague and incomplete, making it hard for computer systems to identify the person in surveillance footage. Our research proposes a more interactive solution. Instead of relying on a one-time description, our system asks follow-up questions to refine your memory, much like a helpful assistant. It might ask, “What color were his pants?” or “Was he carrying anything else?” This approach helps systems find the right person more accurately and efficiently in real-world settings like malls, transit hubs, or office buildings.

Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.

Link To Code: https://github.com/XLearning-SCU/LLaVA-ReID

Primary Area: Applications->Computer Vision

Keywords: Person Re-Identification, Interactive Retrieval

Submission Number: 249
