Abstract: In the Visual Internet of Things (VIoT), visual sensors such as surveillance cameras serve as a key component of smart cities, generating large amounts of recorded data in real time. Under this scenario, semantic person retrieval aims to locate a specific person in real-world surveillance images based on a semantic description. Most previous works assume that the semantic description provides enough detail to locate a single target person, namely, "precise person retrieval." However, this assumption cannot be satisfied in many real-world applications, where only fuzzy semantic descriptions are available and a set of targets is expected to be retrieved. As this "fuzzy person retrieval" task has not been deeply explored in prior work, we propose FuzzyPR, a novel and efficient one-stage method. In our work, we perform multihead visual-semantic feature alignment to counter the asymmetry between textual and visual information. To strengthen the model's inference and association abilities during fuzzy retrieval, we design a multigranular semantic retrieval proxy task that improves the associative capacity of the localization module. Experimental results demonstrate that FuzzyPR achieves the best retrieval accuracy and efficiency on the fuzzy semantic retrieval task.
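The abstract does not detail how the multihead visual-semantic feature alignment is implemented. As a rough, hedged illustration only (not FuzzyPR's actual module), one common way to realize such alignment is to project image and text features into several shared subspaces and score their agreement per head; the module name, head count, and dimensions below are all hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAlignment(nn.Module):
    """Illustrative sketch: project each modality into H subspaces and
    score image-text agreement per head (assumed design, not from the paper)."""

    def __init__(self, img_dim: int, txt_dim: int, num_heads: int = 4, head_dim: int = 64):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = head_dim
        # Separate linear projections map each modality into a shared space split across heads.
        self.img_proj = nn.Linear(img_dim, num_heads * head_dim)
        self.txt_proj = nn.Linear(txt_dim, num_heads * head_dim)

    def forward(self, img_feat: torch.Tensor, txt_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, img_dim), txt_feat: (B, txt_dim)
        B = img_feat.size(0)
        img_h = self.img_proj(img_feat).view(B, self.num_heads, self.head_dim)
        txt_h = self.txt_proj(txt_feat).view(B, self.num_heads, self.head_dim)
        # Cosine similarity per head, averaged into one alignment score per pair.
        sim = F.cosine_similarity(img_h, txt_h, dim=-1)  # (B, num_heads)
        return sim.mean(dim=-1)                          # (B,)
```

In such a setup, a contrastive objective over matched and mismatched (image, description) pairs would typically drive the per-head scores apart; whether FuzzyPR trains its alignment this way is not stated in the abstract.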
External IDs: dblp:journals/iotj/SunLRZ25