Abstract: Annotating large-scale video datasets is resource-intensive. To make this process more efficient, we propose employing pre-trained cross-modal models within the Human-in-the-Loop (HITL) paradigm. We use a synthetic video dataset to generate precise semantic annotations and assess how effectively different label representations capture visual information across diverse vision tasks, both fine- and coarse-grained. We also introduce a framework for automating the extraction of pre-annotations from semantically similar frames. Our approach offers a promising route to efficient video annotation, which is crucial for developing robust Machine Learning (ML) systems.
External IDs: doi:10.1007/978-3-031-80946-0_1