Abstract: Semantic edge detection (SED) is pivotal for the precise demarcation of object boundaries, yet it continues to be hampered by the prevalence of low-quality labels in current methods. In this paper, we present a novel solution that bolsters SED by encoding both language and image data. Unlike previous language-driven techniques, which rely predominantly on static elements such as dataset labels, our method taps into dynamic language content describing the objects in each image and their interrelations. By encoding this varied input, we generate integrated features that exploit semantic insights to refine the high-level image features and the final mask representations. This markedly improves the quality of these features and elevates SED performance. Experimental evaluation on benchmark datasets, including SBD and Cityscapes, demonstrates the efficacy of our method, which achieves leading ODS F-scores of 79.0 and 76.0, respectively. Our approach constitutes a notable advance in SED by seamlessly integrating multimodal textual information, embracing both its static and dynamic aspects.
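To make the fusion idea in the abstract concrete, the following is a minimal sketch (not the authors' released code) of how a per-image textual description could refine high-level image features before edge/mask prediction. The module name, dimensions, and the channel-wise gating formulation are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class TextGuidedFeatureRefiner(nn.Module):
    """Refine image features with a sentence embedding via channel-wise gating."""

    def __init__(self, text_dim: int = 512, img_channels: int = 256):
        super().__init__()
        # Project the sentence embedding to a per-channel gate and bias.
        self.to_gate = nn.Linear(text_dim, img_channels)
        self.to_bias = nn.Linear(text_dim, img_channels)

    def forward(self, img_feat: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) high-level image features from a backbone
        # text_emb: (B, D) embedding of the image's textual description
        gate = torch.sigmoid(self.to_gate(text_emb))[:, :, None, None]  # (B, C, 1, 1)
        bias = self.to_bias(text_emb)[:, :, None, None]                 # (B, C, 1, 1)
        return img_feat * gate + bias  # semantically modulated features


# Usage with placeholder tensors standing in for real backbone/text-encoder outputs.
refiner = TextGuidedFeatureRefiner(text_dim=512, img_channels=256)
img_feat = torch.randn(2, 256, 64, 64)   # e.g. from an image backbone
text_emb = torch.randn(2, 512)           # e.g. from a frozen text encoder
fused = refiner(img_feat, text_emb)      # same shape as img_feat
```

The refined features would then feed the edge classification head; any cross-attention or other fusion operator could replace the gating shown here.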
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This research contributes to multimedia/multimodal processing by proposing a robust method that fuses textual and visual modalities for Semantic Edge Detection (SED). The cross-modal feature fusion technique advances multimodal processing by effectively leveraging semantics from both sources, yielding considerable improvements in boundary detection and classification. In addition to enhancing model robustness, the incorporation of zero-shot learning enables the detection of categories absent from the training data, which is essential for handling diverse real-world multimedia content. The work's applicability is demonstrated by strong ODS F-scores on benchmark datasets and by its compatibility with advanced edge detectors, promising substantial gains for applications that depend on image processing. By making the code available, the work supports further advances in multimodal processing methodologies and encourages peer scrutiny, improvement, and widespread adoption.
Submission Number: 1242