Keywords: component localization, component detection, object detection, open world object detection
TL;DR: We introduce Know What You See (KWYS), which uses textual knowledge to build a hierarchical component taxonomy that guides open-vocabulary detection, significantly improving product component localization while reducing hallucinations.
Abstract: Many real-world decisions about products (e.g., how they function or how they should be used) depend on their components rather than on the object as a whole. Accurately identifying product components has applications such as automated defect detection, visual spare-parts search, and verified assembly. However, existing object detectors treat components as isolated objects, ignoring their inherent structure. We propose Know What You See (KWYS), which localizes components by grounding them in a textual knowledge base (e.g., manuals or web descriptions). KWYS converts raw text into a hierarchical component taxonomy, which then guides an open-vocabulary object detector through a hierarchical verification algorithm. We evaluate on 1,000 product images across 5 diverse categories, improving component localization accuracy by 11% and reducing component hallucinations by 25%.
Submission Type: Emerging
Copyright Form: pdf
Submission Number: 244