Keywords: hybrid system, computer vision, object detection, open vocabulary detection, vision-language model, mivar, mivar expert system, uav, dangerous situation
TL;DR: The proposed hybrid approach combines an open-vocabulary detector, which localizes previously unseen objects, with the interpretability of mivar expert systems, addressing the limited flexibility of mivar-based systems in dynamic conditions.
Abstract: State-of-the-art computer vision systems today predominantly rely on neural network models. The key advantage of such models is the efficient processing of unstructured data and the automatic extraction of relevant features without manual feature engineering. Classical machine learning methods typically depend on extensive manual feature engineering and preprocessing. While neural network-based computer vision algorithms deliver high performance, their results are difficult to interpret and prone to hallucinations.
By contrast, models built on Multidimensional Information Variable Adaptive Reality (MIVAR) are interpretable: they are expert systems that provide logically sound analysis and interpretable solutions. A key limitation of MIVAR expert systems (MES) is their rigid formalization and inability to process raw perceptual data without prior feature extraction. In recent years, a hybrid approach has become increasingly popular. It combines the advantages of both methods: a neural network extracts features, and mivar rules perform logical inference.
Typically, the neural network detector in hybrid systems has a fixed set of classes, making it inflexible. This is a problem for dynamic scenarios, such as detecting dangerous situations using unmanned aerial vehicles (UAVs) in rescue operations. The emergence of new relevant objects requires labor-intensive retraining of the model.
The paper proposes a new hybrid approach. It combines a mivar expert system with an open-vocabulary object detector. Unlike conventional closed-set detectors, the proposed system leverages textual prompts to embed both visual and textual representations into a shared semantic space. This allows the UAV operator to dynamically direct the model's attention to arbitrary objects not included in the training set.
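The idea of matching image regions to text prompts in a shared semantic space can be illustrated with a minimal sketch. It assumes the detector has already produced one embedding per candidate region and one per prompt; the function names, the 3-dimensional toy vectors, and the 0.5 threshold are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def label_regions(region_embeddings, prompt_embeddings, threshold=0.5):
    """Give each region the best-matching prompt, or None if no prompt
    reaches the similarity threshold (hypothetical value)."""
    labels = []
    for region in region_embeddings:
        best_prompt, best_score = None, threshold
        for prompt, embedding in prompt_embeddings.items():
            score = cosine(region, embedding)
            if score >= best_score:
                best_prompt, best_score = prompt, score
        labels.append(best_prompt)
    return labels


# Toy embeddings, made up for illustration only.
prompts = {"person": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
regions = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
labels = label_regions(regions, prompts)

# The operator adds a new prompt at run time; no retraining is involved.
prompts["fire"] = [0.0, 0.0, 1.0]
labels_after = label_regions(regions, prompts)
```

In this toy setup the third region stays unlabeled until the operator supplies a matching prompt, which is the flexibility a closed-set detector lacks.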
As a result, a hybrid system was developed. It includes an open-vocabulary detector fine-tuned on a combined VisDrone and SARD2 dataset (images of people and vehicles captured from drone flight altitude) and a mivar module with a set of logical rules. The mivar module analyzes the detector's output and decides whether a dangerous situation is present.
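How a rule module might consume detector output can be sketched as follows. This is a plain if-then illustration, not the mivar formalism itself: the `Detection` type, the two rules, the pixel distances, and the score cutoff are all hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    label: str
    score: float
    box: tuple  # (x_min, y_min, x_max, y_max) in pixels


def center(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def near(a, b, max_dist):
    """True if the box centers are at most max_dist pixels apart."""
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5 <= max_dist


# Hypothetical rules: (hazard label, victim label, max distance, message).
RULES = [
    ("fire", "person", 100.0, "person close to fire"),
    ("car", "person", 50.0, "person close to vehicle"),
]


def dangerous_situations(detections, min_score=0.3):
    """Return the message of every rule that fires on confident detections."""
    confident = [d for d in detections if d.score >= min_score]
    findings = []
    for hazard_label, victim_label, max_dist, message in RULES:
        hazards = [d for d in confident if d.label == hazard_label]
        victims = [d for d in confident if d.label == victim_label]
        if any(near(h, v, max_dist) for h in hazards for v in victims):
            findings.append(message)
    return findings


# Made-up detections for a single frame.
frame = [
    Detection("person", 0.8, (100, 100, 120, 140)),
    Detection("fire", 0.6, (150, 100, 200, 150)),
    Detection("car", 0.9, (600, 400, 700, 460)),
]
result = dangerous_situations(frame)
```

Because each finding traces back to a named rule and specific detections, the decision stays inspectable, which is the interpretability benefit the hybrid approach aims for.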
The proposed system retains the advantages of the hybrid approach and expands them by detecting previously unknown objects using text prompts. This makes the approach promising for real-world applications in rapidly changing scenarios.
To evaluate the effectiveness of the proposed approach, a dataset consisting of VisDrone and SARD2 test images was used. The pre-trained model achieved 0.0974 mAP@50 and 0.0673 mAP@50:95. After fine-tuning, performance improved substantially, reaching 0.342 and 0.202 on the corresponding metrics. After additional training, the model not only demonstrated improved vehicle detection accuracy but also exhibited reliable human pose estimation from UAV imagery.
Submission Number: 50