Abstract: Studies that apply machine learning models to mobile malware detection commonly rely on behavioral data, such as system calls, extracted from emulators or real devices. However, these studies do not examine whether the choice of data source affects model performance; they implicitly assume that both sources produce similar outputs. We compare the data sets obtained from both sources using statistical techniques, induce learning models from them, and demonstrate the impact of data source selection on the performance of detection models. Our study shows that emulators generate more distinguishable data than real devices, so designers of detection models should pay attention to the data sources used in the various steps of the machine learning workflow.