Advancing Cost Efficiency and Robustness of Machine Learning through the Lens of Data

Published: 01 Jan 2022, Last Modified: 28 Apr 2023, Readers: Everyone
Abstract: ML systems contend with an ever-growing processing load of physical-world data. These systems are required to deliver high-quality learning and decision-making, often under tight resource constraints. This need has driven a proliferation of optimization techniques at the model and implementation levels over the past decades. The model- and implementation-focused nature of these techniques, however, limits their generalizability across application domains and across stages of the ML pipeline where the problem may be just as acute. This dissertation identifies several open problems to which current cost-optimization strategies do not directly apply or are ineffective, and offers theoretically sound and repeatable strategies that maintain practical performance with no discernible loss in quality. These strategies adopt a data-focused view to reduce dependency on the learner: they enhance the cost-effectiveness of ML pipelines by reducing the amount of data to process, and their robustness by supplying domain knowledge in place of robust training data.

First, we focus on hardware efficiency and investigate training with low-precision data representations to accelerate compute-intensive workloads on hardware. Motivated by its breadth of application domains, we concentrate on sparse signal reconstruction problems where compressive sensing can be employed. By lowering the data precision and co-designing the reconstruction algorithm, we show that compressive sensing can be significantly accelerated on hardware such as FPGAs and CPUs with negligible loss of reconstruction quality (a minimal sketch of the idea follows the abstract). We develop theory that analyzes how recovery error scales with bit precision, and empirically demonstrate the benefit of low-precision compressive sensing in real-world applications.

Next, we turn to labor-intensive workloads across the ML pipeline. We specifically focus on the post-training stages, which often encounter a mismatch between the distributions of production and training data and require curation to correct for it. To do so in a labor-efficient manner, we introduce an active model selection strategy for pretrained models, in which the best pretrained model for the downstream task can be found by labeling only a small portion of freshly collected production data (sketched below). We show that such a specialized data sampling strategy can significantly improve label efficiency at the later stages of the ML pipeline by accounting for the production data shift. Closely related to the model selection contribution, we also study oversmoothing in graph neural networks and rigorously characterize the role of architectural differences in terms of graph decomposition (also sketched below).

The final contribution of this thesis is on the ML robustness front, where we improve adversarial robustness using domain knowledge. In particular, we develop a knowledge-enhanced ML pipeline, the first framework that integrates domain knowledge to strengthen the adversarial robustness of ML classifiers against a diverse set of attacks throughout the pipeline. Our framework is generic, efficient, and applicable at different stages of the ML pipeline (a sketch of the core idea appears below). From the perspective of trustworthy ML, we show that domain knowledge, as a robust and tenable proxy for data, can mimic the robust features relating to the prediction variable and provide a defense whose robustness is agnostic to the type of adversary. Finally, we formulate a theoretical foundation that identifies the regime of improvement as a function of the quality of the domain knowledge, and demonstrate its practical performance against a diverse collection of attacks.
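To make the low-precision compressive-sensing idea concrete, here is a minimal, self-contained sketch: it quantizes the measurements of a sparse signal to a given bit width and reconstructs with plain ISTA, so the relative recovery error can be compared across precisions. The quantizer, solver, and all parameter choices are illustrative assumptions, not the dissertation's co-designed algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(v, bits):
    """Uniform quantizer to the given bit width (illustrative choice)."""
    scale = np.max(np.abs(v))
    levels = 2 ** (bits - 1)
    return np.round(v / scale * levels) / levels * scale

def ista(A, y, lam=0.05, iters=500):
    """Iterative soft-thresholding for min ||Ax - y||^2 / 2 + lam * ||x||_1."""
    x = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1 / Lipschitz constant
    for _ in range(iters):
        g = x - step * A.T @ (A @ x - y)        # gradient step on the fit term
        x = np.sign(g) * np.maximum(np.abs(g) - step * lam, 0.0)
    return x

n, m, k = 256, 100, 8                           # signal length, measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m, n)) / np.sqrt(m)    # Gaussian sensing matrix
y = A @ x_true

for bits in (16, 8, 4, 2):
    x_hat = ista(A, quantize(y, bits))
    err = np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)
    print(f"{bits:2d}-bit measurements: relative recovery error {err:.3f}")
```

The point of the printout is qualitative: recovery error degrades gracefully as the measurement precision drops, which is the regime the dissertation's theory characterizes.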
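The active model-selection idea can be illustrated with a toy disagreement-based sampler: given a pool of unlabeled production data and a set of pretrained models, label only the points where the models disagree most and keep running accuracy estimates. This is a hypothetical simplification for exposition, not the dissertation's exact sampling strategy.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_model(models, X_pool, oracle, budget):
    """Estimate each pretrained model's production accuracy from `budget`
    labels, querying the points where the models disagree most (a toy
    disagreement heuristic; the dissertation's strategy is more refined)."""
    preds = np.stack([m(X_pool) for m in models])          # (n_models, n_pool)
    disagreement = np.array([len(set(preds[:, i])) for i in range(preds.shape[1])])
    queried = np.argsort(-disagreement)[:budget]           # most-contested points
    correct = np.zeros(len(models))
    for i in queried:
        correct += (preds[:, i] == oracle(i))              # one label per query
    return int(np.argmax(correct)), correct / budget

# Toy setup: shifted 1-D production data and three fixed threshold classifiers
# standing in for pretrained models (all names here are hypothetical).
X = rng.normal(loc=0.5, scale=1.0, size=500)               # production shift
y_true = (X > 0.4).astype(int)                             # hidden ground truth
models = [lambda X, t=t: (X > t).astype(int) for t in (-0.5, 0.0, 0.4)]
best, acc = select_model(models, X, lambda i: y_true[i], budget=30)
print(f"selected model #{best}, estimated accuracies on queried points: {np.round(acc, 2)}")
```

Concentrating the label budget on contested points is what lets a small number of queries separate the candidate models under the shifted distribution.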
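Oversmoothing itself is easy to reproduce: stacking many propagation steps of a weightless GCN layer makes node representations collapse toward a single direction, which the following sketch measures via mean pairwise cosine similarity on a random graph. The graph construction and the metric are illustrative choices, not the dissertation's decomposition analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

# Random undirected graph with self-loops; symmetric-normalized adjacency,
# i.e. the propagation operator of a GCN layer without learned weights.
n = 50
A = np.triu((rng.random((n, n)) < 0.1).astype(float), 1)
A = A + A.T + np.eye(n)
d = A.sum(axis=1)
A_hat = A / np.sqrt(np.outer(d, d))

def mean_pairwise_cosine(X):
    """Average cosine similarity between node feature vectors (rows)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    return (S.sum() - len(X)) / (len(X) * (len(X) - 1))

X = rng.standard_normal((n, 16))
for layer in range(1, 33):
    X = A_hat @ X                                  # one propagation step
    if layer in (1, 2, 4, 8, 16, 32):
        # As depth grows, rows become near-parallel: oversmoothing.
        print(f"layer {layer:2d}: mean pairwise cosine = {mean_pairwise_cosine(X):.4f}")
```

The similarity climbs toward 1 with depth because repeated multiplication by the normalized adjacency projects features onto its dominant eigenvector, washing out node-level distinctions.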
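Finally, the knowledge-enhanced defense can be caricatured with a label-hierarchy rule: score each fine-grained class jointly with the superclass that domain knowledge assigns it to, so an adversary must fool two predictors in a mutually consistent way. The mapping, class ids, and scoring rule below are hypothetical stand-ins for the framework's knowledge integration.

```python
import numpy as np

# Hypothetical domain-knowledge rule: each fine class has a known superclass
# (e.g. "sedan" and "truck" are both "vehicle"). The ids are illustrative.
FINE_TO_COARSE = {0: 0, 1: 0, 2: 1, 3: 1}

def knowledge_enhanced_predict(fine_probs, coarse_probs):
    """Score each fine class by its own probability weighted by the probability
    of the superclass the knowledge rule maps it to, so predictions that
    violate the hierarchy are downweighted (illustrative sketch)."""
    scores = np.array([fine_probs[f] * coarse_probs[c]
                       for f, c in sorted(FINE_TO_COARSE.items())])
    return int(np.argmax(scores))

# An attack flips the fine classifier from class 0 to class 2, but the coarse
# model still confidently says superclass 0 -- the rule restores class 0.
fine_probs = np.array([0.30, 0.05, 0.60, 0.05])   # adversarially perturbed
coarse_probs = np.array([0.90, 0.10])             # unperturbed knowledge model
print(knowledge_enhanced_predict(fine_probs, coarse_probs))  # -> 0
```

Because the knowledge rule holds regardless of how the attack was generated, this style of defense does not depend on the adversary's type, which is the property the abstract highlights.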