Multi-objective Optimizations in Geo-Distributed Data Analytics Systems

Published: 2017, Last Modified: 21 Jan 2026, ICPADS 2017, CC BY-SA 4.0
Abstract: Data analytics systems have recently been developed and optimized for geographically distributed data centers. To meet system operators' varied requirements on data analytics, existing studies have optimized these systems for individual goals such as resource efficiency, per-job latency, and fairness. However, optimizing for multiple objectives simultaneously has been overlooked. Worse, some objectives translate into conflicting actions, and the relationships among them can be affected by the unique features of geo-distributed data analytics systems. For example, we have observed a clear trade-off between fairness and resource efficiency. In this paper, we develop an efficient framework for multi-objective optimization in geo-distributed data analytics systems. Specifically, we develop GeoSpark, an extension to Spark that automatically performs multi-objective optimization according to the system operators' preferences among the objectives. Multi-objective optimization is inherently intractable, especially for large-scale workloads. We therefore propose an efficient online heuristic that approximates the optimal scheduling plan while achieving a lower-bound guarantee in the worst case. An evaluation on a synthetic workload shows that GeoSpark effectively performs multi-objective optimization based on the operators' preferences. Compared with the existing schedulers in Apache Spark in the geo-distributed setting, GeoSpark achieves up to 30% makespan reduction, 28% job latency reduction, and a better fairness guarantee.
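To make the idea of preference-driven multi-objective optimization concrete, the sketch below uses weighted-sum scalarization, a common generic technique. The abstract does not say how GeoSpark combines objectives, so this is an illustration under stated assumptions and not the paper's method: the `Plan` type, the three candidate plans, their metric values, and the weight vectors are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate scheduling plan; all metrics are lower-is-better."""
    name: str
    makespan: float     # total completion time, in seconds
    avg_latency: float  # mean per-job latency, in seconds
    unfairness: float   # e.g. 1 - Jain's fairness index


def normalize(values):
    """Min-max scale values to [0, 1] so objectives with different units are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def pick_plan(plans, weights):
    """Return the plan with the lowest weighted sum of normalized objectives."""
    columns = {
        "makespan": normalize([p.makespan for p in plans]),
        "avg_latency": normalize([p.avg_latency for p in plans]),
        "unfairness": normalize([p.unfairness for p in plans]),
    }
    scores = [
        sum(weights[key] * columns[key][i] for key in columns)
        for i in range(len(plans))
    ]
    best = min(range(len(plans)), key=scores.__getitem__)
    return plans[best]


# Made-up numbers chosen to show a fairness vs. efficiency trade-off.
PLANS = [
    Plan("pack", makespan=100.0, avg_latency=60.0, unfairness=0.40),
    Plan("fair", makespan=130.0, avg_latency=55.0, unfairness=0.05),
    Plan("balanced", makespan=110.0, avg_latency=50.0, unfairness=0.20),
]

EFFICIENCY_FIRST = {"makespan": 0.8, "avg_latency": 0.1, "unfairness": 0.1}
FAIRNESS_FIRST = {"makespan": 0.1, "avg_latency": 0.1, "unfairness": 0.8}

print(pick_plan(PLANS, EFFICIENCY_FIRST).name)
print(pick_plan(PLANS, FAIRNESS_FIRST).name)
```

With these made-up numbers, an efficiency-heavy weight vector selects the plan with the shortest makespan, while a fairness-heavy one selects the fairest plan. Enumerating candidate plans like this does not scale to large workloads, which is consistent with the abstract's motivation for an online heuristic.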