Data Movement Visualized: A Unified Framework for Tracking and Visualizing Data Movements in Heterogeneous Architectures

Published: 2024 · Last Modified: 24 Oct 2024 · PacificVis 2024 · CC BY-SA 4.0
Abstract: Whereas rapidly increasing heterogeneous compute capabilities continue to facilitate further scalability, modern applications are instead often limited by suboptimal data movement, as ever more data must be shipped across different hardware components (i.e., CPUs, GPUs, and other types of accelerators). We posit that understanding and improving data movement in modern use cases require a holistic understanding of the underlying Hardware usage as well as the Communication patterns within the overall context of the Application, or, as we call it, the HAC domain. Collecting and correlating HAC data currently requires interacting with several profiling tools and libraries, resulting in a tedious workflow that is neither scalable nor portable. Furthermore, existing tools for visualizing data-movement profiles focus on these domains individually rather than offering a holistic view. We present a unified framework for tracking and visualizing data-movement trends in large-scale applications deployed on heterogeneous architectures. Our framework has two interoperable components. (1) DMTracker is a lean software layer that provides a simple interface for configurable HAC profiling of GPU-enabled applications and abstracts away the complexity of using several profiling tools, producing a streamlined, time-correlated event history across the HAC domains. (2) DMVis is a web-based tool that combines several linked visualizations to offer holistic visual insight into the runtime behavior and resource utilization of applications, proving pivotal in identifying computationally expensive tasks and data transfers across devices. In this paper, we present the design and prototype of our framework, developed in consultation with domain experts and demonstrated on two case studies, including one on large language model training.
Initial feedback from the experts indicates that the framework improves their usual workflow of observing and tuning performance through better data movement strategies.