Abstract: In high-performance computing (HPC) systems, collecting and replaying communication traces are fundamental approaches to performance analysis. As HPC systems and applications grow in scale, tracing tools can produce huge volumes of trace data that are costly and challenging to store and analyze. Because HPC applications exhibit inherently repetitive behavior, domain-aware data compression methods can effectively reduce the storage cost of trace data. This study proposes LCR (Lossy Compression and Replay), a framework that aggressively compresses and replays MPI communication traces. Unlike existing trace compression methods, which explicitly identify the loop and synchronization structures of communication events, LCR models traces as time series and represents them compactly with lightweight recurrent neural networks. Experimental results demonstrate that, compared with existing structural methods, LCR can further reduce the size of irregular traces by up to three orders of magnitude. Meanwhile, LCR accurately reproduces the performance and communication patterns of the original MPI programs.