Abstract: This paper presents the design, implementation, and application
of TALP, a lightweight, portable, extensible, and scalable tool for
online parallel performance measurement. The efficiency metrics
reported by TALP allow HPC users to evaluate the parallel efficiency
of their executions, both post-mortem and at runtime. The
API that TALP provides allows the running application or resource
managers to collect performance metrics at runtime. This makes it
possible to adapt the execution based on the metrics collected
dynamically. The set of metrics collected by TALP is well
defined, independent of the tool, and consolidated. We extend this
set with two additional metrics that distinguish load imbalance
originating within a node (intranode) from load imbalance
originating across nodes (internode). We evaluate the potential of
TALP with three parallel applications that exhibit different
parallel performance issues, and we carefully analyze the overhead
TALP introduces to determine its limitations.