Optimizing Barrier Synchronization on ARMv8 Many-Core Architectures

Published: 2021, Last Modified: 17 Jul 2025, Venue: CLUSTER 2021, License: CC BY-SA 4.0
Abstract: Synchronization operations are common in OpenMP programs, where a parallel construct often works with an explicit or implicit barrier operation. While OpenMP synchronization has been extensively studied on traditional x86 CPU architectures, there is little work on understanding OpenMP barrier synchronization operations on ARMv8 high-performance many-cores. This paper presents the first comprehensive performance study of OpenMP barrier implementations on emerging ARMv8-based many-cores. We evaluate seven representative barrier algorithms on three distinct ARMv8 architectures: Phytium 2000+, ThunderX2, and Kunpeng920. We empirically show that the existing synchronization implementations exhibit poor scalability on ARMv8 architectures compared to their x86 counterparts. We then propose various optimization strategies for improving these widely used synchronization algorithms on each platform. We show that our optimizations yield a 12.6x performance improvement over the GCC implementation and a 4.7x improvement over the LLVM implementation, translating to a 1.6x improvement over the state-of-the-art best-performing algorithm. We share our experience and practical insights on optimizing OpenMP synchronization operations on emerging ARMv8 multi-core CPU architectures.
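For readers unfamiliar with what a barrier algorithm looks like, the following is a minimal sketch of a centralized sense-reversing barrier, one classic textbook design. The abstract does not name the seven algorithms evaluated, so this is illustrative only and is not claimed to be the paper's implementation or any of its optimizations. The names `sr_barrier_t`, `sr_barrier_wait`, `run_barrier_demo`, and the constants `NUM_THREADS` and `NUM_ROUNDS` are hypothetical. It uses C11 atomics and POSIX threads rather than OpenMP, and it yields the CPU while waiting so that the demo also behaves on machines with few cores.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_THREADS 4    /* hypothetical thread count for the demo */
#define NUM_ROUNDS  200  /* hypothetical number of barrier episodes */

typedef struct {
    atomic_int  remaining;  /* threads that have not yet arrived */
    atomic_bool sense;      /* global sense, flipped once per episode */
    int         total;      /* number of participating threads */
} sr_barrier_t;

static void sr_barrier_init(sr_barrier_t *b, int total) {
    assert(total > 0);
    atomic_init(&b->remaining, total);
    atomic_init(&b->sense, false);
    b->total = total;
}

/* Each thread keeps its own local_sense, initially false. */
static void sr_barrier_wait(sr_barrier_t *b, bool *local_sense) {
    *local_sense = !*local_sense;
    if (atomic_fetch_sub_explicit(&b->remaining, 1,
                                  memory_order_acq_rel) == 1) {
        /* Last arriver: reset the counter, then release the waiters. */
        atomic_store_explicit(&b->remaining, b->total,
                              memory_order_relaxed);
        atomic_store_explicit(&b->sense, *local_sense,
                              memory_order_release);
    } else {
        /* Wait until the global sense matches this episode's sense. */
        while (atomic_load_explicit(&b->sense, memory_order_acquire)
               != *local_sense) {
            sched_yield();  /* demo choice; real barriers often spin */
        }
    }
}

static sr_barrier_t g_barrier;
static atomic_int   g_counter;
static atomic_int   g_errors;

static void *worker(void *arg) {
    (void)arg;
    bool local_sense = false;
    for (int round = 1; round <= NUM_ROUNDS; ++round) {
        atomic_fetch_add(&g_counter, 1);
        sr_barrier_wait(&g_barrier, &local_sense);
        /* After the barrier, every thread's increment must be visible. */
        if (atomic_load(&g_counter) != round * NUM_THREADS) {
            atomic_fetch_add(&g_errors, 1);
        }
        /* Second barrier keeps fast threads from starting the next round
           before slower ones have checked the counter. */
        sr_barrier_wait(&g_barrier, &local_sense);
    }
    return NULL;
}

/* Runs the demo and returns the number of mismatches observed. */
int run_barrier_demo(void) {
    pthread_t threads[NUM_THREADS];
    sr_barrier_init(&g_barrier, NUM_THREADS);
    atomic_store(&g_counter, 0);
    atomic_store(&g_errors, 0);
    for (int i = 0; i < NUM_THREADS; ++i) {
        pthread_create(&threads[i], NULL, worker, NULL);
    }
    for (int i = 0; i < NUM_THREADS; ++i) {
        pthread_join(threads[i], NULL);
    }
    return atomic_load(&g_errors);
}
```

In this design every waiting thread polls the single shared `sense` flag and every arrival updates the single shared `remaining` counter. Contention on shared state of this kind is the sort of behaviour a scalability study on many-core processors would examine.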