Abstract: Improving single-thread performance remains a central design goal for general-purpose processors. Microarchitectural designs for the core have reached a plateau in recent years, yet we are still far from exhausting the implicit parallelism available in today's programs. One approach is to use a separate thread context to improve data and instruction supply to the main pipeline. Such decoupled look-ahead (DLA) architectures have been shown to be an effective way to improve single-thread performance. However, a default implementation requires an additional core. While an SMT flavor is possible, a naive implementation is inefficient and thus slow. In this paper, we propose an optimized implementation called Bootstrapping that makes DLA just as effective on a single (SMT) core as on two cores. While fusing two cores can improve single-thread performance by 1.22x, Bootstrapping provides a speedup of 1.48x over a broad range of benchmark suites, making it a compelling feature for general-purpose microarchitectures.