HpT: Hybrid Acceleration of Spatio-Temporal Attention Model Training on Heterogeneous Manycore Architectures

Saiman Dahal, Pratyush Dhingra, Krishu K. Thapa, Partha Pratim Pande, Ananth Kalyanaraman

Published: 2025 · IEEE Trans. Parallel Distributed Syst., 2025 · CC BY-SA 4.0
Abstract: Transformer models have become widely popular in numerous applications, especially for building foundation large language models (LLMs). Recently, there has been a surge in the exploration of transformer-based architectures in non-LLM applications. In particular, the self-attention mechanism within the transformer architecture offers a way to exploit hidden relations within data, making it widely applicable to a variety of spatio-temporal tasks in scientific computing domains (e.g., weather, traffic, agriculture). These efforts have primarily focused on accelerating the inference phase. However, the computational resources required to train these attention-based models for scientific applications remain a significant challenge. Emerging non-volatile memory (NVM)-based processing-in-memory (PIM) architectures can achieve higher performance and better energy efficiency than their GPU-based counterparts. However, the frequent weight updates during training necessitate write operations to NVM cells, posing a significant barrier to adopting stand-alone NVM-based PIM architectures. In this paper, we present HpT, a new hybrid approach to accelerate the training of attention-based models for scientific applications. Our approach is hybrid at two different layers: at the software layer, it dynamically switches from a full-parameter training mode to a lower-parameter training mode by incorporating intrinsic dimensionality; and at the hardware layer, it harnesses the combined power of GPUs, resistive random-access memory (ReRAM)-based PIM devices, and systolic arrays. This software-hardware co-design approach adaptively reduces both runtime and energy costs during the training phase without compromising quality. Experiments on four concrete real-world scientific applications demonstrate that our hybrid approach significantly reduces training time (up to $11.9\times$) and energy consumption (up to $12.05\times$) compared to the corresponding full-parameter training executing only on GPUs. Our approach serves as an example for accelerating the training of attention-based models on heterogeneous platforms including ReRAMs.
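The software-layer idea described in the abstract, dynamically dropping from full-parameter training into a lower-parameter mode based on intrinsic dimensionality, can be illustrated with a short PyTorch sketch. This is not the authors' implementation: the `LowRankAdapter` class, the loss-plateau switching heuristic, and all parameter names below are illustrative assumptions, shown only to make the mode-switching concept concrete.

```python
# Minimal sketch (assumed, not from the paper): train in full-parameter mode,
# then switch to a lower-parameter mode where each linear layer is frozen and
# only a small low-rank update in an intrinsic-dimensional subspace is trained.
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update W x + B A x."""
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze full-parameter weights
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.randn(base.out_features, rank) * 0.01)

    def forward(self, x):
        return self.base(x) + x @ self.A.t() @ self.B.t()

def replace_linears_with_adapters(model: nn.Module, rank: int = 8) -> None:
    """Recursively wrap every nn.Linear in a LowRankAdapter."""
    for name, module in list(model.named_children()):
        if isinstance(module, nn.Linear):
            setattr(model, name, LowRankAdapter(module, rank))
        else:
            replace_linears_with_adapters(module, rank)

def maybe_switch_to_low_rank(model, losses, window=5, tol=1e-3, rank=8):
    """Heuristic mode switch: if the loss has plateaued over the last
    `window` epochs, move the model into lower-parameter training."""
    if len(losses) < window or max(losses[-window:]) - min(losses[-window:]) > tol:
        return False  # stay in full-parameter mode
    replace_linears_with_adapters(model, rank)
    return True
```

Usage (illustrative): after each epoch, append the validation loss to a history list and call `maybe_switch_to_low_rank(model, loss_history)`; once it returns True, rebuild the optimizer over only the trainable adapter parameters, e.g. `torch.optim.Adam(p for p in model.parameters() if p.requires_grad)`, so subsequent updates touch only the small low-rank weights.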