Personalized and Temporal-aware Attention for Efficient Generative Recommendation

16 Sept 2025 (modified: 11 Feb 2026), Submitted to ICLR 2026, CC BY 4.0
Keywords: Sequential Recommendation, Generative Recommendation
Abstract: Recent discoveries of scaling laws in autoregressive and large generative recommendation models have attracted considerable attention in sequential recommendation. However, larger models entail higher inference latency and training costs, while most recommendation applications have stringent latency constraints. Through empirical evaluations of recent self-attention-based recommendation models, we identify two critical issues that impair model performance while increasing computational costs: inadequate modeling of temporal information and KV cache redundancy. In this paper, we propose a novel method, termed Personalized and Temporal-aware Transformer for generative recommendation (PT-Recformer), to mitigate these issues. First, we introduce a Rotary Temporal Encoding method to model the extreme temporal variations inherent in user contexts. Second, we propose a personalized multi-head latent attention mechanism that enhances expressive power by integrating personalized and temporal features while significantly reducing KV cache costs. Comprehensive experiments on several benchmark datasets demonstrate that PT-Recformer consistently outperforms state-of-the-art baselines while requiring only 12% of the KV cache storage of vanilla self-attention-based models. Online A/B testing on a commercial platform validates the effectiveness of PT-Recformer in large-scale industrial settings, yielding a 3.13% increase in effective Cost Per Mille. Our code is available at https://anonymous.4open.science/r/GRec-B386.
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 6715