Token-Efficient Long-Term Interest Sketching and Internalized Reasoning for LLM-based Recommendation
Keywords: LLM-based Recommendation, Rating Prediction, Efficient Reasoning
TL;DR: We propose an LLM4Rec framework that effectively handles long user histories while enabling efficient inference.
Abstract: Large language models (LLMs) can solve complex real-world tasks when prompted to generate chain-of-thought (CoT) reasoning, motivating their use for preference reasoning in recommender systems. However, applying LLM reasoning to recommendation faces two practical challenges. First, LLMs struggle to reason over long, noisy user histories that often span hundreds of items, while truncating those histories discards the signals needed to capture long-term interests. Second, in decoder-only architectures, CoT requires generating rationale tokens autoregressively, leading to prohibitive inference latency for real-world deployment. To address these challenges, we propose SIREN, a framework that enables effective LLM-based rating prediction via long-term interest sketching and internalized reasoning. First, instead of prompting with raw histories, we build a compact, token-bounded interest sketch that preserves persistent preferences and suppresses noise. Specifically, we encode and cluster item descriptions to discover semantic topics, then compress each user's history into a short list of liked and disliked topics, facilitating LLM reasoning.
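To make the sketching step concrete, the following is a minimal, hypothetical sketch of how such a token-bounded interest sketch could be built. The encoder choice (`SentenceTransformer`), the use of KMeans for topic discovery, the rating thresholds, and all function names are illustrative assumptions, not SIREN's actual implementation.

```python
# Illustrative sketch of long-term interest sketching (assumed, not the paper's code).
import numpy as np
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer

def build_interest_sketch(item_descriptions, user_history, n_topics=50, top_k=5):
    """Compress a user's rated history into short liked/disliked topic lists.

    item_descriptions: dict item_id -> text description
    user_history:      list of (item_id, rating) pairs, ratings on a 1-5 scale
    """
    # 1. Encode item descriptions and cluster them into semantic topics.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder choice
    ids = list(item_descriptions)
    emb = encoder.encode([item_descriptions[i] for i in ids])
    topics = KMeans(n_clusters=n_topics, n_init=10).fit_predict(emb)
    topic_of = dict(zip(ids, topics))

    # 2. Aggregate the user's ratings per topic; ratings near the extremes
    #    are treated as liked/disliked signals, mid-range ratings as noise.
    liked, disliked = np.zeros(n_topics), np.zeros(n_topics)
    for item_id, rating in user_history:
        t = topic_of[item_id]
        if rating >= 4:
            liked[t] += 1
        elif rating <= 2:
            disliked[t] += 1

    # 3. Keep only the strongest signals: the result is token-bounded
    #    regardless of how many items the raw history contains.
    return {
        "liked_topics": np.argsort(-liked)[:top_k].tolist(),
        "disliked_topics": np.argsort(-disliked)[:top_k].tolist(),
    }
```

The short topic lists, rather than hundreds of raw item entries, would then be rendered into the prompt, which is what bounds the input token count.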
Second, we develop an internalized reasoning strategy for efficient inference. We adopt a two-stage training paradigm: (i) train the LLM to reason explicitly for rating prediction with rule-based reinforcement learning, since ground-truth CoTs are unavailable in recommendation; and (ii) internalize the CoT into the model parameters through hidden alignment. At inference, the LLM directly generates the rating with near-CoT quality.
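As a rough illustration of the two training signals, the snippet below sketches (i) a rule-based reward that scores only the final predicted rating, since no ground-truth CoT exists, and (ii) a hidden-alignment loss that pulls the direct-answer hidden state toward the CoT hidden state at the answer position. The output format, the choice of the last layer, and the MSE loss are assumptions for exposition, not the paper's exact objectives.

```python
# Minimal sketch of the two-stage training signals (assumptions for illustration).
import re
import torch
import torch.nn.functional as F

def rating_reward(generated_text: str, true_rating: int) -> float:
    """Stage 1 (rule-based RL): reward depends only on the final rating,
    so no ground-truth CoT is needed. Assumes a 'Rating: <1-5>' output format."""
    m = re.search(r"Rating:\s*([1-5])", generated_text)
    if m is None:
        return -1.0  # malformed output is penalized
    return 1.0 - abs(int(m.group(1)) - true_rating) / 4.0

def hidden_alignment_loss(model, cot_ids, direct_ids,
                          answer_pos_cot, answer_pos_direct):
    """Stage 2 (hidden alignment): align the no-CoT hidden state with the
    CoT hidden state at the answer position, internalizing the rationale.
    Assumes a HuggingFace-style causal LM exposing output_hidden_states."""
    with torch.no_grad():  # the explicit-CoT pass acts as a frozen teacher
        h_cot = model(cot_ids, output_hidden_states=True).hidden_states[-1]
        target = h_cot[:, answer_pos_cot, :]
    h_direct = model(direct_ids, output_hidden_states=True).hidden_states[-1]
    return F.mse_loss(h_direct[:, answer_pos_direct, :], target)
```

Under this reading, stage 1 teaches the model what good explicit reasoning produces, and stage 2 trains it to reach the same answer-time representation without emitting the rationale tokens, which is where the inference-latency savings come from.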
Extensive experiments show that SIREN reduces average input tokens by $48.7\%$ relative to raw-history prompting and outperforms existing methods while delivering over $100\times$ lower inference latency than CoT-based LLM recommenders. Code and data are available at https://anonymous.4open.science/r/LLM4Rec-C7CF.
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 17207