['3c3', "< Abstract: Sequential recommender systems (SRS) aim to predict users' subsequent choices based on their historical interactions and have found applications in diverse fields such as e-commerce and social media. However, in real-world systems, most users interact with only a handful of items, while the majority of items are seldom consumed. These two issues, known as the long-tail user and long-tail item challenges, often pose difficulties for existing SRS. These challenges can adversely affect user experience and seller benefits, making them crucial to address. Though a few works have addressed the challenges, they still struggle with the seesaw or noisy issues due to the intrinsic scarcity of interactions. The advancements in large language models (LLMs) present a promising solution to these problems from a semantic perspective. As one of the pioneers in this field, we propose the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR). This framework utilizes semantic embeddings derived from LLMs to enhance SRS without adding extra inference load from LLMs. To address the long-tail item challenge, we design a dual-view modeling framework that combines semantics from LLMs and collaborative signals from conventional SRS. For the long-tail user challenge, we propose a retrieval augmented self-distillation method to enhance user preference representation using more informative interactions from similar users. To verify the effectiveness and versatility of our proposed enhancement framework, we conduct extensive experiments on three real-world datasets using three popular SRS models. The results show that our method surpasses existing baselines consistently, and benefits long-tail users and items especially. 
The implementation code is available at https://github.", '---', '> Abstract: Sequential Recommender Systems (SRS) are critical for predicting user preferences based on historical interactions, with broad applications in e-commerce and social media. However, they frequently encounter two significant challenges: the long-tail user problem (users with limited interactions) and the long-tail item problem (infrequently consumed items). These issues severely impact user experience and business metrics due to interaction sparsity, leading to suboptimal recommendations and the "seesaw" effect in existing solutions. Leveraging the powerful semantic understanding capabilities of Large Language Models (LLMs), we propose the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR). LLM-ESR effectively mitigates these long-tail challenges by integrating pre-cached LLM-derived semantic embeddings, thereby incurring no additional inference load. Specifically, for long-tail items, we introduce a novel dual-view modeling framework that synergistically combines rich LLM semantics with traditional collaborative signals. For long-tail users, we develop a retrieval-augmented self-distillation method to enrich user preference representations by transferring knowledge from similar, more informative user interactions. Through extensive experiments on three real-world datasets and three popular SRS backbone models, LLM-ESR consistently outperforms state-of-the-art baselines, demonstrating superior effectiveness and versatility, particularly for long-tail scenarios. The implementation code is available at https://github.com/Applied-Machine-Learning-Lab/LLM-ESR.', '6,9c6,9', '< The objective of sequential recommendation is to predict the next likely item for users based on their historical records [7,54]. 
Owing to its wide-ranging applicability in various domains such as ecommerce [48] and social media [5], sequential recommendation has garnered considerable attention in recent years. Given that the essence of sequential recommendation revolves around extracting user preferences from their interaction records, several innovative architectures have been proposed. For  to note that the proposed framework is model-agnostic, allowing it to be adapted to any sequential recommendation model. The contributions of this paper are as follows:', '< • We propose a large language models enhancement framework, which can alleviate both long-tail user and item challenges for SRS by introducing semantic information from LLMs.', '< • To avoid the inference burden of LLMs, we design an embedding-based enhancement method. Besides, the derived embeddings are utilized directly to retain the original semantic relations.', '< • We conduct extensive experiments on three real-world datasets with three backbone SRS models to validate the effectiveness and flexibility of LLM-ESR.', '---', '> The objective of sequential recommendation is to predict the next likely item for users based on their historical records [7,54]. Owing to its wide-ranging applicability in various domains such as ecommerce [48] and social media [5], sequential recommendation has garnered considerable attention in recent years. Given that the essence of sequential recommendation revolves around extracting user preferences from their interaction records, several innovative architectures have been proposed. Notably, the proposed framework is model-agnostic, allowing it to be adapted to any sequential recommendation model. 
The contributions of this paper are as follows:', '> • We introduce LLM-ESR, a novel large language model enhancement framework that effectively alleviates both long-tail user and item challenges in SRS through a dual-view modeling approach and retrieval-augmented self-distillation, by integrating rich semantic information from LLMs.', '> • We design an efficient embedding-based enhancement method that pre-caches LLM-derived semantic embeddings, thereby avoiding additional inference burden and uniquely preserving the original semantic relations by freezing the semantic embedding layer.', '> • We conduct extensive experiments on three real-world datasets with three popular SRS backbone models, demonstrating the superior effectiveness and broad flexibility of LLM-ESR compared to existing baselines, especially for long-tail scenarios.', '26c26', '< Semantic-view Modeling. In general, the attributes and descriptions of items contain abundant semantics. To utilize the powerful semantic understanding abilities of LLMs, we organize the attributes and descriptions into textual prompts (the template of prompts can be found in Appendix A.1). Then, in avoid of possible inference burden brought by LLMs, we cache the embeddings derived from LLMs for usage. In specific, the embeddings can be obtained by taking out the last hidden state of open-sourced LLMs, such as LLaMA [51], or the public API, such as text-embedding-ada-002 2 .', '---', '> Semantic-view Modeling. In general, the attributes and descriptions of items contain abundant semantics. To utilize the powerful semantic understanding abilities of LLMs, we organize the attributes and descriptions into textual prompts (the template of prompts can be found in Appendix A.1). To circumvent the potential inference burden imposed by LLMs, we pre-cache the embeddings derived from LLMs for efficient usage. 
Specifically, the embeddings can be obtained by taking the last hidden state of open-source LLMs, such as LLaMA [51], or from a public API, such as text-embedding-ada-002 2 .', '58c58', '< Train. Based on the illustration in Section 3.2 and Section 3.3, we only update the collaborative embedding layer, adapter, cross-attention and sequence encoder during the training, while freezing the semantic embedding layer and semantic user base. Since the original LLMs embeddings E se and U llm are frozen, the original semantic relations get preserved well. The training loss for optimization is the combination of pairwise ranking loss and self-distillation loss, which can be written as follows:', '---', '> Train. As described in Section 3.2 and Section 3.3, we update only the collaborative embedding layer, adapter, cross-attention, and sequence encoder during training, while freezing the semantic embedding layer and semantic user base. Because the original LLM embeddings E se and U llm stay frozen, their semantic relations are preserved throughout training. The training loss is the combination of a pairwise ranking loss and a self-distillation loss, which can be written as follows:', '74c74', '< To validate the effectiveness and flexibility of the proposed LLM-ESR, we show the overall, tail and head performance on three datasets in Table 1. At a glance, we find that the proposed LLM-ESR can outperform all competing baselines with all SRS models across all user or item groups, which verifies the usefulness of our framework. Then, we probe more conclusions by the following analysis. H@10 N@10 H@10 N@10 H@10 N@10 H@10 N@10 H@10 N@10 MELT is devised to enhance tail user, but underperforms our method because of its limitations to collaborative perspective. 
Though LLMInit can also benefit tail users by introducing semantics, it ignores the utilization of LLMs from the user side.', '---', '> To validate the effectiveness and flexibility of the proposed LLM-ESR, we show the overall, tail and head performance on three datasets in Table 1. At a glance, we find that the proposed LLM-ESR can outperform all competing baselines with all SRS models across all user or item groups, which verifies the usefulness of our framework. Then, we probe more conclusions by the following analysis. MELT is devised to enhance tail user, but underperforms our method because of its limitations to a collaborative perspective. Though LLMInit can also benefit tail users by introducing semantics, it ignores the utilization of LLMs from the user side.', '139c139', '< At the same time, it is risky to overfit with semantic embeddings when the textual data is scarce. To validate the robustness of our LLM-ESR, we conduct additional experiments in scenarios with limited textual data. To simulate this situation, we removed all attributes from the item descriptions except for "name" and "categories" when constructing the textual prompts for the Yelp dataset (originally using 8 attributes). This reduced the average word count of the textual prompts from 38.38 to 20.33. We used SASRec as the backbone model in these supplementary experiments, with results presented Table 6: The experiments for limited text and the design of freezing semantic embedding. All the experiments are conducted on the Yelp dataset and for LLM-ESR. "Full" and "Crop" mean that we use the completed item prompt and attribute-cropped prompt to get the LLM embeddings, respectively. "w/o F" means that we train the LLM-ESR without freezing the semantic embedding layer.', '---', '> At the same time, it is risky to overfit with semantic embeddings when the textual data is scarce. 
To validate the robustness of our LLM-ESR, we conduct additional experiments in scenarios with limited textual data. To simulate this situation, we removed all attributes from the item descriptions except for "name" and "categories" when constructing the textual prompts for the Yelp dataset (originally using 8 attributes). This reduced the average word count of the textual prompts from 38.38 to 20.33. We used SASRec as the backbone model in these supplementary experiments, with results presented in Table 6. All the experiments are conducted on the Yelp dataset and for LLM-ESR. "Full" and "Crop" mean that we use the completed item prompt and attribute-cropped prompt to get the LLM embeddings, respectively. "w/o F" means that we train the LLM-ESR without freezing the semantic embedding layer.', '142c142', '< Tail Item Head Item Tail User Head User H@10 N@10 H@10 N@10 H@10 N@10 H@10 N@10 H@10 N@10 Full 0.6673 0.4208 0.1893 0.0845 0.8080 0.5199 0.6685 0.4229 0.6627 0.4128 Full w/o F 0.6069 0.3664 0.1284 0.0541 0.7477 0.4584 0.6028 0.3647 0.6226 0.3730 Crop 0.6477 0.4046 0.1478 0.0675 0.7807 0.4968 0.6468 0.4058 0.6511 0.3998 Crop w/o F 0.6025 0.3630 0.1247 0.0563 0.7432 0.4563 0.6004 0.3615 0.6109 0.3786 in Table 6. In the table, Full and Crop represent the use of the complete and cropped prompts, respectively. w/o F denotes training LLM-ESR without freezing the semantic embedding layer. The results show a decrease in performance for both Full and Crop due to the limited textual prompt. Moreover, Full w/o F and Crop w/o F yield similar results, indicating that semantic embeddings suffer from overfitting with both complete and cropped prompts. 
In contrast, freezing the semantic embedding layer improves performance in both scenarios and significantly benefits long-tail items, demonstrating that our design effectively alleviates the overfitting issue.', '---', '> Tail Item Head Item Tail User Head User H@10 N@10 H@10 N@10 H@10 N@10 H@10 N@10 H@10 N@10 Full 0.6673 0.4208 0.1893 0.0845 0.8080 0.5199 0.6685 0.4229 0.6627 0.4128 Full w/o F 0.6069 0.3664 0.1284 0.0541 0.7477 0.4584 0.6028 0.3647 0.6226 0.3730 Crop 0.6477 0.4046 0.1478 0.0675 0.7807 0.4968 0.6468 0.4058 0.6511 0.3998 Crop w/o F 0.6025 0.3630 0.1247 0.0563 0.7432 0.4563 0.6004 0.3615 0.6109 0.3786 In this table, Full and Crop represent the use of the complete and cropped prompts, respectively, and w/o F denotes training LLM-ESR without freezing the semantic embedding layer. The results show a decrease in performance for both Full and Crop due to the limited textual prompt. Moreover, Full w/o F and Crop w/o F yield similar results, indicating that semantic embeddings suffer from overfitting with both complete and cropped prompts. In contrast, freezing the semantic embedding layer improves performance in both scenarios and significantly benefits long-tail items, demonstrating that our design effectively alleviates the overfitting issue.', '148c148', '< For a more meticulous analysis of to what extent the proposed LLM-ESR alleviates the long-tail challenges, we categorize users and items into 5 groups. The performances of each method with Bert4Rec and GRU4Rec as backbone models are shown in Table 6 and Tabel 7, respectively. Firstly, we analyze the results in different user groups. Undoubtedly, all methods perform worse for those users with fewer interactions, which highlights the long-tail user challenge. MELT can enhance the Bert4Rec well so that the performances in all groups get increased, but is incompatible with GRU4Rec and thus harms several groups. By comparison, LLMInit and our LLM-ESR can benefit all user groups consistently. 
Due to the better utilization of semantics from LLMs, LLM-ESR can outperform LLMInit evidently. Besides, the superiority is larger for more long-tailed users, i.e., 1-4 and 5-9  ', '---', '> To examine more closely how far the proposed LLM-ESR alleviates the long-tail challenges, we categorize users and items into 5 groups. The performance of each method with Bert4Rec and GRU4Rec as backbone models is shown in Table 6 and Table 7, respectively. First, we analyze the results across user groups. All methods perform worse for users with fewer interactions, which highlights the long-tail user challenge. MELT improves Bert4Rec in all groups, but is incompatible with GRU4Rec and thus harms several groups. By comparison, LLMInit and our LLM-ESR benefit all user groups consistently. Owing to its better utilization of semantics from LLMs, LLM-ESR clearly outperforms LLMInit. Moreover, its advantage is larger for more long-tailed users, i.e., 1-4 and 5-9', '449d448', '< ']
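
The hunks above describe retrieving similar users from a frozen semantic user base and combining a pairwise ranking loss with a self-distillation loss. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. All function names, the cosine-similarity retrieval, the mean-squared-error form of the distillation term, and the weight `alpha` are assumptions made for illustration; the diff itself only states that the two losses are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_similar_users(user_id, semantic_user_base, top_n):
    """Return the ids of the top_n users closest to user_id.

    semantic_user_base maps user id -> cached LLM user embedding. It is only
    read here, never updated, mirroring the "frozen" user base in the text.
    """
    query = semantic_user_base[user_id]
    scored = [
        (cosine(query, emb), other)
        for other, emb in semantic_user_base.items()
        if other != user_id
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [other for _, other in scored[:top_n]]


def self_distillation_loss(user_repr, similar_reprs):
    """Assumed form: MSE between a user's representation and the mean
    representation of the retrieved similar users (the "teacher")."""
    dim = len(user_repr)
    teacher = [
        sum(r[d] for r in similar_reprs) / len(similar_reprs) for d in range(dim)
    ]
    return sum((u - t) ** 2 for u, t in zip(user_repr, teacher)) / dim


def total_loss(ranking_loss, sd_loss, alpha=0.1):
    """Combine the two losses; alpha is a hypothetical weighting hyperparameter."""
    return ranking_loss + alpha * sd_loss


# Toy usage with made-up numbers.
semantic_user_base = {
    "u1": [1.0, 0.0],
    "u2": [0.9, 0.1],
    "u3": [0.0, 1.0],
    "u4": [0.8, 0.3],
}
# Hypothetical sequence-encoder outputs for the same users.
encoder_reprs = {"u1": [0.0, 0.0], "u2": [1.0, 1.0], "u3": [5.0, 5.0], "u4": [3.0, 1.0]}

neighbours = retrieve_similar_users("u1", semantic_user_base, top_n=2)
sd = self_distillation_loss(encoder_reprs["u1"], [encoder_reprs[n] for n in neighbours])
print(neighbours, sd, total_loss(ranking_loss=1.0, sd_loss=sd))
```

Because retrieval reads only cached embeddings, no LLM call is needed at training or inference time in this sketch, which is the property the abstract emphasizes.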
