Title: LLM-ESR: Large Language Models Enhancement for Long-tailed Sequential Recommendation

Abstract: Sequential Recommender Systems (SRS) are critical for predicting user preferences based on historical interactions, with broad applications in e-commerce and social media. However, they frequently encounter two significant challenges: the long-tail user problem (users with limited interactions) and the long-tail item problem (infrequently consumed items). These issues severely impact user experience and business metrics due to interaction sparsity, leading to suboptimal recommendations and the "seesaw" effect in existing solutions. Leveraging the powerful semantic understanding capabilities of Large Language Models (LLMs), we propose the Large Language Models Enhancement framework for Sequential Recommendation (LLM-ESR). LLM-ESR effectively mitigates these long-tail challenges by integrating pre-cached LLM-derived semantic embeddings, thereby incurring no additional inference load. Specifically, for long-tail items, we introduce a novel dual-view modeling framework that synergistically combines rich LLM semantics with traditional collaborative signals. For long-tail users, we develop a retrieval-augmented self-distillation method to enrich user preference representations by transferring knowledge from similar, more informative user interactions. Through extensive experiments on three real-world datasets and three popular SRS backbone models, LLM-ESR consistently outperforms state-of-the-art baselines, demonstrating superior effectiveness and versatility, particularly for long-tail scenarios. The implementation code is available at https://github.com/Applied-Machine-Learning-Lab/LLM-ESR.

Section: Introduction
The objective of sequential recommendation is to predict the next likely item for users based on their historical records [7,54]. Owing to its wide-ranging applicability in various domains such as e-commerce [48] and social media [5], sequential recommendation has garnered considerable attention in recent years. Given that the essence of sequential recommendation revolves around extracting user preferences from their interaction records, several innovative architectures have been proposed. Notably, the proposed framework is model-agnostic, allowing it to be adapted to any sequential recommendation model. The contributions of this paper are as follows:
• We introduce LLM-ESR, a novel large language model enhancement framework that effectively alleviates both long-tail user and item challenges in SRS through a dual-view modeling approach and retrieval-augmented self-distillation, by integrating rich semantic information from LLMs.
• We design an efficient embedding-based enhancement method that pre-caches LLM-derived semantic embeddings, thereby avoiding additional inference burden and uniquely preserving the original semantic relations by freezing the semantic embedding layer.
• We conduct extensive experiments on three real-world datasets with three popular SRS backbone models, demonstrating the superior effectiveness and broad flexibility of LLM-ESR compared to existing baselines, especially for long-tail scenarios.

Section: Problem Definition
The goal of sequential recommendation is to predict the next item that a user is likely to interact with, based on the user's interaction records. The sets of users and items are denoted as $\mathcal{U} = \{u_1, \ldots, u_i, \ldots, u_{|\mathcal{U}|}\}$ and $\mathcal{V} = \{v_1, \ldots, v_i, \ldots, v_{|\mathcal{V}|}\}$, respectively, where $|\mathcal{U}|$ and $|\mathcal{V}|$ are the numbers of users and items. Each user has an interaction sequence, which arranges the interacted items in chronological order, denoted as $\mathcal{S}_u = \{v_1^{(u)}, \ldots, v_i^{(u)}, \ldots, v_{n_u}^{(u)}\}$, where $n_u$ is the number of interactions of user $u$. For simplicity, we omit the superscript $(u)$ in the following sections. The problem of sequential recommendation can then be defined as follows:
$$\arg\max_{v_i \in \mathcal{V}} P(v_{n_u+1} = v_i \mid \mathcal{S}_u) \qquad (1)$$
Following existing works on long-tailed SRS [17,20], we split the users and items into tail and head groups. Let $n_u$ and $p_v$ denote the length of user $u$'s interaction sequence and the popularity of item $v$ (i.e., its total number of interactions). First, we sort the users and items by $n_u$ and $p_v$ in descending order. Then, following the Pareto principle [4], we take the top 20% of users and items as head users and head items, denoted as $\mathcal{U}_{head}$ and $\mathcal{V}_{head}$. The remaining users and items are the tail users and tail items, i.e., $\mathcal{U}_{tail} = \mathcal{U} \setminus \mathcal{U}_{head}$ and $\mathcal{V}_{tail} = \mathcal{V} \setminus \mathcal{V}_{head}$. To alleviate the long-tail challenges, we aim to improve the recommendation performance for $\mathcal{U}_{tail}$ and $\mathcal{V}_{tail}$.
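The head/tail split described above can be illustrated with a minimal Python sketch. The toy sequences, the helper name `split_head_tail`, the rounding of the 20% cut-off, and the handling of ties are all assumptions made for illustration; the text does not specify them.

```python
from collections import Counter

def split_head_tail(scores, head_ratio=0.2):
    """Sort ids by score in descending order and return (head, tail) id sets."""
    ranked = sorted(scores, key=lambda k: scores[k], reverse=True)
    n_head = int(len(ranked) * head_ratio)  # assumption: round the cut-off down
    return set(ranked[:n_head]), set(ranked[n_head:])

# Hypothetical interaction sequences S_u, ordered by time.
sequences = {
    "u1": ["a", "b", "c", "d", "e"],
    "u2": ["a", "b"],
    "u3": ["a", "c", "d"],
    "u4": ["a"],
    "u5": ["b", "a"],
}

seq_len = {u: len(s) for u, s in sequences.items()}              # n_u
popularity = Counter(v for s in sequences.values() for v in s)   # p_v

head_users, tail_users = split_head_tail(seq_len)
head_items, tail_items = split_head_tail(popularity)
```

With five users and five items, the top 20% is a single user and a single item; everything else falls into the tail groups.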

Section: LLM-ESR 3.1 Overview
The overview of the proposed LLM-ESR is shown in Figure 2. To acquire the semantic information, we adopt LLMs to encode textual users' historical interactions and items' attributes into LLMs user embedding and LLMs item embedding. Then, two modules are proposed to augment long-tail items and long-tail users, respectively, i.e., Dual-view Modeling and Retrieval Augmented Self-Distillation. i) Dual-view Modeling: This module consists of two branches. One is semantic-view modeling, which aims to extract the semantic information from the user's interaction sequence. It first utilizes the semantic embedding layer, derived from LLMs item embedding, to encode the items.
Then, an adapter is designed for dimension adaptation and space transformation. The output item embedding sequence will be fed into cross-attention for fusion and then sequence encoder to get the user representation in semantic view. The other branch is collaborative-view modeling, which transforms the interaction sequence into an embedding one by a collaborative embedding layer.
Next, after a cross-attention layer and the sequence encoder, the collaborative user preference is obtained. At the end of this module, the user representations in the two views are fused for the final recommendations. ii) Retrieval Augmented Self-Distillation: This module aims to enhance long-tail users through the informative interactions of similar users. First, the derived LLMs user embedding is used as a semantic user base for retrieving similar users. Then, similar users are fed into dual-view modeling to get their user representations, which serve as the guide signal for self-distillation. Finally, the derived distillation loss is used as an auxiliary loss for training.

Section: Dual-view Modeling
Traditional SRS models are skilled at capturing collaborative signals and can therefore recommend popular items well [20,17]. However, they compromise on long-tail items due to the lack of semantics [2]. Therefore, we model user preferences from dual views to cover all items simultaneously. In addition, we propose a two-level fusion to better combine the benefits of both views.
Semantic-view Modeling. In general, the attributes and descriptions of items contain abundant semantics. To utilize the powerful semantic understanding abilities of LLMs, we organize the attributes and descriptions into textual prompts (the template of prompts can be found in Appendix A.1). To circumvent the potential inference burden imposed by LLMs, we pre-cache the embeddings derived from LLMs for efficient usage. Specifically, the embeddings can be obtained by taking the last hidden state of open-sourced LLMs, such as LLaMA [51], or from a public API, such as text-embedding-ada-002.
We adopt the latter in this paper. Let $\mathbf{E}_{se} \in \mathbb{R}^{|\mathcal{V}| \times d_{llm}}$ denote the LLMs embedding of all items, where $d_{llm}$ is the dimension of the LLMs embedding. Then, the semantic embedding layer $\mathbf{E}_{se}$ from LLMs can be used for semantic-view modeling to enhance long-tail items. However, previous works [13,16] often use it to initialize the item embedding layer, which may ruin the original semantic relations during fine-tuning. To retain the semantics, we freeze $\mathbf{E}_{se}$ and propose an adapter to transform the raw semantic space into the recommending space. For each item $i$, we get its LLMs embedding $\mathbf{e}_i^{llm}$ by taking the $i$-th row of $\mathbf{E}_{se}$. It is then fed into the tunable adapter to get the semantic embedding:
$$\mathbf{e}_i^{se} = \mathbf{W}_2^a(\mathbf{W}_1^a \mathbf{e}_i^{llm} + \mathbf{b}_1^a) + \mathbf{b}_2^a \qquad (2)$$
where $\mathbf{W}_1^a \in \mathbb{R}^{\frac{d_{llm}}{2} \times d_{llm}}$, $\mathbf{W}_2^a \in \mathbb{R}^{d \times \frac{d_{llm}}{2}}$ and $\mathbf{b}_1^a \in \mathbb{R}^{\frac{d_{llm}}{2} \times 1}$, $\mathbf{b}_2^a \in \mathbb{R}^{d \times 1}$ are the weight matrices and biases of the adapter. Following this process, we can obtain the item embedding sequence of the user's interaction records, denoted as $\mathcal{S}^{se} = [\mathbf{e}_1^{se}, \ldots, \mathbf{e}_{n_u}^{se}]$. Similar to a general SRS model, we employ a sequence encoder $f_\theta$ (e.g., self-attention layers [52] for SASRec [18]) to get the representation of user preference in the semantic view as follows:
$$\mathbf{u}^{se} = f_\theta(\mathcal{S}^{se}) \qquad (3)$$
where $\mathbf{u}^{se} \in \mathbb{R}^{d \times 1}$ is the user preference representation in the semantic view and $\theta$ denotes the parameters of the sequence encoder in an SRS model.
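As a rough illustration of the adapter in Equation (2), the sketch below applies two affine maps to a frozen LLM embedding using plain Python lists. The dimensions, the random initialization range, and the helper names (`linear`, `make_adapter`, `adapt`) are assumptions for illustration only; the released code may organize this differently.

```python
import random

def linear(W, b, x):
    """Return W x + b for a row-major matrix W and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def make_adapter(d_llm, d, seed=0):
    """Create adapter parameters with hidden width d_llm / 2, as in Equation (2)."""
    rng = random.Random(seed)
    hidden = d_llm // 2

    def init(rows, cols):
        # Assumption: small uniform initialization, chosen only for the demo.
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {"W1": init(hidden, d_llm), "b1": [0.0] * hidden,
            "W2": init(d, hidden), "b2": [0.0] * d}

def adapt(params, e_llm):
    """e_se = W2 (W1 e_llm + b1) + b2; Equation (2) has no nonlinearity."""
    hidden = linear(params["W1"], params["b1"], e_llm)
    return linear(params["W2"], params["b2"], hidden)

# The LLM embedding is treated as a constant input; only the adapter
# parameters would be trained.
params = make_adapter(d_llm=8, d=3)
e_se = adapt(params, [1.0] * 8)
```

Here an 8-dimensional embedding is mapped to a 3-dimensional semantic embedding through a hidden width of 4.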
Collaborative-view Modeling. To utilize the collaborative information, we adopt a trainable item embedding layer and update it under the supervision of interaction data. Let $\mathbf{E}_{co} \in \mathbb{R}^{|\mathcal{V}| \times d}$ denote the collaborative embedding layer of the items. Then, the item embedding sequence $\mathcal{S}^{co} = [\mathbf{e}_1^{co}, \ldots, \mathbf{e}_{n_u}^{co}]$ is acquired by extracting the corresponding rows from $\mathbf{E}_{co}$. To get the user preference $\mathbf{u}^{co}$ in the collaborative view, we input the embedding sequence into the sequence encoder, i.e., $\mathbf{u}^{co} = f_\theta(\mathcal{S}^{co})$. It is worth noting that the same sequence encoder $f_\theta$ is used in both the semantic and collaborative views, because the two views share sequential patterns and sharing is more efficient [46]. Besides, the embedding layers in the two views are at unbalanced training stages (one is pretrained, while the other is trained from scratch), which may lead to optimization difficulty [1]. To handle this problem, we initialize $\mathbf{E}_{co}$ with a dimension-reduced $\mathbf{E}_{se}$. Principal Component Analysis (PCA) [44] is used as the dimension reduction method in this paper.
Two-level Fusion. The effective integration of both semantic-view and collaborative-view is essential to absorb the benefits of these two. However, the direct merge of the user representations in dual views may overlook the nuanced inter-relationships between item sequences. Thus, we design a two-level fusion method for the dual-view modeling module, i.e., sequence-level and logit-level. The former aims to implicitly capture the mutual relationships between the item sequences of dual views, while the latter explicitly targets the combination of recommending abilities. Specifically, we propose a cross-attention mechanism for sequence-level fusion. To simplify the description, we only take the semantic view interacting with the collaborative view for illustration; the other direction is handled in the same way. Here, $\mathcal{S}^{se}$ is considered as the query, and $\mathcal{S}^{co}$ as the key and value in the attention mechanism. Let $\mathbf{Q} = \mathcal{S}^{se}\mathbf{W}^Q$, $\mathbf{K} = \mathcal{S}^{co}\mathbf{W}^K$, $\mathbf{V} = \mathcal{S}^{co}\mathbf{W}^V$, where $\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V \in \mathbb{R}^{d \times d}$ are weight matrices.
Then, the interacted collaborative embedding sequence can be formulated as follows:
$$\hat{\mathcal{S}}^{co} = \mathrm{Softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right)\mathbf{V} \qquad (4)$$
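A minimal sketch of the cross-attention in Equation (4) is given below, with sequences stored as lists of row vectors. The identity weight matrices and two-item toy sequences are assumptions chosen to keep the example small, and no attention mask is applied because Equation (4) as written does not include one.

```python
import math

def matmul(A, B):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(S_query, S_kv, W_Q, W_K, W_V):
    """Softmax(Q K^T / sqrt(d)) V with queries from one view, keys and values from the other."""
    d = len(W_Q[0])
    Q, K, V = matmul(S_query, W_Q), matmul(S_kv, W_K), matmul(S_kv, W_V)
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V), weights

# Toy example: semantic sequence as query, collaborative sequence as key/value.
identity = [[1.0, 0.0], [0.0, 1.0]]
S_se = [[1.0, 0.0], [0.0, 1.0]]
S_co = [[1.0, 0.0], [0.0, 1.0]]
S_co_hat, attn = cross_attention(S_se, S_co, identity, identity, identity)
```

Swapping the roles of the two sequences in the call gives the other direction of the fusion.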
Following the same cross-attention process, we can also get the corresponding semantic embedding sequence $\hat{\mathcal{S}}^{se}$. Finally, $\mathcal{S}^{se}$ and $\mathcal{S}^{co}$ are replaced by $\hat{\mathcal{S}}^{se}$ and $\hat{\mathcal{S}}^{co}$ before being fed into $f_\theta(\cdot)$. As for logit-level fusion, we concatenate the two-view user and item embeddings for recommendation. The probability score of recommending item $j$ for the user $u$ is therefore calculated as:
$$P(v_{n_u+1} = v_j \mid v_{1:n_u}) = [\mathbf{e}_j^{se} : \mathbf{e}_j^{co}]^T [\mathbf{u}^{se} : \mathbf{u}^{co}] \qquad (5)$$
where ":" denotes the concatenation of two vectors. Based on the probability score, we adopt the pairwise ranking loss to train the framework:
$$\mathcal{L}_{Rank} = -\sum_{u \in \mathcal{U}} \sum_{k=1}^{n_u} \log \sigma\left(P(v_{k+1} = v_{k+1}^{+} \mid v_{1:k}) - P(v_{k+1} = v_{k+1}^{-} \mid v_{1:k})\right) \qquad (6)$$
where $v_{k+1}^{+}$ and $v_{k+1}^{-}$ are the ground-truth item and the paired negative item. It is worth noting that the ranking loss may differ slightly across backbone SRS models, e.g., a sequence-to-one pairwise loss is used for GRU4Rec [14].
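The logit-level fusion of Equation (5) and one term of the pairwise loss in Equation (6) can be sketched as follows. The function names and toy vectors are hypothetical, and the loss is written in a numerically stable form that is mathematically equal to $-\log\sigma(\cdot)$.

```python
import math

def score(e_se_j, e_co_j, u_se, u_co):
    """Equation (5): dot product of the concatenated item and user vectors."""
    item = e_se_j + e_co_j   # list concatenation plays the role of [e_se : e_co]
    user = u_se + u_co
    return sum(a * b for a, b in zip(item, user))

def pairwise_loss(pos_score, neg_score):
    """One summand of Equation (6): -log(sigmoid(pos_score - neg_score))."""
    x = pos_score - neg_score
    # -log(sigmoid(x)) = log(1 + exp(-x)), evaluated without overflow.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Toy vectors: one semantic dimension pair and one collaborative dimension.
pos = score([1.0, 2.0], [3.0], [1.0, 1.0], [2.0])
neg = score([0.0, 1.0], [1.0], [1.0, 1.0], [2.0])
loss = pairwise_loss(pos, neg)
```

Because the vectors are concatenated before the dot product, the score equals the sum of the semantic-view and collaborative-view dot products.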

Section: Retrieval Augmented Self-Distillation
The long-tail user problem originates from the lack of enough interactions for the sequence encoder in an SRS to capture users' preferences. Thus, we propose a self-distillation method to augment the extraction capacity of the sequence encoder. Self-distillation [12,61] is a type of knowledge distillation that considers one model as both the student and teacher for model enhancement. As for the SRS, since multiple similar users have more informative interactions, it is promising to transfer their knowledge to the target user for strengthening. Thereafter, there are two key challenges for such knowledge transfer, i.e., how to retrieve similar users and how to transfer the knowledge.
Retrieve Similar Users. Previous works have confirmed that LLMs can understand the semantic meanings of textual user interaction records for recommendation [25,10]. Based on this observation, we organize the titles of the items that a user has interacted with into textual prompts (the template of prompts can be found in Appendix A.1). Then, similar to the derivation of the LLMs item embedding $\mathbf{E}_{se}$, we can obtain and save the LLMs user embedding, denoted as $\mathbf{U}_{llm} \in \mathbb{R}^{|\mathcal{U}| \times d_{llm}}$. It is also dubbed the semantic user base in this paper, because the semantic relations are encoded in it. For each target user $k$, we can retrieve the similar user set $\mathcal{U}_k$ as follows:
$$\mathcal{U}_k = \mathrm{Top}\left(\{\cos(\mathbf{u}_k^{llm}, \mathbf{u}_j^{llm})\}_{j=1}^{|\mathcal{U}|}, N\right) \qquad (7)$$
where $\cos(\cdot, \cdot)$ is the cosine similarity function, which measures how close two vectors are. $N$ represents the size of the similar user set, which is a hyper-parameter.
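The retrieval in Equation (7) amounts to a top-N search by cosine similarity over the cached user embeddings. The sketch below is a brute-force version on toy two-dimensional vectors; the function names are hypothetical, and excluding the target user from its own neighbour set is an assumption, since Equation (7) as written ranges over all users.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_similar_users(U_llm, k, N):
    """Return the indices of the N users most similar to user k."""
    # Assumption: the target user itself is skipped.
    sims = [(cosine(U_llm[k], u), j) for j, u in enumerate(U_llm) if j != k]
    sims.sort(reverse=True)
    return [j for _, j in sims[:N]]

# Hypothetical cached user embeddings (rows of U_llm).
U_llm = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbours = retrieve_similar_users(U_llm, k=0, N=2)
```

Since the embeddings are frozen, the neighbour sets can be computed once before training rather than at every step.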
Self-Distillation. As mentioned before, we design the self-distillation to transfer the knowledge from several similar users to the target user. Since the representations of user preference, i.e., $\mathbf{u}^{se}$ and $\mathbf{u}^{co}$, encode the comprehensive knowledge of the user, we use these representations as the mediator for the distillation. To get the teacher mediator, we first utilize the dual-view modeling framework (Section 3.2) to get the user representation for each similar user, denoted as $\{\mathbf{u}_j^{se}, \mathbf{u}_j^{co}\}_{j=1}^{|\mathcal{U}_k|}$. Then, the teacher mediator is calculated by mean pooling, as in the following formula:
$$[\mathbf{u}_{T_k}^{se} : \mathbf{u}_{T_k}^{co}] = \mathrm{Mean\_Pooling}\left(\{[\mathbf{u}_j^{se} : \mathbf{u}_j^{co}]\}_{j=1}^{|\mathcal{U}_k|}\right) \qquad (8)$$
The student mediator is the representation of target user $k$, i.e., $[\mathbf{u}_k^{se} : \mathbf{u}_k^{co}]$. Based on the teacher and student mediators, the self-distillation loss can be formulated as:
$$\mathcal{L}_{SD} = \frac{1}{|\mathcal{U}|} \sum_{k=1}^{|\mathcal{U}|} \left| [\mathbf{u}_k^{se} : \mathbf{u}_k^{co}] - [\mathbf{u}_{T_k}^{se} : \mathbf{u}_{T_k}^{co}] \right|^2 \qquad (9)$$
Note that the gradients of $\mathbf{u}_{T_k}^{se}$ and $\mathbf{u}_{T_k}^{co}$ are stopped, because they only provide the guidance signal instead of optimizing the model.
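Equations (8) and (9) can be sketched in plain Python as below. The function names and toy vectors are hypothetical, and $|\cdot|^2$ is read here as the squared Euclidean distance, which is an interpretation of the notation. Plain Python has no automatic differentiation, so the stop-gradient is only noted in a comment; in a framework such as PyTorch the teacher vector would typically be detached from the graph.

```python
def mean_pool(vectors):
    """Equation (8): element-wise mean over the similar users' representations."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def self_distillation_loss(students, teacher_sets):
    """Equation (9): mean squared distance between student and teacher mediators.

    students[k] is the concatenated [u_se : u_co] of target user k;
    teacher_sets[k] holds the concatenated representations of its similar users.
    """
    total = 0.0
    for student, similar_users in zip(students, teacher_sets):
        teacher = mean_pool(similar_users)  # treated as a constant (no gradient)
        total += sum((s - t) ** 2 for s, t in zip(student, teacher))
    return total / len(students)

# One target user with two retrieved neighbours.
students = [[1.0, 1.0]]
teacher_sets = [[[0.0, 0.0], [2.0, 0.0]]]
loss_sd = self_distillation_loss(students, teacher_sets)
```

In the toy case the teacher mediator is [1.0, 0.0], so only the second coordinate contributes to the loss.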

Section: Train and Inference
Train. As described in Section 3.2 and Section 3.3, we update only the collaborative embedding layer, adapter, cross-attention and sequence encoder during training, while freezing the semantic embedding layer and semantic user base. By freezing the original LLM embeddings $\mathbf{E}_{se}$ and $\mathbf{U}_{llm}$, we ensure the preservation of their inherent semantic relations throughout training. The training loss for optimization is the combination of the pairwise ranking loss and the self-distillation loss, which can be written as follows:
$$\mathcal{L} = \mathcal{L}_{Rank} + \alpha \cdot \mathcal{L}_{SD} \qquad (10)$$
where $\alpha$ is a hyper-parameter to adjust the magnitude of self-distillation.
Inference. During inference, the retrieval augmented self-distillation module is skipped, because the auxiliary loss is not needed. Thus, we follow the dual-view modeling process and make the final recommendation by Equation (5). Besides, since the semantic embedding layer can be cached in advance, no call to the LLM is required, which avoids extra inference costs. Due to limited space, the full algorithm is given in Appendix A.2.

Section: Experiment


Section: Experimental Settings
Dataset. There are three real-world datasets applied for evaluation, i.e., Yelp, Amazon Fashion and Amazon Beauty. We follow the previous SRS works [18,50] for preprocessing and data split. More details about the datasets and preprocessing can be seen in Appendix B.1.
Baselines. To validate the flexibility, we combine the competing baselines and LLM-ESR with three well-known backbone SRS models: GRU4Rec [14], Bert4Rec [49] and SASRec [18]. Then, two groups of baselines are compared in the experiments. One group consists of traditional enhancement frameworks for long-tailed sequential recommendation, including CITIES [17] and MELT [20]. The other group consists of LLM-based enhancement frameworks, namely RLMRec [45] and LLMInit [13,16]. More details about the baselines are given in Appendix B.2.
Implementation Details. The hardware used in all experiments is an Intel Xeon Gold 6133 platform with Tesla V100 32G GPUs, while the basic software requirements are Python 3.9.5 and PyTorch 1.12.0. The hyper-parameters N and α are searched from {2, 6, 10, 14, 18} and {1, 0.5, 0.1, 0.05, 0.01}, respectively. More implementation details are in Appendix B.3. The implementation code is available at https://github.com/Applied-Machine-Learning-Lab/LLM-ESR.
Evaluation Metrics. In the experiments, we adopt the metrics of Top-10 list for evaluation. Specifically, the Hit Rate (H@10) and Normalized Discounted Cumulative Gain (N@10) are used. Following [18], we randomly sample 100 items that the user has not interacted with as the negatives paired with the ground truth for calculation of the metrics. To guarantee the robustness of the experimental results, we report the average results of the triplicate test with random seeds {42, 43, 44}.
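For a single test case with one ground-truth item and a set of sampled negatives, H@10 and N@10 reduce to simple functions of the ground truth's rank. The sketch below shows this under the assumption that ties are resolved in favour of the ground truth; the function names and toy scores are hypothetical.

```python
import math

def rank_of_ground_truth(pos_score, neg_scores):
    """0-based rank of the ground-truth item among itself and the sampled negatives."""
    # Assumption: only strictly higher-scoring negatives are ranked above it.
    return sum(1 for s in neg_scores if s > pos_score)

def hit_at_k(rank, k=10):
    """1.0 if the ground truth appears in the top-k list, else 0.0."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k=10):
    """With one relevant item the ideal DCG is 1, so NDCG is 1 / log2(rank + 2)."""
    return 1.0 / math.log2(rank + 2) if rank < k else 0.0

# Toy case with three negatives instead of the 100 used in the experiments.
rank = rank_of_ground_truth(0.5, [0.9, 0.1, 0.7])
hit, ndcg = hit_at_k(rank), ndcg_at_k(rank)
```

The reported metrics would then be the averages of these per-user values over all test users.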

Section: Overall Performance
To validate the effectiveness and flexibility of the proposed LLM-ESR, we show the overall, tail and head performance on three datasets in Table 1. At a glance, we find that the proposed LLM-ESR outperforms all competing baselines with all SRS models across all user and item groups, which verifies the usefulness of our framework. We then draw further conclusions from the following analysis. MELT is devised to enhance tail users, but underperforms our method because it is limited to a collaborative perspective. Though LLMInit can also benefit tail users by introducing semantics, it ignores the utilization of LLMs from the user side.
Flexibility. Table 1 shows that the proposed framework achieves the largest performance improvements on all three backbone SRS models, which indicates the flexibility of LLM-ESR. By comparison, the other baselines tend to depend on the type of SRS. The traditional methods, i.e., CITIES and MELT, tend to perform better with GRU4Rec, while LLMInit is more beneficial to Bert4Rec and SASRec.

Section: Ablation Study
The results of the ablation study are shown in   

Section: Hyper-parameter Analysis
To investigate the effects of the hyper-parameters in LLM-ESR, we show the performance trend along with their changes in Figure 3. The hyper-parameter α controls to what extent the designed self-distillation affects the optimization. With α ranging from 1 to 0.01, the recommendation accuracy first rises and then drops. A large α compromises performance because overemphasis on self-distillation affects the convergence of the ranking loss. A smaller α also degrades performance, which indicates the usefulness of the designed self-distillation. As for the number of retrieved users N, the best value is 10. Increasing N up to this point helps because more users provide more informative interactions; however, too large an N may decrease the relatedness of the retrieved users.

Section: Group Analysis
For a more fine-grained analysis, we split the users and items into 5 groups according to sequence length $n_u$ and popularity $p_v$, and show the performance of each group in Figure 4. From the results, we observe that LLM-based frameworks yield improvements in every user and item group, while MELT has a positive effect only on some specific groups. This reflects the seesaw problem of MELT and reveals the benefit of making use of semantic embeddings from LLMs. Comparing LLMInit with LLM-ESR, LLM-ESR achieves larger gains on the long-tail groups (e.g., the 1-4 user group and the 1-9 item group), which shows that our framework better preserves the semantic information from LLMs. The group analyses with Bert4Rec and GRU4Rec as backbones are shown in Appendix C.3.

Section: Related Works

Section: Sequential Recommendation
The core of sequential recommendation is capturing the sequential pattern to predict the next likely item [29,39,31,38,60,24,23,26,40]. Thus, at the early stage, researchers focused on designing architectures to improve model capacity. GRU4Rec [14] and Caser [50] apply RNNs and CNNs [21] for sequence modeling. Later, inspired by the great success of self-attention [52] in natural language processing, SASRec [18] and Bert4Rec [49] verify its potential in SRS. Also, Zhou et al. [66] propose a pure MLP architecture, achieving similar accuracy but higher efficiency compared with SASRec. Despite the great progress in SRS, long-tail problems are still underexplored. As for the long-tail item problem, CITIES [17] designs an embedding inference function specifically for long-tail items. In terms of the long-tail user problem, data augmentation is the main approach [37,34]. Only one work, MELT [20], addresses both problems simultaneously, but it still sticks to a collaborative perspective. By comparison, the proposed LLM-ESR handles both long-tail problems better from a semantic view by introducing LLMs.

Section: LLMs for Recommendation
Large language models [63,43] have attracted widespread attention due to their powerful abilities in semantic understanding. Recently, several works have emerged that explore how to utilize LLMs in recommender systems (RS) [64,28,22,41,57,58,65,32,30,35], which can be categorized into two lines, i.e., LLMs as RS and LLMs enhancing RS. The first line of research aims to complete recommendation tasks with LLMs directly. At the early stage, researchers tended to craft prompt templates to stimulate the recommending ability of LLMs through dialogues. For example, ChatRec [10] proposes a dialogue process to complete recommendation tasks step by step. DRDT [55] integrates a retrieval-based dynamic reflection process for SRS by in-context learning [6]. LLMRerank [9] and UniLLMRec [62] design chain-of-thought prompts to target the reranking stage and the whole recommendation process, respectively. Besides, some other researchers explore fine-tuning open-sourced LLMs for RS. TALLRec [2] is the first one, which fine-tunes a LLaMA-7B by parameter-efficient fine-tuning techniques [15,33]. Some following works, including E4SRec [25], LLaRA [27] and RecInterpreter [59], aim to incorporate collaborative signals into LLMs by modifying the tokenization. However, this line of work faces the challenge of high inference costs. The other line, LLMs enhancing RS, is more practical, because it avoids the use of LLMs while recommending. For instance, RLMRec [45] aligns with LLMs by an auxiliary loss. AlphaRec [47] adopts LLMs embedding to enhance the collaborative filtering models. On the other hand, LLM4MSR [56] and Uni-CTR [8] propose to utilize LLMs to augment the multi-domain recommendation models. As for LLMs enhancing sequential recommendation, Harte et al. [13] and Hu et al. [16] adopt LLMs embedding as the initialization for the traditional models. The proposed LLM-ESR belongs to the latter category but further alleviates the loss of semantic information.

Section: Conclusion
In this paper, we propose a large language model enhancement framework for sequential recommendation (LLM-ESR) to handle the long-tail user and long-tail item challenges. First, we acquire and cache the semantic embeddings derived from LLMs, which keeps inference efficient. Then, a dual-view modeling framework is proposed to combine the semantics from LLMs with the collaborative signals contained in the traditional model, which helps augment the long-tail items in SRS. Next, we design the retrieval augmented self-distillation to alleviate the long-tail user challenge. Through comprehensive experiments, we verify the effectiveness and flexibility of our LLM-ESR.

Section: Acknowledgements
This

7: Get the user preference representation in semantic and collaborative views, i.e., $\mathbf{u}^{se}$ and $\mathbf{u}^{co}$, respectively.
8: Calculate the probability score of ground-truth and negative items by Equation (5).
9: Calculate the ranking loss by Equation (6).
10: Retrieve the similar users for each user in $\mathcal{U}_B$ by Equation (7).
11: Calculate the self-distillation loss by Equation (9).
12: Sum the ranking loss and self-distillation loss. Then, update the parameters.
16: Get the user preference representation in semantic and collaborative views, i.e., $\mathbf{u}^{se}$ and $\mathbf{u}^{co}$.
17: Calculate the probability score of each candidate item by Equation (5) and give out the final recommended list.
18: end for

Section: B Experimental Settings
In this section, we provide more details about the experimental settings.

Section: B.1 Dataset and Preprocessing
The comprehensive experiments in this paper are conducted on three commonly used datasets, i.e., Yelp, Fashion and Beauty. Yelp is a dataset that records the check-in histories and corresponding reviews of users. We only adopt the check-in data and the attribute information of the points of interest. Amazon [42] is a large e-commerce dataset, which includes users' reviews of commodities. There are several sub-categories in this dataset and we use two of them, i.e., Fashion and Beauty.
For preprocessing, we refer to the procedures in SASRec [18]. Since sequential recommendation is often utilized for implicit interactions, we consider all review or rating records as interactions. Then, the users with fewer than three interacted items are dropped, because we do not explore the problem of cold-start users in this paper. As for the data split, the last item $v_{n_u}$ and the penultimate item $v_{n_u-1}$ of each interaction sequence are taken out as the test and validation items, respectively. The statistics of the three preprocessed datasets are shown in Table 3.
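The filtering and leave-one-out split described above can be sketched as follows. The function name and toy sequences are hypothetical, and counting interactions with `len` (rather than distinct items) is an assumption about how "fewer than three interacted items" is measured.

```python
def leave_one_out_split(sequences, min_len=3):
    """Drop short sequences, then hold out the last two items of each user.

    The last item is the test target, the penultimate item is the validation
    target, and everything before them is used for training.
    """
    splits = {}
    for user, items in sequences.items():
        if len(items) < min_len:
            continue  # users with fewer than three interactions are dropped
        splits[user] = {
            "train": items[:-2],
            "valid": items[-2],
            "test": items[-1],
        }
    return splits

# Hypothetical chronologically ordered interaction sequences.
sequences = {"u1": [1, 2, 3, 4], "u2": [5, 6]}
splits = leave_one_out_split(sequences)
```

User `u2` has only two interactions and is therefore removed, while `u1` keeps two training items.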

Section: B.2 Backbone and Baseline
Backbone Models. To show the flexibility of our enhancement method, we test three popular sequential recommendation models in the experiments. The main distinction between these models refers to the sequence encoder f θ and ranking loss L Rank .
• GRU4Rec [14]. It adopts the GRU as the sequence encoder, and sequence-to-one pairwise loss as the final ranking loss.
• Bert4Rec [49]. Inspired by the training pattern of Bert [19], this backbone combines the pairwise ranking loss with a cloze task, which masks a proportion of items in one sequence. The sequence encoder of Bert4Rec is a stack of bi-directional self-attention layers.
• SASRec [18]. Compared with Bert4Rec, SASRec adopts the causal self-attention layer as the basic unit of its sequence encoder. Besides, the sequence-to-sequence pairwise ranking loss is applied for optimization during training.
There are two groups of up-to-date baselines that are compared within this paper, i.e., traditional baselines and LLM-based baselines.
Traditional Baselines. Methods in this category first split the users and items into long-tail and head groups. Then, they enhance the long-tail users or items by specially designed training procedures. Note that they essentially utilize only the collaborative signals and do not introduce any semantics.
• CITIES [17]. This work devises an embedding-inference function to refine the embeddings of long-tail items specifically. The embedding-inference function is trained on head items and used for long-tail items during inference. We follow the hyper-parameters in the original paper and code.
• MELT [20]. MELT proposes a bilateral-branch framework to enhance the long-tail users and items. One branch is trained to generate the head user representations and enhances the tail users during inference. The other branch recovers the embeddings of head items during training and updates the embeddings of tail items during inference. We refer to the implementation and the hyper-parameter settings in the official code.
LLM-based Baselines. The methods in this line aim to combine the semantic information derived from LLMs to enhance the recommendation models.
• RLMRec [45]. This baseline is one of the pioneers in utilizing the semantic embeddings derived from LLMs. However, it is designed for collaborative filtering rather than sequential recommendation. For a fair comparison, we eliminate the process of profile generation during the implementation. We refer to the source code of RLMRec to adapt it to sequential recommendation models.
• LLMInit [13,16]. More recent works, i.e., LLM2Bert4Rec [13] and SAID [16], both utilize the LLMs embedding to initialize the item embedding layer in SRS models and then fine-tune it by interaction data. In this paper, we dub this way as LLMInit.

Section: B.3 Implementation Details
We conduct all experiments on an Intel Xeon Gold 6133 platform with Tesla V100 32G GPUs. Besides, the implementation is based on Python 3.9.5 and PyTorch 1.12.0. In terms of the hyper-parameter search, the criterion is N@10 on the validation set. To avoid overfitting, we adopt early stopping with a patience of 20 epochs. For the backbone SRS models, the number of GRU layers is set to 1 for GRU4Rec, while the number of self-attention layers is fixed at 2 for SASRec and Bert4Rec. Also, the dropout rate is 0.6 for Bert4Rec. In terms of the training, the batch size and learning rate are set as 128 and 0.001 for all datasets. The embedding size is 128 for all baselines and 64 for LLM-ESR. The reason is that LLM-ESR has two branches, so halving the embedding size relative to the single-branch baselines keeps the comparison fair. We choose Adam as the optimizer. The hyper-parameters N and α for LLM-ESR are searched from {2, 6, 10, 14, 18} and {1, 0.5, 0.1, 0.05, 0.01}, respectively. We find that the best choice is 10 for N and 0.1 for α on all three datasets used in this paper. Furthermore, the embeddings of LLMs are derived from the API named "text-embedding-ada-002" provided by OpenAI.

Section: C More Experimental Results
In this section, we will show more experimental results to further analyze the flexibility and effectiveness of our LLM-ESR.

Section: C.1 Ablation Study
For further analysis, we conduct the ablation study on the proposed LLM-ESR with Bert4Rec and GRU4Rec as the backbone SRS models. The results are shown in Table 4 and Table 5. First, we probe the effects of dual-view modeling by removing one of the views, denoted as w/o Co-view and w/o Se-view. In terms of overall performance, both variants underperform, which indicates the necessity of the dual view. Besides, compared with LLM-ESR, w/o Co-view degrades the accuracy of the head item group more, while w/o Se-view harms the long-tail item group. This phenomenon highlights the advantages of the collaborative view and the semantic view, respectively. As for distinct SRS backbone models, we find that Bert4Rec benefits more from collaborative information, because removing the collaborative view causes a more severe performance drop. By comparison, GRU4Rec gains more from the semantic view. Next, w/o SD means eliminating self-distillation. It consistently degrades the performance of the tail user group, which indicates that the proposed retrieval augmented self-distillation does help alleviate the long-tail user challenge. w/o Share represents using separate sequence encoders for the dual views. This variant is slightly worse than applying a shared encoder, which suggests that the two views share a common pattern. Another advantage of the shared encoder is higher parameter efficiency. Besides, LLM-ESR without cross-attention (w/o CA) is consistently inferior to the full LLM-ESR, which indicates the effectiveness of the sequence-level fusion.
At the same time, there is a risk of overfitting to semantic embeddings when textual data is scarce. To validate the robustness of our LLM-ESR, we conduct additional experiments in scenarios with limited textual data. To simulate this situation, we removed all attributes from the item descriptions except for "name" and "categories" when constructing the textual prompts for the Yelp dataset (originally using 8 attributes). This reduced the average word count of the textual prompts from 38.38 to 20.33. We used SASRec as the backbone model in these supplementary experiments, with results presented in Table 6. All the experiments are conducted on the Yelp dataset and for LLM-ESR. "Full" and "Crop" mean that we use the complete item prompt and the attribute-cropped prompt to get the LLM embeddings, respectively. "w/o F" means that we train the LLM-ESR without freezing the semantic embedding layer.

| Model | Overall H@10 | Overall N@10 | Tail Item H@10 | Tail Item N@10 | Head Item H@10 | Head Item N@10 | Tail User H@10 | Tail User N@10 | Head User H@10 | Head User N@10 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full | 0.6673 | 0.4208 | 0.1893 | 0.0845 | 0.8080 | 0.5199 | 0.6685 | 0.4229 | 0.6627 | 0.4128 |
| Full w/o F | 0.6069 | 0.3664 | 0.1284 | 0.0541 | 0.7477 | 0.4584 | 0.6028 | 0.3647 | 0.6226 | 0.3730 |
| Crop | 0.6477 | 0.4046 | 0.1478 | 0.0675 | 0.7807 | 0.4968 | 0.6468 | 0.4058 | 0.6511 | 0.3998 |
| Crop w/o F | 0.6025 | 0.3630 | 0.1247 | 0.0563 | 0.7432 | 0.4563 | 0.6004 | 0.3615 | 0.6109 | 0.3786 |

In this table, Full and Crop represent the use of the complete and cropped prompts, respectively, and w/o F denotes training LLM-ESR without freezing the semantic embedding layer. The results show that performance decreases when the cropped prompt is used, due to the limited textual information. Moreover, Full w/o F and Crop w/o F yield similar results, indicating that semantic embeddings suffer from overfitting with both complete and cropped prompts. In contrast, freezing the semantic embedding layer improves performance in both scenarios and significantly benefits long-tail items, demonstrating that our design effectively alleviates the overfitting issue.
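To make the size of these gaps concrete, the short Python sketch below computes relative changes in H@10 from the values reported above. The dictionary layout and the helper name `relative_change` are our own illustrative choices; only the numbers come from the results.

```python
# (H@10, N@10) pairs copied from the reported results (Yelp, SASRec backbone).
RESULTS = {
    "Full":       {"overall": (0.6673, 0.4208), "tail_item": (0.1893, 0.0845)},
    "Full w/o F": {"overall": (0.6069, 0.3664), "tail_item": (0.1284, 0.0541)},
    "Crop":       {"overall": (0.6477, 0.4046), "tail_item": (0.1478, 0.0675)},
    "Crop w/o F": {"overall": (0.6025, 0.3630), "tail_item": (0.1247, 0.0563)},
}


def relative_change(new, old):
    """Relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


for group in ("overall", "tail_item"):
    full_h10 = RESULTS["Full"][group][0]
    crop_h10 = RESULTS["Crop"][group][0]
    unfrozen_h10 = RESULTS["Full w/o F"][group][0]
    # Effect of cropping the prompt (frozen semantic embeddings).
    print(f"{group}: Crop vs Full H@10 = {relative_change(crop_h10, full_h10):+.2f}%")
    # Effect of unfreezing the semantic embedding layer (full prompt).
    print(f"{group}: w/o F vs Full H@10 = {relative_change(unfrozen_h10, full_h10):+.2f}%")
```

Under these numbers, cropping the prompt lowers overall H@10 by roughly 3% in relative terms, while tail-item H@10 drops by roughly 22%.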

Section: C.2 Visualization
To further investigate how LLMs enhance traditional SRS models, we visualize the item embeddings of SASRec, CITIES, MELT, our LLM-ESR (the concatenation of the semantic embedding $e^{se}$ and the collaborative embedding $e^{co}$), and the LLM using t-SNE, as shown in Figure 5. We group the items into four categories based on their popularity. The t-SNE figures reveal that the embeddings of SASRec, CITIES, and MELT tend to cluster according to item popularity. In contrast, the distribution of LLM embeddings is more uniform, indicating that the semantic relationships are not skewed by popularity. Furthermore, the embeddings of our LLM-ESR also show a more even distribution, validating that our method effectively corrects the embedding distribution in SRS and thus can enhance the performance of long-tail items.
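The popularity grouping used for coloring the visualization can be sketched as follows. This is a minimal illustration under our own assumptions: the text does not state how the four groups are formed, so the sketch ranks items by interaction count and splits them into equal-sized buckets, and the function name `popularity_groups` is hypothetical.

```python
from collections import Counter


def popularity_groups(interactions, num_groups=4):
    """Map each item to a popularity group id (0 = least popular).

    `interactions` is an iterable of user sequences, each a list of item ids.
    Equal-sized buckets over the popularity ranking are an assumption.
    """
    counts = Counter(item for seq in interactions for item in seq)
    # Sort by (count, item id) so that ties are broken deterministically.
    ranked = sorted(counts, key=lambda item: (counts[item], item))
    return {item: rank * num_groups // len(ranked)
            for rank, item in enumerate(ranked)}


# Toy example: item 1 is the most popular, item 4 the least.
toy_sequences = [[1, 2, 3, 4], [1, 2, 3], [1, 2], [1]]
groups = popularity_groups(toy_sequences)
```

The resulting group ids could then be passed as color labels to any 2-D projection of the item embeddings, such as t-SNE.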

Section: C.3 Group Analysis
To analyze in finer detail the extent to which the proposed LLM-ESR alleviates the long-tail challenges, we categorize users and items into 5 groups. The performance of each method with Bert4Rec and GRU4Rec as backbone models is shown in Table 6 and Table 7, respectively. First, we analyze the results in different user groups. As expected, all methods perform worse for users with fewer interactions, which highlights the long-tail user challenge. MELT enhances Bert4Rec well, improving performance in all groups, but it is incompatible with GRU4Rec and thus harms several groups. By comparison, LLMInit and our LLM-ESR benefit all user groups consistently. Owing to its better utilization of semantics from LLMs, LLM-ESR clearly outperforms LLMInit. Besides, the superiority is larger for more long-tailed users, i.e., the 1-4 and 5-9 groups.

Section: D Limitation
Two potential limitations should be considered for this paper. First, the proposed LLM-ESR has two hyper-parameters, i.e., the weight of the self-distillation loss α and the number of retrieved similar users N, which makes searching for the best model time-consuming. Second, only the LLM embeddings provided by the OpenAI API are validated in the experiments, whereas other more recent models [3,53] may lead to better performance. Nonetheless, the experiments on various datasets and backbone models consistently validate the effectiveness of our LLM-ESR.

Section: NeurIPS Paper Checklist

Section: Claims
Question: Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
Answer: [Yes] Justification: The contributions and scope of the paper are included in the abstract and Section 1. Please refer to the first and last paragraph of Section 1 for scope and contributions, respectively.

Section: Limitations
Question: Does the paper discuss the limitations of the work performed by the authors?
Answer: [Yes] Justification: A limitation section is included in the appendix (Section D).

Section: Theory Assumptions and Proofs
Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?

Section: Open Access to Data and Code
Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
Answer: [Yes] Justification: We have attached the data and code used in this paper in the supplementary material.

Section: Experimental Setting/Details
Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
Answer: [Yes] Justification: We provide the details of the experimental settings, such as the data split, optimizer, etc., in the experimental setting section (Section 4.1) in the main paper and the implementation detail section (Section B.3) in the appendix.

Section: Experiment Statistical Significance
Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
Answer: [Yes] Justification: We report two-sided t-test results (p < 0.05) for the main experiments, i.e., Table 1.

Section: Broader Impacts
Justification: We discuss the potential positive impacts that our algorithm will bring in the Introduction section, i.e., Section 1.

Section: Safeguards
Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
Answer: [NA]
Justification: There is no risk of misuse of the proposed method and the datasets used in the paper are open-sourced.

Section: Licenses for Existing Assets
Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
Answer: [Yes]
Justification: We have cited the original paper or attached the link to the existing assets used in this paper.

Section: New Assets
Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
Answer: [Yes]
Justification: We have included instructions on how to run the code, together with the license, in the code repository.

Section: Crowdsourcing and Research with Human Subjects
Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
Answer: [NA] Justification: This paper does not involve crowdsourcing or research with human subjects.

Section: Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
Answer: [NA] Justification: This paper does not involve crowdsourcing or research with human subjects.

Section: A Supplement to Method
In this section, the details of prompt design and the procedures of LLM-ESR are addressed.

Section: A.1 Prompt Design
In Section 3.2 and Section 3.3, we format the attributes of items and the historical interactions of users into textual prompts to obtain their semantic embeddings from LLMs. When constructing prompts, the templates play a vital role. The templates are listed as follows.
Item Prompt Template. The templates mainly organize the attributes and descriptions of items, which vary across datasets due to the different recorded attributes. In the following templates, the placeholders in angle brackets are the corresponding attributes that will be filled in.

Section: Item Prompt Template (Yelp)
The point of interest has the following attributes: name is <NAME>; category is <CATEGORY>; type is <TYPE>; open status is <OPEN>; review count is <COUNT>; city is <CITY>; average score is <STARS>.

Section: Item Prompt Template (Fashion)
The fashion item has the following attributes: name is <TITLE>; brand is <BRAND>; score is <DATE>; price is <PRICE>. The item has the following features: <FEATURE>. The item has the following descriptions: <DESCRIPTION>.

Section: Item Prompt Template (Beauty)
The beauty item has the following attributes: name is <TITLE>; brand is <BRAND>; price is <PRICE>. The item has the following features: <CATEGORIES>. The item has the following descriptions: <DESCRIPTION>.
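Filling such a template is a plain string substitution. The Python sketch below illustrates it for the Beauty template; it is not taken from the released code. The function name `build_beauty_item_prompt`, the dictionary keys, the example item, and the choice to substitute "unknown" for missing attributes are all our own assumptions.

```python
# Beauty item template from above, with placeholders as format fields.
BEAUTY_ITEM_TEMPLATE = (
    "The beauty item has the following attributes: name is {title}; "
    "brand is {brand}; price is {price}. "
    "The item has the following features: {categories}. "
    "The item has the following descriptions: {description}."
)


def build_beauty_item_prompt(item, missing="unknown"):
    """Fill the Beauty template from an attribute dict.

    Missing attributes are replaced by `missing`; this fallback is an
    assumption, since the text does not say how gaps are handled.
    """
    fields = ("title", "brand", "price", "categories", "description")
    values = {name: str(item.get(name, missing)) for name in fields}
    return BEAUTY_ITEM_TEMPLATE.format(**values)


# Hypothetical item record, used only for illustration.
example_item = {
    "title": "Lip Balm",
    "brand": "Acme",
    "price": 4.99,
    "categories": "Skin Care, Lips",
}
example_prompt = build_beauty_item_prompt(example_item)
```

The resulting string would then be sent to the embedding model once per item and cached.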
User Prompt Template. This template mainly organizes the items that the user has interacted with. To utilize the semantic information while avoiding exceeding the input length limit, each item in the prompt is represented by its title. Besides, the three datasets share a single template.

Section: User Prompt Template
The user has visited the following items: <ITEM1_TITLE>, <ITEM2_TITLE>, ... please conclude the user's preference.
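The user prompt can be assembled in the same way. The sketch below is illustrative: the function name `build_user_prompt` and the optional `max_items` truncation to the most recent titles are our assumptions (the text only states that titles are used to stay within the input length), as is the exact punctuation between the item list and the closing instruction.

```python
def build_user_prompt(item_titles, max_items=None):
    """Build the user prompt from the titles of interacted items.

    `max_items`, if given, keeps only the most recent titles; this
    truncation rule is an assumption, not part of the described method.
    """
    titles = list(item_titles)
    if max_items:
        titles = titles[-max_items:]
    listed = "".join(f"{title}, " for title in titles)
    return ("The user has visited the following items: "
            + listed
            + "please conclude the user's preference.")


# Hypothetical interaction history, oldest first.
user_prompt = build_user_prompt(["Lip Balm", "Hand Cream", "Face Mask"], max_items=2)
```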

Section: A.2 Train and Inference Process
For a clearer illustration of the training and inference process, we summarize them in Algorithm 1. First, the hyper-parameters and the backbone SRS model are specified (lines 1-3). Then, the attributes of items and the historical interactions are organized into textual prompts to get their semantic embeddings (line 4). At the beginning of training, we initialize the embedding layers in the dual-view framework (line 5). Next, we calculate the ranking loss by dual-view modeling (lines 7-9) and the auxiliary loss by retrieval-augmented self-distillation (lines 10-11). Through the sum of these two losses (line 12), we can optimize the whole LLM-ESR. During inference, only the dual-view modeling process is conducted to get the final recommendations (lines 16-17).
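The loss combination described above can be sketched in plain Python. This is a simplified illustration rather than the released implementation: it assumes that similar users are retrieved by cosine similarity over LLM user embeddings, that the distillation target is the mean representation of the retrieved users, and that the auxiliary loss is a mean squared error weighted by α. All function names are hypothetical, and the ranking loss is taken as a given number.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_similar_users(user_id, llm_user_emb, n):
    """Return the n users most similar to `user_id` (assumed: cosine on LLM embeddings)."""
    scores = [(cosine(llm_user_emb[user_id], emb), other)
              for other, emb in llm_user_emb.items() if other != user_id]
    scores.sort(reverse=True)
    return [other for _, other in scores[:n]]


def self_distillation_loss(user_id, user_repr, similar_ids):
    """MSE between a user's representation and the mean of the retrieved users' (assumed form)."""
    student = user_repr[user_id]
    dim = len(student)
    teacher = [sum(user_repr[s][d] for s in similar_ids) / len(similar_ids)
               for d in range(dim)]
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / dim


def total_loss(rank_loss, sd_loss, alpha=0.1):
    """Ranking loss plus the alpha-weighted self-distillation loss."""
    return rank_loss + alpha * sd_loss
```

At inference time none of the retrieval or distillation functions would be called; only the dual-view scoring is needed, which is consistent with the statement that the method adds no inference-time cost from the LLM.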

Section: Theory Assumptions and Proofs
Answer: [NA]
Justification: There is no theoretical result in this paper.

Section: Experimental Result Reproducibility
Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
Answer: [Yes]
Justification: We introduce the details of the experiment, such as the information on hardware and software, in the implementation detail section, i.e., Section B.3, in the appendix. Besides, we also release the code to ease reproducibility.


References:
[b0] I Amos; J Berant; A Gupta (2023). Never train from scratch: Fair comparison of long-sequence models requires data-driven priors. 
[b1] K Bao; J Zhang; Y Zhang; W Wang; F Feng; X He (2023). Tallrec: An effective and efficient tuning framework to align large language model with recommendation. 
[b2] P Behnamghader; V Adlakha; M Mosbach; D Bahdanau; N Chapados; S Reddy (2024). Llm2vec: Large language models are secretly powerful text encoders. 
[b3] G E Box; R D Meyer (1986). An analysis for unreplicated fractional factorials. Technometrics
[b4] J Chang; C Gao; Y Zheng; Y Hui; Y Niu; Y Song; D Jin; Y Li (2021). Sequential recommendation with graph neural networks. 
[b5] Q Dong; L Li; D Dai; C Zheng; Z Wu; B Chang; X Sun; J Xu; Z Sui (2022). A survey on in-context learning. 
[b6] H Fang; D Zhang; Y Shu; G Guo (2020). Deep learning for sequential recommendation: Algorithms, influential factors, and evaluations. ACM Transactions on Information Systems (TOIS)
[b7] Z Fu; X Li; C Wu; Y Wang; K Dong; X Zhao; M Zhao; H Guo; R Tang (2023). A unified framework for multi-domain ctr prediction via large language models. ACM Transactions on Information Systems
[b8] J Gao; B Chen; X Zhao; W Liu; X Li; Y Wang; Z Zhang; W Wang; Y Ye; S Lin (2024). Llm-enhanced reranking in recommender systems. 
[b9] Y Gao; T Sheng; Y Xiang; Y Xiong; H Wang; J Zhang (2023). Chat-rec: Towards interactive and explainable llms-augmented recommender system. 
[b10] B Geng; Z Huan; X Zhang; Y He; L Zhang; F Yuan; J Zhou; L Mo (2024). Breaking the length barrier: Llm-enhanced ctr prediction in long textual user behaviors. 
[b11] J Gou; B Yu; S J Maybank; D Tao (2021). Knowledge distillation: A survey. International Journal of Computer Vision
[b12] J Harte; W Zorgdrager; P Louridas; A Katsifodimos; D Jannach; M Fragkoulis (2023). Leveraging large language models for sequential recommendation. 
[b13] B Hidasi; A Karatzoglou; L Baltrunas; D Tikk (2016). Session-based recommendations with recurrent neural networks. 
[b14] E J Hu; P Wallis; Z Allen-Zhu; Y Li; S Wang; L Wang; W Chen (2021). Low-rank adaptation of large language models. 
[b15] J Hu; W Xia; X Zhang; C Fu; W Wu; Z Huan; A Li; Z Tang; J Zhou (2024). Enhancing sequential recommendation via llm-based semantic embedding learning. 
[b16] S Jang; H Lee; H Cho; S Chung (2020). Cities: Contextual inference of tail-item embeddings for sequential recommendation. IEEE
[b17] W.-C Kang; J Mcauley (2018). Self-attentive sequential recommendation. IEEE
[b18] J Devlin; M.-W Chang; K Lee; K Toutanova (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. 
[b19] K Kim; D Hyun; S Yun; C Park (2023). Melt: Mutual enhancement of long-tailed user and item for sequential recommendation. 
[b20] Y Lecun; L Bottou; Y Bengio; P Haffner (1998). Gradient-based learning applied to document recognition. 
[b21] L Li; Y Zhang; D Liu; L Chen (2023). Large language models for generative recommendation: A survey and visionary discussions. 
[b22] M Li; Z Zhang; X Zhao; W Wang; M Zhao; R Wu; R Guo (2023). Automlp: Automated mlp for sequential recommendations. 
[b23] M Li; X Zhao; C Lyu; M Zhao; R Wu; R Guo (2022). Mlp4rec: A pure mlp architecture for sequential recommendations. International Joint Conferences on Artificial Intelligence
[b24] X Li; C Chen; X Zhao; Y Zhang; C Xing (2023). E4srec: An elegant effective efficient extensible solution of large language models for sequential recommendation. 
[b25] J Liang; X Zhao; M Li; Z Zhang; W Wang; H Liu; Z Liu (2023). Mmmlp: Multi-modal multilayer perceptron for sequential recommendations. 
[b26] J Liao; S Li; Z Yang; J Wu; Y Yuan; X Wang; X He (2023). Llara: Aligning large language models with sequential recommenders. 
[b27] J Lin; X Dai; Y Xi; W Liu; B Chen; X Li; C Zhu; H Guo; Y Yu; R Tang (2023). How can recommender systems benefit from large language models: A survey. 
[b28] L Liu; L Cai; C Zhang; X Zhao; J Gao; W Wang; Y Lv; W Fan; Y Wang; M He (2023). Linrec: Linear attention mechanism for long-term sequential recommender systems. 
[b29] Q Liu; J Hu; Y Xiao; X Zhao; J Gao; W Wang; Q Li; J Tang (2024). Multimodal recommender systems: A survey. ACM Computing Surveys
[b30] Q Liu; F Tian; Q Zheng; Q Wang (2023). Disentangling interest and conformity for eliminating popularity bias in session-based recommendation. Knowledge and Information Systems
[b31] Q Liu; X Wu; W Wang; Y Wang; Y Zhu; X Zhao; F Tian; Y Zheng (2024). Large language model empowered embedding generator for sequential recommendation. 
[b32] Q Liu; X Wu; X Zhao; Y Zhu; D Xu; F Tian; Y Zheng (2024). When moe meets llms: Parameter efficient fine-tuning for multi-task medical applications. 
[b33] Q Liu; F Yan; X Zhao; Z Du; H Guo; R Tang; F Tian (2023). Diffusion augmentation for sequential recommendation. 
[b34] Q Liu; X Zhao; Y Wang; Y Wang; Z Zhang; Y Sun; X Li; M Wang; P Jia; C Chen (2024). Large language model enhanced recommender systems: Taxonomy, trend, application and future. 
[b35] S Liu; Y Zheng (2020). Long-tail session-based recommendation. 
[b36] Z Liu; Z Fan; Y Wang; P S Yu (2021). Augmenting sequential recommendation with pseudoprior items via reversely pre-training transformer. 
[b37] Z Liu; Q Liu; Y Wang; W Wang; P Jia; M Wang; Z Liu; Y Chang; X Zhao (2024). Bidirectional gated mamba for sequential recommendation. 
[b38] Z Liu; S Liu; Z Zhang; Q Cai; X Zhao; K Zhao; L Hu; P Jiang; K Gai (2024). Sequential recommendation for optimizing both immediate feedback and long-term retention. 
[b39] Z Liu; J Tian; Q Cai; X Zhao; J Gao; S Liu; D Chen; T He; D Zheng; P Jiang (2023). Multi-task recommendations with reinforcement learning. 
[b40] S Luo; Y Yao; B He; Y Huang; A Zhou; X Zhang; Y Xiao; M Zhan; L Song (2024). Integrating large language models into recommendation via mutual augmentation and adaptive aggregation. 
[b41] J Mcauley; C Targett; Q Shi; A Van Den Hengel (2015). Image-based recommendations on styles and substitutes. 
[b42] B Min; H Ross; E Sulem; A P B Veyseh; T H Nguyen; O Sainz; E Agirre; I Heintz; D Roth (2023). Recent advances in natural language processing via large pre-trained language models: A survey. ACM Computing Surveys
[b43] K Pearson (1901). Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin philosophical magazine and journal of science
[b44] X Ren; W Wei; L Xia; L Su; S Cheng; J Wang; D Yin; C Huang (2024). Representation learning with large language models for recommendation. 
[b45] J Shang; T Ma; C Xiao; J Sun (2019). Pre-training of graph augmented transformers for medication recommendation. International Joint Conferences on Artificial Intelligence
[b46] L Sheng; A Zhang; Y Zhang; Y Chen; X Wang; T.-S Chua (2024). Language models encode collaborative signals in recommendation. 
[b47] U Singer; H Roitman; Y Eshel; A Nus; I Guy; O Levi; I Hasson; E Kiperwasser (2022). Sequential modeling with multiple attributes for watchlist recommendation in e-commerce. 
[b48] F Sun; J Liu; J Wu; C Pei; X Lin; W Ou; P Jiang (2019). Bert4rec: Sequential recommendation with bidirectional encoder representations from transformer. 
[b49] J Tang; K Wang (2018). Personalized top-n sequential recommendation via convolutional sequence embedding. 
[b50] H Touvron; T Lavril; G Izacard; X Martinet; M.-A Lachaux; T Lacroix; B Rozière; N Goyal; E Hambro; F Azhar; A Rodriguez; A Joulin; E Grave; G Lample (2023). Llama: Open and efficient foundation language models. 
[b51] A Vaswani; N Shazeer; N Parmar; J Uszkoreit; L Jones; A N Gomez; Ł Kaiser; I Polosukhin (2017). Attention is all you need. Advances in neural information processing systems
[b52] L Wang; N Yang; X Huang; L Yang; R Majumder; F Wei (2023). Improving text embeddings with large language models. 
[b53] S Wang; L Hu; Y Wang; L Cao; Q Z Sheng; M Orgun (2019). Sequential recommender systems: challenges, progress and prospects. International Joint Conferences on Artificial Intelligence
[b54] Y Wang; Z Liu; J Zhang; W Yao; S Heinecke; P S Yu (2023). Drdt: Dynamic reflection with divergent thinking for llm-based sequential recommendation. 
[b55] Y Wang; Y Wang; Z Fu; X Li; X Zhao; H Guo; R Tang (2024). Llm4msr: An llm-enhanced paradigm for multi-scenario recommendation. 
[b56] L Wu; Z Qiu; Z Zheng; H Zhu; E Chen (2024). Exploring large language model for graph data understanding in online job recommendations. 
[b57] L Wu; Z Zheng; Z Qiu; H Wang; H Gu; T Shen; C Qin; C Zhu; H Zhu; Q Liu (2024). A survey on large language models for recommendation. World Wide Web
[b58] Z Yang; J Wu; Y Luo; J Zhang; Y Yuan; A Zhang; X Wang; X He (2023). Large language model can interpret latent space of sequential recommender. 
[b59] C Zhang; Q Han; R Chen; X Zhao; P Tang; H Song (2024). Ssdrec: Self-augmented sequence denoising for sequential recommendation. 
[b60] L Zhang; J Song; A Gao; J Chen; C Bao; K Ma (2019). Be your own teacher: Improve the performance of convolutional neural networks via self distillation. 
[b61] W Zhang; X Li; Y Wang; K Dong; Y Wang; X Dai; X Zhao; H Guo; R Tang (2024). Tired of plugins? large language models can be end-to-end recommenders. 
[b62] W X Zhao; K Zhou; J Li; T Tang; X Wang; Y Hou; Y Min; B Zhang; J Zhang; Z Dong (2023). A survey of large language models. 
[b63] Z Zhao; W Fan; J Li; Y Liu; X Mei; Y Wang; Z Wen; F Wang; X Zhao; J Tang (2024). Recommender systems in the era of large language models (llms). IEEE Transactions on Knowledge and Data Engineering
[b64] Z Zheng; W Chao; Z Qiu; H Zhu; H Xiong (2024). Harnessing large language models for text-rich sequential recommendation. 
[b65] K Zhou; H Yu; W X Zhao; J.-R Wen (2022). Filter-enhanced mlp is all you need for sequential recommendation. 

Figures:
Figure fig_0: 2
Type: figure
Caption: Figure 2: The overview of the proposed LLM-ESR framework.
Data: 

Figure fig_1: 3
Type: figure
Caption: Figure 3: The hyper-parameter experiments on the weight of self-distillation loss α and the number of retrieved similar users N. The results are based on the Yelp dataset with the SASRec model.
Data: 

Figure fig_2: 4
Type: figure
Caption: Figure 4: The results of the proposed LLM-ESR and competing baselines in fine-grained user and item groups. The results are based on the Beauty dataset with the SASRec model.
Data: 

Figure fig_5: 5
Type: figure
Caption: Figure 5: The visualization of the item embeddings by t-SNE. The dataset used in the experiments is Yelp. "CITIES", "MELT" and "LLM-ESR" are all based on the SASRec backbone model. "LLM" represents the embeddings derived from the LLM, which encode the semantics of textual item prompts. The color of each circle indicates the popularity group of the item.
Data: 

Figure fig_6: 6
Type: figure
Caption: Figure 6: The results of the proposed LLM-ESR and competing baselines in fine-grained user and item groups. The results are based on the Beauty dataset with the Bert4Rec model.
Data: 

Figure fig_7: 7
Type: figure
Caption: Figure 7: The results of the proposed LLM-ESR and competing baselines in fine-grained user and item groups. The results are based on the Beauty dataset with the GRU4Rec model.
Data: 

Figure tab_1: 1
Type: table
Caption: The overall results of competing baselines and our LLM-ESR. The boldface refers to the highest score and the underline indicates the next best result of the models. "*" indicates the statistically significant improvements (i.e., two-sided t-test with p < 0.05) over the best baseline.
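Data:
| Dataset | Model | Overall H@10 | Overall N@10 | Tail Item H@10 | Tail Item N@10 | Head Item H@10 | Head Item N@10 | Tail User H@10 | Tail User N@10 | Head User H@10 | Head User N@10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Yelp | GRU4Rec | 0.4879 | 0.2751 | 0.0171 | 0.0059 | 0.6265 | 0.3544 | 0.4919 | 0.2777 | 0.4726 | 0.2653 |
| Yelp | -CITIES | 0.4898 | 0.2749 | 0.0134 | 0.0051 | 0.6301 | 0.3543 | 0.4936 | 0.2783 | 0.4756 | 0.2618 |
| Yelp | -MELT | 0.4985 | 0.2825 | 0.0201 | 0.0079 | 0.6393 | 0.3633 | 0.5046 | 0.2865 | 0.4750 | 0.2671 |
| Yelp | -RLMRec | 0.4886 | 0.2777 | 0.0188 | 0.0067 | 0.6269 | 0.3574 | 0.4920 | 0.2804 | 0.4756 | 0.2671 |
| Yelp | -LLMInit | 0.4872 | 0.2749 | 0.0201 | 0.0072 | 0.6246 | 0.3537 | 0.4908 | 0.2775 | 0.4732 | 0.2647 |
| Yelp | -LLM-ESR | 0.5724* | 0.3413* | 0.0763* | 0.0318* | 0.7184* | 0.4324* | 0.5782* | 0.3456* | 0.5501* | 0.3247* |
| Yelp | Bert4Rec | 0.5307 | 0.3035 | 0.0115 | 0.0044 | 0.6836 | 0.3916 | 0.5325 | 0.3047 | 0.5241 | 0.2988 |
| Yelp | -CITIES | 0.5249 | 0.3015 | 0.0041 | 0.0014 | 0.6783 | 0.3899 | 0.5274 | 0.3032 | 0.5155 | 0.2954 |
| Yelp | -MELT | 0.6206 | 0.3770 | 0.0429 | 0.0149 | 0.7907 | 0.4836 | 0.6210 | 0.3780 | 0.6191 | 0.3733 |
| Yelp | -RLMRec | 0.5306 | 0.3039 | 0.0104 | 0.0040 | 0.6938 | 0.3922 | 0.5351 | 0.3065 | 0.5137 | 0.2936 |
| Yelp | -LLMInit | 0.6199 | 0.3781 | 0.0874 | 0.0330 | 0.7766 | 0.4797 | 0.6204 | 0.3796 | 0.6178 | 0.3723 |
| Yelp | -LLM-ESR | 0.6623* | 0.4222* | 0.1227* | 0.0500* | 0.8212* | 0.5318* | 0.6637* | 0.4247* | 0.6571* | 0.4127* |
| Yelp | SASRec | 0.5940 | 0.3597 | 0.1142 | 0.0495 | 0.7353 | 0.4511 | 0.5893 | 0.3578 | 0.6122 | 0.3672 |
| Yelp | -CITIES | 0.5828 | 0.3540 | 0.1532 | 0.0700 | 0.7093 | 0.4376 | 0.5785 | 0.3511 | 0.5994 | 0.3649 |
| Yelp | -MELT | 0.6257 | 0.3791 | 0.1015 | 0.0371 | 0.7801 | 0.4799 | 0.6246 | 0.3804 | 0.6299 | 0.3744 |
| Yelp | -RLMRec | 0.5990 | 0.3623 | 0.0953 | 0.0412 | 0.7474 | 0.4568 | 0.5966 | 0.3613 | 0.6084 | 0.3658 |
| Yelp | -LLMInit | 0.6415 | 0.3997 | 0.1760 | 0.0789 | 0.7785 | 0.4941 | 0.6403 | 0.4010 | 0.6462 | 0.3948 |
| Yelp | -LLM-ESR | 0.6673* | 0.4208* | 0.1893* | 0.0845* | 0.8080* | 0.5199* | 0.6685* | 0.4229* | 0.6627* | 0.4128* |
| Fashion | GRU4Rec | 0.4798 | 0.3809 | 0.0257 | 0.0101 | 0.6606 | 0.5285 | 0.3781 | 0.2577 | 0.6118 | 0.5408 |
| Fashion | -CITIES | 0.4762 | 0.3743 | 0.0252 | 0.0103 | 0.6557 | 0.5191 | 0.3729 | 0.2501 | 0.6103 | 0.5354 |
| Fashion | -MELT | 0.4884 | 0.3975 | 0.0291 | 0.0112 | 0.6712 | 0.5513 | 0.3890 | 0.2770 | 0.6173 | 0.5538 |
| Fashion | -RLMRec | 0.4795 | 0.3808 | 0.0253 | 0.0105 | 0.6603 | 0.5282 | 0.3773 | 0.2577 | 0.6120 | 0.5405 |
| Fashion | -LLMInit | 0.4864 | 0.4095 | 0.0250 | 0.0104 | 0.6702 | 0.5684 | 0.3852 | 0.2973 | 0.6177 | 0.5550 |
| Fashion | -LLM-ESR | 0.5409* | 0.4567* | 0.0807* | 0.0384* | 0.7242* | 0.6233* | 0.4560* | 0.3568* | 0.6512* | 0.5864* |
| Fashion | Bert4Rec | 0.4668 | 0.3613 | 0.0142 | 0.0067 | 0.6470 | 0.5024 | 0.3500 | 0.2344 | 0.6183 | 0.5258 |
| Fashion | -CITIES | 0.4926 | 0.4090 | 0.0223 | 0.0099 | 0.6799 | 0.5679 | 0.3952 | 0.2975 | 0.6190 | 0.5535 |
| Fashion | -MELT | 0.4897 | 0.3810 | 0.0059 | 0.0019 | 0.6823 | 0.5319 | 0.3842 | 0.2514 | 0.6266 | 0.5491 |
| Fashion | -RLMRec | 0.4744 | 0.3567 | 0.0044 | 0.0015 | 0.6615 | 0.4981 | 0.3626 | 0.2268 | 0.6194 | 0.5251 |
| Fashion | -LLMInit | 0.4854 | 0.4035 | 0.0328 | 0.0161 | 0.6655 | 0.5577 | 0.3773 | 0.2846 | 0.6255 | 0.5578 |
| Fashion | -LLM-ESR | 0.5487* | 0.4529* | 0.0525* | 0.0225* | 0.7462* | 0.6243* | 0.4629* | 0.3460* | 0.6599* | 0.5916* |
| Fashion | SASRec | 0.4956 | 0.4429 | 0.0454 | 0.0235 | 0.6748 | 0.6099 | 0.3967 | 0.3390 | 0.6239 | 0.5777 |
| Fashion | -CITIES | 0.4923 | 0.4423 | 0.0407 | 0.0214 | 0.6721 | 0.6098 | 0.3936 | 0.3392 | 0.6203 | 0.5760 |
| Fashion | -MELT | 0.4875 | 0.4150 | 0.0368 | 0.0144 | 0.6670 | 0.5745 | 0.3792 | 0.2933 | 0.6280 | 0.5729 |
| Fashion | -RLMRec | 0.4982 | 0.4457 | 0.0410 | 0.0223 | 0.6803 | 0.6143 | 0.3990 | 0.3415 | 0.6270 | 0.5808 |
| Fashion | -LLMInit | 0.5119 | 0.4492 | 0.0596 | 0.0305 | 0.6920 | 0.6159 | 0.4184 | 0.3501 | 0.6332 | 0.5777 |
| Fashion | -LLM-ESR | 0.5619* | 0.4743* | 0.1095* | 0.0520* | 0.7420* | 0.6424* | 0.4811* | 0.3769* | 0.6668* | 0.6005* |
| Beauty | GRU4Rec | 0.3683 | 0.2276 | 0.0796 | 0.0567 | 0.4371 | 0.2683 | 0.3584 | 0.2191 | 0.4135 | 0.2663 |
| Beauty | -CITIES | 0.2456 | 0.1400 | 0.1122 | 0.0760 | 0.2774 | 0.1552 | 0.2382 | 0.1346 | 0.2795 | 0.1645 |
| Beauty | -MELT | 0.3702 | 0.2161 | 0.0009 | 0.0003 | 0.4582 | 0.2675 | 0.3637 | 0.2116 | 0.3997 | 0.2365 |
| Beauty | -RLMRec | 0.3668 | 0.2278 | 0.0780 | 0.0560 | 0.4357 | 0.2688 | 0.3576 | 0.2202 | 0.4089 | 0.2626 |
| Beauty | -LLMInit | 0.4151 | 0.2713 | 0.0896 | 0.0637 | 0.4928 | 0.3208 | 0.4059 | 0.2621 | 0.4571 | 0.3133 |
| Beauty | -LLM-ESR | 0.4917* | 0.3140* | 0.1547* | 0.0801* | 0.5721* | 0.3698* | 0.4851* | 0.3079* | 0.5220* | 0.3420* |
| Beauty | Bert4Rec | 0.3984 | 0.2367 | 0.0101 | 0.0038 | 0.4910 | 0.2922 | 0.3851 | 0.2272 | 0.4593 | 0.2801 |
| Beauty | -CITIES | 0.3961 | 0.2339 | 0.0023 | 0.0008 | 0.4900 | 0.2895 | 0.3832 | 0.2250 | 0.4551 | 0.2746 |
| Beauty | -MELT | 0.4716 | 0.2965 | 0.0709 | 0.0291 | 0.5671 | 0.3603 | 0.4596 | 0.2865 | 0.5263 | 0.3423 |
| Beauty | -RLMRec | 0.3977 | 0.2365 | 0.0090 | 0.0032 | 0.4903 | 0.2921 | 0.3853 | 0.2277 | 0.4539 | 0.2765 |
| Beauty | -LLMInit | 0.5029 | 0.3209 | 0.0927 | 0.0451 | 0.6007 | 0.3867 | 0.4919 | 0.3117 | 0.5530 | 0.3632 |
| Beauty | -LLM-ESR | 0.5393* | 0.3590* | 0.1379* | 0.0745* | 0.6350* | 0.4269* | 0.5295* | 0.3507* | 0.5839* | 0.3972* |
| Beauty | SASRec | 0.4388 | 0.3030 | 0.0870 | 0.0649 | 0.5227 | 0.3598 | 0.4270 | 0.2941 | 0.4926 | 0.3438 |
| Beauty | -CITIES | 0.2256 | 0.1413 | 0.1363 | 0.0897 | 0.2468 | 0.1536 | 0.2215 | 0.1406 | 0.2441 | 0.1444 |
| Beauty | -MELT | 0.4334 | 0.2775 | 0.0460 | 0.0172 | 0.5258 | 0.3995 | 0.4233 | 0.2673 | 0.4796 | 0.3241 |
| Beauty | -RLMRec | 0.4460 | 0.3075 | 0.0924 | 0.0658 | 0.5303 | 0.3652 | 0.4365 | 0.3016 | 0.4892 | 0.3345 |
| Beauty | -LLMInit | 0.5455 | 0.3656 | 0.1714 | 0.0965 | 0.6347 | 0.4298 | 0.5359 | 0.3592 | 0.5893 | 0.3948 |
| Beauty | -LLM-ESR | 0.5672* | 0.3713* | 0.2257* | 0.1108* | 0.6486* | 0.4334* | 0.5581* | 0.3643* | 0.6087* | 0.4032* |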

Figure tab_2: 2
Type: table
Caption: The ablation study on the Yelp dataset with SASRec as the backbone SRS model. The boldface refers to the highest score and the underline indicates the next best result of the models.
Data: Model | Overall | Tail Item | Head Item | Tail User | Head User

Figure tab_3: 
Type: table
Caption: From the results, we observe that the proposed LLM-ESR achieves the best overall performance on both metrics, which indicates a stronger enhancement effect. LLMInit is often the second best, which shows that injecting semantics from LLMs does augment the SRS. However, RLMRec often underperforms the other LLM-based methods because it is designed for collaborative filtering algorithms and does not transfer well to SRS. Among the traditional baselines, MELT leads in most cases because it addresses the long-tail user and long-tail item challenges simultaneously. By comparison, CITIES is sometimes even inferior to the backbone SRS model due to the seesaw problem, i.e., drastic drops for popular items.
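Data:
| Model | Overall H@10 | Overall N@10 | Tail Item H@10 | Tail Item N@10 | Head Item H@10 | Head Item N@10 | Tail User H@10 | Tail User N@10 | Head User H@10 | Head User N@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| -LLM-ESR | 0.6673 | 0.4208 | 0.1893 | 0.0845 | 0.8080 | 0.5199 | 0.6685 | 0.4229 | 0.6627 | 0.4128 |
| -w/o Co-view | 0.6320 | 0.3816 | 0.1898 | 0.0856 | 0.7621 | 0.4687 | 0.6318 | 0.3823 | 0.6325 | 0.3787 |
| -w/o Se-view | 0.6468 | 0.4038 | 0.1105 | 0.0460 | 0.8047 | 0.5091 | 0.6459 | 0.4043 | 0.6501 | 0.4018 |
| -w/o SD | 0.6572 | 0.4121 | 0.2003 | 0.0898 | 0.7911 | 0.5071 | 0.6566 | 0.4130 | 0.6574 | 0.4091 |
| -w/o Share | 0.6595 | 0.4158 | 0.1728 | 0.0783 | 0.8027 | 0.5152 | 0.6606 | 0.4186 | 0.6552 | 0.4055 |
| -w/o CA | 0.6644 | 0.4160 | 0.1850 | 0.0803 | 0.8004 | 0.5119 | 0.6652 | 0.4175 | 0.6616 | 0.4105 |
| 1-layer Adapter | 0.6108 | 0.3713 | 0.1107 | 0.0469 | 0.7580 | 0.4668 | 0.6065 | 0.3702 | 0.6269 | 0.3754 |
| Random Init | 0.6440 | 0.3984 | 0.1899 | 0.0839 | 0.7777 | 0.4910 | 0.6454 | 0.4018 | 0.6388 | 0.3853 |
Overall Comparison.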

Figure tab_4: 2
Type: table
Caption: First, we remove the collaborative view or the semantic view to investigate the dual-view modeling, denoted as w/o Co-view and w/o Se-view. The results show that w/o Co-view degrades performance sharply on the head group, while w/o Se-view clearly harms tail items. These changes indicate that collaborative and semantic information have distinct strengths, which motivates combining both. w/o SD drops self-distillation and shows performance drops for long-tail users, which supports the effect of the proposed retrieval-augmented self-distillation. The results of these three variants validate the motivation behind each component of LLM-ESR. w/o Share and w/o CA denote using separate sequence encoders and removing cross-attention, respectively. The performance decrease of these two variants illustrates the effectiveness of the sharing design and the sequence-level fusion. More results can be seen in Appendix C.1. Furthermore, we have two designs to ease the optimization of the entire LLM-ESR framework. One is that we use dimension-reduced LLM item embeddings to initialize the collaborative embedding
Data: 

Figure tab_5: 
Type: table
Caption: Train and inference process of LLM-ESR
1: Indicate the backbone sequential recommendation model f_θ.
2: Indicate the number of retrieved similar users N.
3: Indicate the weight of self-distillation loss α.
4: Get the semantic embeddings E_se and U_llm by LLMs.
Train Process
5: Initialize the embedding layers in the dual-view framework by the raw and dimension-reduced E_se. Freeze the raw E_se.
6: for a batch of users U_B in U do
Data: 7: research was partially supported by National Key Research and Development Program of China (2022YFC3303600), National Natural Science Foundation of China (No.62192781, No.62177038, No.62293551, No.62277042, No.62137002, No.61721002, No.61937001, No.62377038), Project of China Knowledge Centre for Engineering Science and Technology, "LENOVO-XJTU" Intelligent Industry Joint Laboratory Project, Research Impact Fund (No.R1015-23), Collaborative Research Fund (No.C1043-24GF), APRC - CityU New Research Initiatives (No.9610565, Start-up Grant for

Figure tab_6: 
Type: table
Caption: 13: end for
Inference Process
14: Load E_se for item embedding layers and other trained parameters.
15: for each user u_k in U do
Data: 16:

Figure tab_7: 3
Type: table
Caption: The statistics of the preprocessed datasets
Data:
| Dataset | # Users | # Items | Sparsity | Avg. length |
|---|---|---|---|---|
| Yelp | 15,720 | 11,383 | 99.89% | 12.23 |
| Fashion | 9,049 | 4,722 | 99.92% | 3.82 |
| Beauty | 52,204 | 57,289 | 99.99% | 7.57 |
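As a quick sanity check on the statistics above, the following sketch recomputes the sparsity column. It assumes (this definition is not stated in the table) that sparsity is 1 minus the number of interactions divided by users × items, and that the number of interactions is roughly users × average sequence length. The function name `sparsity` and the dictionary layout are illustrative only.

```python
def sparsity(n_users, n_items, avg_len):
    """Share of empty cells in the user-item matrix (assumed definition)."""
    # Assumption: total interactions ~= number of users * average sequence length.
    n_interactions = n_users * avg_len
    return 1.0 - n_interactions / (n_users * n_items)

# (users, items, average sequence length) copied from the statistics table.
datasets = {
    "Yelp": (15720, 11383, 12.23),
    "Fashion": (9049, 4722, 3.82),
    "Beauty": (52204, 57289, 7.57),
}

# Sparsity in percent, rounded to two decimals like the table.
report = {name: round(100 * sparsity(*stats), 2) for name, stats in datasets.items()}
print(report)
```

Under these assumptions the recomputed percentages agree with the reported Sparsity column to two decimals, which suggests the column follows this common definition.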

Figure tab_8: 4
Type: table
Caption: The ablation study on the Yelp dataset with Bert4Rec as the backbone SRS model. The boldface refers to the highest score and the underline indicates the next best result of the models.
Data: Model | Overall | Tail Item | Head Item | Tail User | Head User

Figure tab_9: 5
Type: table
Caption: The ablation study on the Yelp dataset with GRU4Rec as the backbone SRS model. The boldface refers to the highest score and the underline indicates the next best result of the models.
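Data:
| Model | Overall H@10 | Overall N@10 | Tail Item H@10 | Tail Item N@10 | Head Item H@10 | Head Item N@10 | Tail User H@10 | Tail User N@10 | Head User H@10 | Head User N@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| -LLM-ESR | 0.6623 | 0.4222 | 0.1227 | 0.0500 | 0.8212 | 0.5318 | 0.6637 | 0.4247 | 0.6571 | 0.4127 |
| -w/o Co-view | 0.6273 | 0.3737 | 0.1272 | 0.0520 | 0.7745 | 0.4684 | 0.6296 | 0.3760 | 0.6184 | 0.3647 |
| -w/o Se-view | 0.6521 | 0.4125 | 0.0981 | 0.0395 | 0.8153 | 0.5224 | 0.6533 | 0.4150 | 0.6477 | 0.4031 |
| -w/o SD | 0.6539 | 0.4114 | 0.1299 | 0.0534 | 0.8081 | 0.5168 | 0.6539 | 0.4129 | 0.6538 | 0.4055 |
| -w/o Share | 0.6592 | 0.4193 | 0.1182 | 0.0480 | 0.8187 | 0.5276 | 0.6619 | 0.4229 | 0.6482 | 0.4100 |
| -w/o CA | 0.6368 | 0.3924 | 0.0940 | 0.0369 | 0.7966 | 0.4971 | 0.6369 | 0.3940 | 0.6367 | 0.3862 |
Model | Overall | Tail Item | Head Item | Tail User | Head User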

Figure tab_12: 
Type: table
Caption: 8. Experiments Compute Resources
Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
Answer: [Yes]
Justification: We provide the details of compute resources in the implementation detail section (Section B.3) in the appendix.
9. Code Of Ethics
Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
Answer: [Yes]
Justification: We have made sure that our paper conforms with the NeurIPS Code of Ethics.
Data: 10. Broader Impacts
Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
Answer: [Yes]


Formulas:
Formula formula_0: v^{(u)}_{1}, \dots, v^{(u)}_{n_u}

Formula formula_2: \arg\max_{v_i \in V} P(v_{n_u+1} = v_i \mid S_u) \quad (1)

Formula formula_3: e^{se}_i = W^{a}_{2} (W^{a}_{1} e^{llm}_i + b^{a}_{1}) + b^{a}_{2} \quad (2)

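Formula formula_4: W^{a}_{1} \in \mathbb{R}^{\frac{d_{llm}}{2} \times d_{llm}}, W^{a}_{2} \in \mathbb{R}^{d \times \frac{d_{llm}}{2}} and b^{a}_{1} \in \mathbb{R}^{\frac{d_{llm}}{2} \times 1}, b^{a}_{2} \in \mathbb{R}^{d \times 1}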
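To make the shapes in Eq. (2) concrete, here is a minimal pure-Python sketch of the two-layer linear adapter that maps a d_llm-dimensional LLM embedding to d dimensions through a hidden width of d_llm / 2. The random initialisation, the zero biases, and the names `linear` and `make_adapter` are assumptions for illustration; as in the equation as written, no nonlinearity is applied between the two layers.

```python
import random

def linear(W, b, x):
    """Return W x + b for a row-major matrix W and vectors b, x."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def make_adapter(d_llm, d, seed=0):
    """Build e_se = W2 (W1 e_llm + b1) + b2 with hypothetical random weights."""
    rng = random.Random(seed)
    hidden = d_llm // 2  # hidden width d_llm / 2, as in the stated shapes
    W1 = [[rng.uniform(-0.1, 0.1) for _ in range(d_llm)] for _ in range(hidden)]
    b1 = [0.0] * hidden  # zero biases are an assumption of this sketch
    W2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(d)]
    b2 = [0.0] * d

    def adapter(e_llm):
        return linear(W2, b2, linear(W1, b1, e_llm))

    return adapter

adapter = make_adapter(d_llm=8, d=3)
e_se = adapter([0.5] * 8)  # toy 8-dimensional "LLM embedding"
print(len(e_se))
```

In a real system the LLM embeddings would be pre-computed and cached, and only the adapter weights would be trained; the toy dimensions here are chosen purely for readability.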

Formula formula_5: u^{se} = f_{\theta}(S^{se}) \quad (3)

Formula formula_6: Q = S^{se} W^{Q}, K = S^{co} W^{K}, V = S^{co} W^{V}, where W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{d \times d} are weight matrices.

Formula formula_7: \hat{S}^{co} = \mathrm{Softmax}\left( \frac{Q K^{T}}{\sqrt{d}} \right) V \quad (4)
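The cross-attention in Eq. (4) can be sketched in plain Python as follows, with the semantic sequence supplying the queries and the collaborative sequence supplying keys and values. This is a single-head toy version under stated assumptions: the helper names `matmul`, `softmax`, and `cross_attention` are hypothetical, identity projection matrices are used only for the demonstration, and masking or multi-head details are not modelled.

```python
import math

def matmul(A, B):
    """Multiply two row-major matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(S_se, S_co, W_Q, W_K, W_V):
    """Softmax(Q K^T / sqrt(d)) V with Q from S_se and K, V from S_co."""
    d = len(W_Q[0])
    Q, K, V = matmul(S_se, W_Q), matmul(S_co, W_K), matmul(S_co, W_V)
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V), weights

identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
S_se = [[1.0, 0.0], [0.0, 1.0]]      # toy semantic sequence (2 items, d = 2)
S_co = [[1.0, 2.0], [1.0, 2.0]]      # toy collaborative sequence
S_hat_co, attn = cross_attention(S_se, S_co, identity, identity, identity)
print(S_hat_co)
```

Because both rows of the toy collaborative sequence are identical, every attention row is uniform and each output row equals that shared row, which gives a simple check of the softmax-weighted averaging.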

Formula formula_8: P(v_{n_u+1} = v_j \mid v_{1:n_u}) = [e^{se}_j : e^{co}_j]^{T} [u^{se} : u^{co}] \quad (5)
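Eq. (5) scores an item by the inner product of the concatenated item embeddings with the concatenated user representations. A minimal sketch, with the hypothetical function name `score` and toy vectors, shows that this equals the sum of the per-view dot products:

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def score(e_se_j, e_co_j, u_se, u_co):
    """[e_se_j : e_co_j]^T [u_se : u_co], where ':' denotes concatenation."""
    item = e_se_j + e_co_j  # list concatenation plays the role of ':'
    user = u_se + u_co
    return dot(item, user)

# Toy vectors: 2-dimensional semantic view, 1-dimensional collaborative view.
s = score([1.0, 2.0], [3.0], [4.0, 5.0], [6.0])
per_view = dot([1.0, 2.0], [4.0, 5.0]) + dot([3.0], [6.0])
print(s, per_view)
```

The equality with the per-view sum makes explicit that the semantic and collaborative views contribute additively to the final score.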

Formula formula_9: \mathcal{L}_{Rank} = - \sum_{u \in U} \sum_{k=1}^{n_u} \log \sigma\left( P(v^{+}_{k+1} \mid v_{1:k}) - P(v^{-}_{k+1} \mid v_{1:k}) \right) \quad (6)
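The pairwise ranking loss in Eq. (6) can be illustrated with a short sketch. It assumes one sampled negative per positive position and takes the scores as precomputed numbers; the names `log_sigmoid` and `rank_loss` are hypothetical, and the summation over users is flattened into a single list of score pairs.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def rank_loss(pos_scores, neg_scores):
    """-sum log sigma(P(positive) - P(negative)) over paired scores."""
    # Assumption: exactly one negative sample is paired with each positive.
    return -sum(log_sigmoid(p - n) for p, n in zip(pos_scores, neg_scores))

tie = rank_loss([0.0], [0.0])    # equal scores give log(2)
good = rank_loss([5.0], [0.0])   # positive ranked above negative
bad = rank_loss([0.0], [5.0])    # negative ranked above positive
print(tie, good, bad)
```

The loss shrinks as the positive item's score exceeds the negative item's score, which is the ordering behaviour the objective is meant to encourage.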

Formula formula_10: U_k = \mathrm{Top}\left( \{ \cos(u^{llm}_k, u^{llm}_j) \}_{j=1}^{|U|}, N \right) \quad (7)

Formula formula_11: [u^{se}_{T_k} : u^{co}_{T_k}] = \mathrm{Mean\_Pooling}\left( \{ [u^{se}_j : u^{co}_j] \}_{j=1}^{|U_k|} \right) \quad (8)

Formula formula_12: \mathcal{L}_{SD} = \frac{1}{|U|} \sum_{k=1}^{|U|} \left| [u^{se}_k : u^{co}_k] - [u^{se}_{T_k} : u^{co}_{T_k}] \right|^{2} \quad (9)
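Eqs. (7) to (9) can be read together as: retrieve the N users whose LLM embeddings are most similar, mean-pool their representations into a teacher, and penalise the squared distance between each user and its teacher. The sketch below follows that reading under stated assumptions: the user itself is excluded from its own neighbour set (Eq. (7) does not say either way), LLM user embeddings are non-zero, and the names `cosine`, `retrieve_similar`, `mean_pool`, and `sd_loss` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_similar(k, U_llm, N):
    """Indices of the N users most similar to user k by LLM embedding."""
    # Assumption: user k is skipped so it cannot act as its own teacher.
    sims = [(cosine(U_llm[k], u), j) for j, u in enumerate(U_llm) if j != k]
    sims.sort(key=lambda t: t[0], reverse=True)
    return [j for _, j in sims[:N]]

def mean_pool(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sd_loss(user_reprs, U_llm, N):
    """Mean squared distance between each user and its pooled teacher."""
    total = 0.0
    for k, student in enumerate(user_reprs):
        teacher = mean_pool([user_reprs[j] for j in retrieve_similar(k, U_llm, N)])
        total += sum((s - t) ** 2 for s, t in zip(student, teacher))
    return total / len(user_reprs)

U_llm = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]        # toy LLM user embeddings
user_reprs = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]   # toy [u_se : u_co] vectors
neighbours = retrieve_similar(0, U_llm, 1)
loss = sd_loss(user_reprs, U_llm, 2)
print(neighbours, loss)
```

A training implementation would also need to decide whether gradients flow through the teacher representation, which this sketch does not model. Since the LLM user embeddings are fixed, the neighbour sets could be computed once and cached rather than recomputed per batch.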

Formula formula_13: \mathcal{L} = \mathcal{L}_{Rank} + \alpha \cdot \mathcal{L}_{SD} \quad (10)
