Abstract: Recommendation algorithms have become increasingly important in many online platforms, such as online education, TikTok, YouTube Shorts, and advertising platforms. The multi-armed bandit (MAB) [2] is a classic problem that can model these recommendation systems. Each arm in an MAB corresponds to a specific type of item in the recommendation system, and recommending an item of the <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$i\text{th}$</tex> type is regarded as a pull of arm <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$a_{i}$</tex>. Taking short-video recommendation as an example, each arm <tex xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">$a_{i}$</tex> represents a class of similar videos (e.g., videos from the same dancer). For simplicity, we assume the reward is 1 if the user likes the recommended item and 0 otherwise. In a traditional MAB problem, the learner can keep playing the arms with the goal of maximizing the average reward, which either assumes a single user stays in the system for a long period of time or assumes the learner recommends a single item to each of a large number of users. While this traditional MAB formulation models recommendation systems such as online advertising well, newer recommendation systems differ from it significantly. In these systems, such as TikTok or ALEKS, the learner continuously recommends videos/contents to a user, and the user, beyond liking or disliking each item, may abandon the system if the recommended items fail to engage them, and come back later. For example, a user watches TikTok or YouTube Shorts for some period of time, whose duration depends on how interesting/engaging the videos are, then leaves the system, and comes back later.
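The abandonment behavior described above can be sketched in code. The following is a minimal illustration, not the paper's algorithm: an epsilon-greedy learner on a Bernoulli bandit where the user leaves the session after a run of consecutive disliked recommendations. The arm-class structure, the `patience` threshold, and all numeric parameters are illustrative assumptions.

```python
import random

def run_session(true_probs, eps=0.1, patience=3, max_steps=1000, seed=0):
    """Simulate one user session.

    true_probs[i] is the (unknown to the learner) probability that the user
    likes an item of type i, i.e. a pull of arm a_i yields reward 1.
    The session ends when the user abandons (after `patience` consecutive
    dislikes) or after max_steps recommendations.
    """
    rng = random.Random(seed)
    n = len(true_probs)
    pulls = [0] * n           # times each arm was recommended
    likes = [0] * n           # reward-1 outcomes per arm
    consecutive_dislikes = 0
    total_reward = 0
    for _ in range(max_steps):
        # Explore with probability eps; otherwise exploit the empirical best arm
        # (unpulled arms get an optimistic estimate of 1.0 so they get tried).
        if rng.random() < eps:
            a = rng.randrange(n)
        else:
            a = max(range(n),
                    key=lambda i: likes[i] / pulls[i] if pulls[i] else 1.0)
        reward = 1 if rng.random() < true_probs[a] else 0
        pulls[a] += 1
        likes[a] += reward
        total_reward += reward
        # The user abandons the system after `patience` dislikes in a row.
        consecutive_dislikes = 0 if reward else consecutive_dislikes + 1
        if consecutive_dislikes >= patience:
            break
    return total_reward, sum(pulls)

reward, session_length = run_session([0.2, 0.8, 0.5])
```

Unlike the traditional average-reward objective, a poor sequence of pulls here shortens the session itself, so the learner's exploration directly affects how long the user stays.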