Abstract: Web page categorization has been extensively studied in the literature and has been successfully used to improve information retrieval, recommendation, personalization and ad targeting. With the new industry trend of not tracking users’ online behavior without their explicit permission, using contextual targeting to accurately understand web pages in order to display ads that are topically relevant to the pages becomes more important. This is challenging, however, because an ad request only contains the URL of a web page. As a result, there is very limited available text for making accurate classifications. In this paper, we propose a unified multilingual model that can seamlessly classify web pages in 5 high-impact languages using either their full content or just their URLs with limited text. We adopt multiple data sampling techniques to increase coverage for rare categories in our training corpus, and modify the loss using class-based re-weighting to smooth the influence of frequent versus rare categories. We also propose using an ensemble of teacher models for knowledge distillation and explore different ways to create a teacher ensemble. Offline evaluation shows at least 2.6% improvement in mean average precision across 5 languages compared to a URL classification model trained with single-teacher knowledge distillation. The unified model for both full-content and URL-only input further improves the mean average precision of the dedicated URL classification model by 0.6%. We launched the proposed models, which achieve at least 37% better mean average precision than the legacy tree-based models, for contextual targeting in the Yahoo Demand Side Platform, leading to a significant ad delivery and revenue increase.
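The abstract does not include an implementation, but the two training ideas it names (class-based loss re-weighting for rare categories and knowledge distillation from an ensemble of teacher models) can be illustrated with a short sketch. The sketch below is an assumption-laden toy in PyTorch: the function names, the inverse-frequency weighting with a smoothing exponent, the logit averaging over teachers, and the temperature/alpha values are all illustrative choices, not the paper's actual method.

```python
import torch
import torch.nn.functional as F


def class_weights(class_counts, smoothing=0.5):
    """Inverse-frequency class weights with a smoothing exponent, so rare
    categories are up-weighted without letting them dominate the loss.
    (Illustrative re-weighting scheme; the paper's exact formula may differ.)"""
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    weights = (counts.sum() / counts) ** smoothing
    return weights / weights.mean()  # normalize around 1.0


def distillation_loss(student_logits, teacher_logits_list, targets,
                      weights, temperature=2.0, alpha=0.5):
    """Class-re-weighted cross-entropy on hard labels, blended with a KL
    term against the averaged soft predictions of a teacher ensemble."""
    # Supervised term: per-class weights smooth frequent vs. rare categories.
    ce = F.cross_entropy(student_logits, targets, weight=weights)

    # Ensemble soft targets: average the teachers' tempered probabilities.
    teacher_probs = torch.stack(
        [F.softmax(t / temperature, dim=-1) for t in teacher_logits_list]
    ).mean(dim=0)

    # Distillation term: KL divergence between student and ensemble.
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        teacher_probs,
        reduction="batchmean",
    ) * temperature ** 2

    return alpha * ce + (1.0 - alpha) * kd


if __name__ == "__main__":
    # Toy usage: 4 categories, a batch of 3 examples, 2 teacher models.
    weights = class_weights([10000, 500, 200, 50])
    student = torch.randn(3, 4, requires_grad=True)
    teachers = [torch.randn(3, 4), torch.randn(3, 4)]
    targets = torch.tensor([0, 2, 3])
    loss = distillation_loss(student, teachers, targets, weights)
    loss.backward()
    print(float(loss))
```

Averaging the teachers' tempered probabilities is only one way to combine an ensemble for distillation; the abstract notes that the authors explore different ways to build the teacher ensemble.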
External IDs: dblp:journals/tkde/YeBOATPA24