Abstract: An ads ecosystem needs robust, scalable mechanisms to safeguard
users from low-quality ads. Contemporary ad creatives typically
combine several modalities, such as text, images, and
video, so any system that flags low-quality ad content
needs a holistic multimodal representation of the ad. In this paper,
we demonstrate that modern Transformer-based neural network
models are effective multimodal learners. We report significant
performance gains on YouTube video ads for the task of content
quality prediction when moving from simpler feed-forward neural
networks to Transformer-based models. We provide ablation
studies to understand the impact of each input modality, and
compare various flavors of Transformer architectures. We hope that
our experiments help practitioners looking to incorporate these
powerful multimodal models into other parts of the ads ecosystem.