Abstract: Re-ranking plays a crucial role in product search by reassessing the products returned by the primary retrieval system against specific engagement and relevance criteria. While transformer-based models such as the cross encoder have advanced the relevance of ranking models in recent years, running a cross encoder at runtime incurs a high latency cost. This challenge is more pronounced in the long-tail segment, where conventional techniques such as caching prove ineffective. To tackle these issues, our paper introduces a scalable framework featuring a BERT-based cross encoder model for re-ranking, deployed in the Walmart search engine. We employ strategies such as intermediate representations, operator fusion, and vectorization to reduce the inference latency of the cross encoder model. Furthermore, we provide a detailed discussion of the runtime implementation, highlighting key learnings and practical tricks that ensured minimal impact on response latency in production. Finally, we present the results of online experiments, including manual evaluation and an interleaving test conducted on real-world e-commerce search traffic.
External IDs: dblp:conf/sigir/Puthenputhussery25