ERiC-UP$^3$ Benchmark: E-Commerce Risk Intelligence Classifier for Detecting Infringements Based on Utility Patent and Product Pairs

23 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Benchmark; Product-Patent Infringement Detection; Large-scale Multi-Modality Dataset; Contrastive Learning; Retrieval; Domain Gap
Abstract: Innovation is a key driver of economic and social progress, and Intellectual Property (IP) protection through patents plays a crucial role in safeguarding new creations. For businesses that actively produce goods, detecting potential patent infringement is vital to avoid costly litigation and operational disruptions. However, the significant domain gap between products and patents, coupled with the vast scale of existing patent databases, makes infringement detection a complex and challenging task. Moreover, the machine learning (ML) community has not widely addressed this problem, partly because comprehensive datasets tailored to the task are lacking. In this paper, we first formulate a new task: detecting potentially infringed patents for a given product represented by multi-modal data, including images and textual descriptions. This task requires a deep understanding of both technical and legal contexts, extending beyond simple text or image matching to assess functional similarities that may not be immediately apparent. To promote research in this challenging area, we further introduce the ERiC-UP$^3$ ($\textbf{E}$-commerce $\textbf{R}$isk $\textbf{i}$ntelligence $\textbf{C}$lassifier on $\textbf{U}$tility $\textbf{P}$atent $\textbf{P}$roduct $\textbf{P}$air) benchmark, a large-scale, well-structured dataset comprising over 13 million patent samples and 1 million product samples. It includes 11,000 annotated infringement pairs for training and 2,000 for testing, all reviewed by patent experts to ensure high-quality annotations. The dataset reflects real-world scenarios through its multi-modal nature and its demand for deep functional understanding, characteristics that set it apart from existing resources. As a case study, we report results from a series of baseline methods and propose a simple yet effective infringement detection pipeline. We also explore additional approaches that may enhance detection performance, such as text style rewriting, cross-modal matching, and image domain alignment. Overall, the ERiC-UP$^3$ benchmark is the first strictly annotated product-patent infringement detection dataset, the largest multi-modal patent dataset, and one of the largest multi-modal product datasets available. We aim to advance research that extends language and multi-modal models to diverse and dynamic real-world data distributions, fostering innovation and practical solutions in IP infringement detection.
Primary Area: datasets and benchmarks
Submission Number: 2924