Abstract: Given a million escort advertisements, how can we spot near-duplicates? Such micro-clusters of ads are usually signals of human trafficking (HT). How can we summarize them to convince law enforcement to act? Spotting micro-clusters of near-duplicate documents is also useful in several other settings, including spam-bot detection in Twitter ads, plagiarism detection, and more. We present InfoShield, which has the following properties: it is practical, being scalable and effective on real data; parameter-free and principled, requiring no user-defined parameters; interpretable, selecting one document as the cluster representative, highlighting all the common phrases, and automatically detecting “slots” (i.e., phrases that differ in every document); and generalizable, beating or matching domain-specific methods in Twitter bot detection and HT detection, respectively, as well as being language-independent. Interpretability is particularly important in the anti-HT domain, where law enforcement must visually inspect ads. Our experiments on real data show that InfoShield correctly identifies Twitter bots with an F1 score over 90% and detects HT ads with 84% precision. Moreover, it is scalable, requiring about 8 hours for 4 million documents on a stock laptop. Our incremental version, DeltaShield, allows for fast, incremental updates with only a minor loss of accuracy.