Approximating Naive Bayes on Unlabelled Categorical Data

Published: 15 Sept 2023, Last Modified: 15 Sept 2023, Accepted by TMLR
Abstract: We address the question of binary classification when no labels are available and the input features are categorical. The lack of labels means supervised approaches cannot be used, and the lack of a natural distance measure means that most unsupervised methods do poorly. For such problems, where the alternatives might be a) doing nothing or b) heuristic rule-based approaches, we offer a third alternative: a classifier that approximates Naive Bayes. Our primary scenarios are those that involve distinguishing scripted, or bot, web traffic from that of legitimate users. Our main assumption is the existence of some attribute $x_*$ more prevalent in the benign than the scripted traffic; i.e., $P(x_*|\overline{\mbox{bot}}) = K \cdot P(x_*|\mbox{bot}),$ for $K>1.$ We show that any such disparity yields a lower bound on $P(\mbox{bot}|x_{j})$ even when we have no prior estimates of $P(x_*|\overline{\mbox{bot}}),$ $P(x_*|\mbox{bot})$ or $K$ (except that $K>1$). We show that when at least one bin of at least one feature receives no attack traffic, we under-estimate the actual conditional probability by a factor of $1-1/K.$ Thus, any attribute with a large disparity between its prevalence in benign and abuse traffic (i.e., large $K$) allows a good approximation of the Naive Bayes classifier without the benefit of labels. The approach is particularly suited to problems where $K$ is high and the approximation is therefore very accurate.
Example problems (and relevant attributes) might be: password-guessing, if login attempts from legitimate users succeed at a much higher rate than those from password-guessing attackers; Credit Card Verification Value (CVV) guessing, if an attacker exhaustively tries all possible 3- or 4-digit values and fails at a higher rate than legitimate users; account registration, if legitimate users use email addresses from services that do not allow free anonymous accounts (e.g., {\tt .edu}) at a much higher rate than attackers; click-fraud, if legitimate users visit pages and services that contain no ads at a higher rate than click-fraud bots.
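The lower bound described in the abstract can be illustrated with a minimal sketch. This is not taken from the paper; it is one estimator that is consistent with the abstract's claims, under two stated assumptions: (1) $x_*$ is conditionally independent of the other features given the class, as in Naive Bayes, and (2) at least one bin of the feature receives no attack traffic. Under these assumptions $P(x_*|x_j) = P(x_*|\mbox{bot}) \cdot [K - (K-1) P(\mbox{bot}|x_j)]$, so the bin with the highest rate of $x_*$ is the attack-free one, and $1 - P(x_*|x_j) / \max_i P(x_*|x_i) = (1 - 1/K) \cdot P(\mbox{bot}|x_j)$. The function name `lower_bound_bot_prob` and the `records` layout are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def lower_bound_bot_prob(records):
    """Estimate a lower bound on P(bot | bin) for each bin of one feature.

    records: iterable of (bin_value, has_marker) pairs, where has_marker is
    True when the request exhibits the benign-leaning attribute x_*.
    Illustrative sketch only: it assumes x_* is conditionally independent of
    the feature given the class, and that some bin has no attack traffic.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for bin_value, has_marker in records:
        totals[bin_value] += 1
        hits[bin_value] += int(bool(has_marker))

    # Empirical P(x_* | bin) for every observed bin.
    rates = {b: hits[b] / totals[b] for b in totals}

    # The bin with the highest x_* rate is treated as attack-free.
    cleanest = max(rates.values())
    if cleanest == 0:
        # x_* never observed: no disparity to exploit, so the bound is trivial.
        return {b: 0.0 for b in rates}

    return {b: 1.0 - rates[b] / cleanest for b in rates}


# Synthetic example with P(x_*|bot) = 0.1 and K = 5, so P(x_*|benign) = 0.5.
# Bin "A": 100 benign requests, 50 of them with x_*.
# Bin "B": 50 bot requests (5 with x_*) plus 50 benign requests (25 with x_*).
records = (
    [("A", True)] * 50 + [("A", False)] * 50
    + [("B", True)] * 30 + [("B", False)] * 70
)
estimates = lower_bound_bot_prob(records)
```

In this synthetic example the true $P(\mbox{bot}|\mbox{B})$ is $0.5$, and the estimate for bin B is $1 - 0.3/0.5 = 0.4 = (1 - 1/5) \cdot 0.5$, which matches the $1-1/K$ under-estimation factor stated in the abstract. On real traffic the per-bin rates are noisy, so taking the maximum over sparsely populated bins would likely need smoothing or a minimum-count threshold.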
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: In response to the AE's comments we have added a discussion of the unlabelled classification scheme of Kaji and Sugiyama to Related work. We have also added an evaluation on simulated data in Section 7, showing that our approximation indeed matches the performance of true Naive Bayes.
Assigned Action Editor: ~Tongliang_Liu1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 998