Uncovering the Latent Relationships in Food Inspection Records via Unsupervised Clustering

Published: 23 May 2026, Last Modified: 01 Jun 2026
Venue: SD4H ICML 2026
License: CC BY 4.0
Keywords: public health, food inspection, unsupervised clustering, UMAP, Gaussian Mixture Models
TL;DR: We present the first machine learning study to leverage open food premises inspection data from Toronto Public Health (TPH).
Abstract: Regulatory food premises inspection records encode a structured profile of each establishment's operational characteristics, equipment inventory, and compliance history. However, the administrative risk categories assigned to food premises may not capture the full latent structure present in these records. We present the first machine learning study to leverage open food premises inspection data from Toronto Public Health (TPH), introducing a novel dataset of 20,055 records that has not previously been used in any machine learning or data-driven research. We develop an unsupervised clustering pipeline in which records are projected to two dimensions using UMAP and partitioned using a Gaussian Mixture Model (GMM). Candidate numbers of components $k \in \{2, \ldots, 10\}$ are compared using the Bayesian Information Criterion (BIC), and we select $k = 5$ to balance statistical fit with interpretability. Cluster quality is assessed against the regulatory three-level risk label (Low, Moderate, High) using the Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI). Post-hoc analysis reveals five operationally coherent establishment phenotypes, providing a complementary basis for inspection beyond rule-based risk scoring.
Submission Number: 66
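
The abstract compares cluster assignments with the three-level risk label using ARI and NMI. As a rough illustration of what those two scores measure, the following is a minimal pure-Python sketch, not the authors' code. The function names and the toy labels are hypothetical, and the NMI uses arithmetic-mean normalization, which is one common convention; the abstract does not state which normalization is used.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(labels_true, labels_pred):
    """Chance-corrected pair agreement between two labelings of the same records."""
    n = len(labels_true)
    if n < 2:
        return 1.0
    cells = Counter(zip(labels_true, labels_pred))  # contingency table entries
    rows = Counter(labels_true)                     # reference label sizes
    cols = Counter(labels_pred)                     # cluster sizes
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case: both labelings are one group, or both are all singletons.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def _entropy(counts, n):
    """Shannon entropy (natural log) of a labeling given its group sizes."""
    return -sum((c / n) * log(c / n) for c in counts)


def normalized_mutual_information(labels_true, labels_pred):
    """Mutual information divided by the arithmetic mean of the two entropies."""
    n = len(labels_true)
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)
    mi = sum(
        (c / n) * log(c * n / (rows[u] * cols[v]))
        for (u, v), c in cells.items()
    )
    h_sum = _entropy(rows.values(), n) + _entropy(cols.values(), n)
    if h_sum == 0:
        # Both labelings put every record in a single group.
        return 1.0
    return 2 * mi / h_sum


# Hypothetical toy data: six records with a risk label and a cluster id each.
risk = ["Low", "Low", "Moderate", "Moderate", "High", "High"]
cluster = [0, 0, 1, 1, 2, 2]
ari = adjusted_rand_index(risk, cluster)
nmi = normalized_mutual_information(risk, cluster)
```

Both scores depend only on how records are grouped, not on the label names, so a clustering that reproduces the risk categories under different ids still scores as a perfect match. With five clusters compared against three risk levels, as in the abstract, a perfect score is generally not attainable, so lower values do not by themselves indicate poor clusters.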