Precision Without Labels: Detecting Cross-Applicants in Mortgage Data Using Unsupervised Learning

ICLR 2026 Conference Submission 21244 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, Readers: Everyone, License: CC BY 4.0
Keywords: Unsupervised learning, Label-free model selection & evaluation, Precision-Recall lower bounds, Fairness analysis, Mortgage lending
TL;DR: Label-free model selection via structural constraints; scalable unsupervised linkage; theory gives precision/recall bounds; 92.3% implied precision on HMDA.
Abstract: We propose a novel method for evaluating unsupervised anonymous record linkage without labeled data. We derive observable lower bounds on both precision and relative recall by exploiting a common structural constraint that limits how many positive outcomes a single individual can have. These bounds enable principled tuning and comparison of label-generating models. We demonstrate the method on Home Mortgage Disclosure Act (HMDA) data, using a clustering algorithm to detect loan applicants who submit multiple applications (“cross-applicants”) in a dataset lacking personal identifiers. Our preferred specification identifies cross-applicants with 92.3% precision and only a minimal loss in relative recall.
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 21244