Precision Without Labels: Detecting Cross-Applicants in Mortgage Data Using Unsupervised Learning

Hadi Elzayn; Simon Freyaldenhoven; Minchul Shin

19 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: Unsupervised learning, Label-free model selection & evaluation, Precision-Recall lower bounds, Fairness analysis, Mortgage lending

TL;DR: Label-free model selection via structural constraints; scalable unsupervised linkage; theory gives precision/recall bounds; 92.3% implied precision on HMDA.

Abstract: We propose a novel method for evaluating unsupervised anonymous record linkage without requiring labeled training data. We derive observable lower bounds on both precision and relative recall by exploiting a common structural constraint that limits how many positive outcomes a single individual can have. This enables principled tuning and comparison of label-generating models in the absence of labels. We demonstrate the method on Home Mortgage Disclosure Act (HMDA) data, using a clustering algorithm to detect loan applicants who submit multiple applications (“cross-applicants”) in a dataset lacking personal identifiers. Our preferred specification identifies cross-applicants with an implied precision of 92.3%, at only a minimal loss in relative recall.
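To illustrate the kind of observable signal a structural constraint provides, the following is a minimal sketch and not the paper's estimator. It assumes, hypothetically, that each record carries a predicted cluster id and a flag for a positive outcome, and that one individual can have at most `max_positives` positive outcomes (the abstract specifies neither the outcome nor the cap). Any predicted cluster that exceeds the cap must contain at least one false link, so the share of such clusters can be computed without labels. The function name `violation_rate` and the toy data are illustrative only; the paper's actual precision and recall bounds are not reproduced here.

```python
from collections import defaultdict


def violation_rate(records, max_positives=1):
    """Share of multi-record clusters that break the structural constraint.

    records: iterable of (cluster_id, is_positive) pairs (hypothetical layout).
    A cluster violates the constraint when it holds more positive outcomes
    than a single individual could have, so it must contain a false link.
    """
    sizes = defaultdict(int)
    positives = defaultdict(int)
    for cluster_id, is_positive in records:
        sizes[cluster_id] += 1
        positives[cluster_id] += int(is_positive)

    # Only clusters with at least two records assert any link at all.
    multi = [c for c, n in sizes.items() if n > 1]
    if not multi:
        return 0.0
    violating = sum(1 for c in multi if positives[c] > max_positives)
    return violating / len(multi)


# Toy data: cluster "b" has two positive outcomes and so violates a cap of one.
toy = [
    ("a", True), ("a", False),
    ("b", True), ("b", True),
    ("c", False), ("c", False),
    ("d", True),  # singleton, asserts no link
]
rate = violation_rate(toy)
```

A lower violation rate across candidate specifications is consistent with fewer false links, which is the intuition behind using the constraint for label-free model comparison; turning it into a formal bound requires the additional assumptions developed in the paper.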

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 21244
