Abstract: Scholars studying organizations would like to work with multiple datasets lacking shared
unique identifiers or covariates. In such situations, researchers often turn to approximate
string matching methods to combine datasets. String matching, although useful, faces fundamental
challenges. String distance metrics are static and do not explicitly maximize match
probabilities. Moreover, many entities have multiple names that are dissimilar (e.g., “Fannie
Mae” and “Federal National Mortgage Association”). This paper proposes two approaches
to leveraging the massive amount of human-collaborated data from an employment-related
social networking site (LinkedIn) to address this problem. The first approach builds a
machine learning model for predicting matches, treating the trillion user-contributed
organizational name pairs as a training corpus. The second approach uses community detection
and treats user records as a network. We document substantial improvements over fuzzy
matching in three organization name matching exercises. We make our methods available
in an open-source R package (LinkOrgs).
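
The limitation of static string distance described above can be illustrated with a minimal sketch. It uses Python's standard-library difflib as a stand-in for a generic fuzzy-matching score; it is not the paper's method and does not reflect the LinkOrgs API. The helper name `similarity` and the misspelling "Fanny Mae" are assumptions introduced only for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a character-overlap similarity score in [0, 1] (hypothetical helper)."""
    # Lowercase both names so the comparison ignores capitalization.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Two names for the same organization that share few characters.
alias_pair = similarity("Fannie Mae", "Federal National Mortgage Association")

# A simple misspelling (illustrative assumption, not from the paper).
typo_pair = similarity("Fannie Mae", "Fanny Mae")

print(f"alias pair: {alias_pair:.2f}")
print(f"typo pair:  {typo_pair:.2f}")
```

A character-based score of this kind rates the misspelling well above the true alias, which is the gap that learning from user-contributed name pairs is meant to close.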