Linking Datasets on Organizations Using Half A Billion Open Collaborated Records

20 May 2022, OpenReview Archive Direct Upload, Readers: Everyone
Abstract: Scholars studying organizations would like to work with multiple datasets lacking shared unique identifiers or covariates. In such situations, researchers often turn to approximate string matching methods to combine datasets. String matching, although useful, faces fundamental challenges. String distance metrics are static, and do not explicitly maximize match probabilities. Moreover, many entities have multiple names that are dissimilar (e.g., “Fannie Mae” and “Federal National Mortgage Association”). This paper proposes two approaches to leveraging the massive amount of human-collaborated data from an employment-related social networking site (LinkedIn) to address this problem. The first approach builds a machine learning model for predicting matches, treating the trillion user-contributed organizational name pairs as a training corpus. The second approach uses community detection and treats user records as a network. We document substantial improvements over fuzzy matching in three organization name matching exercises. We make our methods available in an open-source R package (LinkOrgs).
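The limitation of static string distances described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's method and does not use the LinkOrgs package; it relies only on the standard-library `difflib.SequenceMatcher` as a stand-in for a generic fuzzy matcher. The helper names `similarity` and `best_match`, and the distractor candidate "Fannie May Confections", are assumptions introduced for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] from difflib (a generic fuzzy score)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_match(name: str, candidates: list[str]) -> str:
    """Return the candidate with the highest string similarity to `name`."""
    return max(candidates, key=lambda c: similarity(name, c))


# Two names for the same organization (example taken from the abstract).
alias_score = similarity("Fannie Mae", "Federal National Mortgage Association")

# A simple transposition typo of the same name, for comparison.
typo_score = similarity("Fannie Mae", "Fannie Mea")

# Hypothetical candidate list: the true alias plus a similar-looking distractor.
candidates = ["Federal National Mortgage Association", "Fannie May Confections"]
chosen = best_match("Fannie Mae", candidates)

print(f"alias score: {alias_score:.2f}")
print(f"typo score:  {typo_score:.2f}")
print(f"fuzzy matcher picks: {chosen}")
```

In this sketch the typo scores far higher than the true alias, and the matcher prefers the distractor because it shares more characters with the query. This is the kind of failure that a model trained on observed name pairs, or a network of linked names, is meant to address.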