Scaling Hierarchical Coreference with Homomorphic Compression

Michael Wick, Swetasudha Panda, Joseph Tassarotti, Jean-Baptiste Tristan

Nov 17, 2018 AKBC 2019 Conference Blind Submission readers: everyone Show Bibtex
  • Keywords: coreference, entity resolution, CRF, LSH, hashing, random projections
  • TL;DR: We employ linear homomorphic compression schemes to represent the sufficient statistics of a conditional random field model of coreference and this allows us to scale inference and improve speed by an order of magnitude.
  • Abstract: Locality sensitive hashing schemes such as \simhash provide compact representations of multisets from which similarity can be estimated. However, in certain applications, we need to estimate the similarity of dynamically changing sets. In this case, we need the representation to be a homomorphism so that the hash of unions and differences of sets can be computed directly from the hashes of operands. We propose two representations that have this property for cosine similarity (an extension of \simhash and angle-preserving random projections), and make substantial progress on a third representation for Jaccard similarity (an extension of \minhash). We employ these hashes to compress the sufficient statistics of a conditional random field (CRF) coreference model and study how this compression affects our ability to compute similarities as entities are split and merged during inference. \cut{We study these hashes in a conditional random field (CRF) hierarchical coreference model in order to compute the similarity of entities as they are merged and split during inference.} We also provide novel statistical analysis of \simhash to help justify it as an estimator inside a CRF, showing that the bias and variance reduce quickly with the number of bits. On a problem of author coreference, we find that our \simhash scheme allows scaling the hierarchical coreference algorithm by an order of magnitude without degrading its statistical performance or the model's coreference accuracy, as long as we employ at least 128 or 256 bits. Angle-preserving random projections further improve the coreference quality, potentially allowing even fewer dimensions to be used.
  • Archival status: Archival
  • Subject areas: Information Integration
0 Replies