Count information: retrieving and estimating cardinality of entity sets from the web

Shrestha Ghosh

Published: 2024, Last Modified: 14 Jan 2026. License: CC BY-SA 4.0

Abstract: Extracting information from the Web remains a critical component in knowledge harvesting systems for building curated knowledge structures, such as Knowledge Bases (KBs), and for satisfying evolving user needs, which require operations such as aggregation and reasoning. Estimating the cardinality of a set of entities on the Web to fulfill the information need of questions of the form “how many ..?” is a challenging task. While, intuitively, cardinality can be estimated by explicitly enumerating the constituent entities, this is usually not possible due to the low recall of entities on the Web. We present our contributions towards retrieving and estimating cardinalities of entity sets on the Web:
• We propose a method, CounQER, for discovering count information in KBs. We identify interpretable classes of features to classify KB predicates that store counts and enumerations. Further, we devise heuristics to align semantically related counts and enumerations with each other. CounQER is also accessible as a system demonstration.
• We propose a method, CoQEx, to infer the count distribution from multiple text snippets. CoQEx is trained using distant supervision to identify relevant counts and predicts the final result via a weighted median. CoQEx provides explanatory evidence by forming semantic groups of the contexts, by ranking exemplary instances, and by giving the provenance of the counts in the originating snippets. CoQEx is also available online as a system demonstration.
• We tackle the problem of predicting the larger of two sets of entities when direct comparison of the counts may give incorrect results. We emulate a smart human’s approach and introduce a variety of online signals that can be applied to solve the problem. We propose novel techniques for aggregating signals with partial coverage into more reliable estimates of which of the two given classes has more instances.
• We propose CardiO, a lightweight and modular framework for estimating cardinalities on the Web. CardiO scores counts based on the relevance of their context to the expected answer type and the relevance of the parent sentence and snippet to the user query. CardiO leverages supporting facts to re-score the counts for the final prediction. Further, CardiO identifies relevant peer sets to predict the cardinality of the original entity set.
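
As an illustration of the weighted-median aggregation mentioned for CoQEx, here is a minimal sketch. It is not the thesis implementation: the function name `weighted_median`, the use of per-count confidence scores as weights, and the "lower weighted median" convention (the smallest value whose cumulative weight reaches half of the total) are all assumptions made for this example.

```python
from typing import Sequence


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Return the smallest value whose cumulative weight reaches half the total.

    Hypothetical helper: `values` stand for counts extracted from snippets and
    `weights` for assumed confidence scores attached to each count.
    """
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    # Sort counts in ascending order, carrying their weights along.
    pairs = sorted(zip(values, weights))
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= total / 2:
            return value
    return pairs[-1][0]  # Fallback for floating-point rounding.


# Illustrative, made-up inputs: five extracted counts with assumed confidences.
counts = [7, 8, 8, 12, 195]
confidences = [0.2, 0.9, 0.8, 0.3, 0.1]
print(weighted_median(counts, confidences))
```

With these made-up inputs the low-confidence outlier 195 barely moves the result, which is the usual motivation for preferring a weighted median over a weighted mean when some extracted counts are noisy.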

External IDs: dblp:phd/dnb/Ghosh24