Abstract: We tackle the longstanding question of checking hypotheses on the social Web. In particular, we address the challenges that arise in the context of testing an input hypothesis on many data samples, in our case, user groups. This is referred to as Multiple Hypothesis Testing, a method of choice for data-driven discoveries. Ensuring sound discoveries in large datasets poses two challenges: the likelihood of accepting a hypothesis by chance, i.e., returning false discoveries, and the pitfall of not being representative of the input data. We develop GroupTest, a framework for group testing that addresses both challenges. We formulate CoverTest, a generic top-n problem that seeks n user groups satisfying one-sample, two-sample, or multiple-sample tests, and maximizing data coverage. We show the hardness of CoverTest and develop a greedy algorithm with a provable approximation guarantee as well as a faster heuristic-based algorithm based on α-investing. Our extensive experiments on four real-world datasets demonstrate the necessity to optimize coverage for sound data-driven discoveries, and the efficiency of our heuristic-based algorithm.
Loading