Cohort query processing without misleading aging effects

Published: 2025, Last Modified: 21 Jan 2026, VLDB J. 2025, CC BY-SA 4.0
Abstract: Cohort analysis is a powerful method widely employed in fields such as medicine and sociology to study how groups of individuals behave over time under different initial conditions. Business data analysts seek to apply this method to massive Internet user activity datasets, aiming to uncover the impact of user attributes and application changes on behavior. However, cohort analysis queries are difficult to specify and costly to execute in traditional database systems. Moreover, users differ in how they allocate time both within and outside Internet applications, so their application experiences are inconsistent even when their usage durations are identical. This disparity leads to individual aging bias, which distorts the estimation of aging effects on user behavior. To address these challenges, we propose extending database systems to facilitate cohort analysis while mitigating individual aging bias. Central to our approach is a general cohort analysis schema that integrates “constraint-chain-based age”. This novel concept effectively eliminates aging bias by aligning the age of user activities with their unique application experiences. Building on this schema, we develop two distinct evaluation implementations for cohort query processing. The first, an intrusive implementation, extends SQL with custom operators that natively support constraint-chain-based age calculation. This implementation leverages a columnar evaluation framework optimized for cohort queries, incorporating advanced techniques like redundant sub-operator coalescence to enhance query plan optimization. The second, a non-intrusive implementation, is designed for rapid deployment on existing relational databases (row-store or column-store) using standard SQL, providing a baseline for performance and usability comparison. Extensive experiments validate the effectiveness of our approach, demonstrating that the intrusive implementation achieves substantial performance gains over the non-intrusive alternative. These results highlight the advantages of embedding native cohort query support into database systems.
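For readers unfamiliar with cohort queries, the following is a minimal sketch of a conventional cohort report, in which a user's age is simply the wall-clock time elapsed since their first recorded activity. This is the kind of age measure the abstract describes as susceptible to individual aging bias; the sketch does not implement constraint-chain-based age, whose definition is not given here. The activity log, the weekly cohort granularity, and the `cohort_report` function are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical activity log: (user_id, activity_date, minutes_spent).
ACTIVITIES = [
    ("u1", date(2025, 1, 6), 30),
    ("u1", date(2025, 1, 14), 20),
    ("u2", date(2025, 1, 7), 10),
    ("u2", date(2025, 1, 8), 15),
    ("u3", date(2025, 1, 13), 40),
    ("u3", date(2025, 1, 21), 5),
]


def cohort_report(activities):
    """Average minutes per (birth week, age in weeks), using wall-clock age."""
    # Assumption: a user's "birth" is the date of their first recorded activity.
    birth = {}
    for user, day, _ in activities:
        if user not in birth or day < birth[user]:
            birth[user] = day

    # (cohort, age) -> [sum of minutes, number of activity records]
    totals = defaultdict(lambda: [0, 0])
    for user, day, minutes in activities:
        cohort = tuple(birth[user].isocalendar()[:2])  # (ISO year, ISO week)
        age = (day - birth[user]).days // 7            # whole weeks since birth
        totals[(cohort, age)][0] += minutes
        totals[(cohort, age)][1] += 1

    return {key: total / count for key, (total, count) in totals.items()}


if __name__ == "__main__":
    for (cohort, age), avg in sorted(cohort_report(ACTIVITIES).items()):
        print(cohort, age, round(avg, 2))
```

Under this measure, two users born in the same week are compared at the same age regardless of how much they actually used the application in between, which is the inconsistency the abstract attributes to wall-clock aging.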