Web Archive Analysis Using Hive and SparkSQL

Xinyue Wang, Zhiwu Xie

Published: 2019, Last Modified: 13 May 2025 | JCDL 2019 | CC BY-SA 4.0

Abstract: Analyzing large web archive collections incurs high computational costs. We propose an analytic framework based on Apache Hive and SparkSQL that integrates data storage and processing. This approach achieves more balanced performance across typical web archive analysis tasks, from searching and filtering to extracting and deriving.