Web Archive Analysis Using Hive and SparkSQL

Published: 01 Jan 2019, Last Modified: 13 May 2025, Venue: JCDL 2019, License: CC BY-SA 4.0
Abstract: Analyzing large web archive collections incurs high computational costs. We propose an analytic framework based on Apache Hive and SparkSQL that integrates data storage and processing. The framework achieves more balanced performance across typical web archive analysis tasks, from searching and filtering to extracting and deriving.