Abstract: Analyzing large web archive collections incurs high computational costs. We propose an analytic framework based on Apache Hive and SparkSQL that integrates data storage and processing. This approach achieves more balanced performance across typical web archive analysis tasks, from searching and filtering to extracting and deriving.