Abstract: We present TICC, an automatic data compression component that transparently eliminates data redundancies across columns in column-oriented database systems. We further propose two approaches for integrating inter-column compression into existing database systems: one based on User Defined Functions (UDFs), and one implemented natively. We implement both approaches on top of Hive using the ORC file format, a common data format in column stores, and evaluate the performance of TICC on real-world datasets. The experimental results demonstrate that TICC significantly reduces storage overhead and processes a variety of queries over large-scale data with up to 20% performance improvement over the original Hive.
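To make the idea of redundancy across columns concrete, the following is a minimal illustrative sketch, not TICC's actual algorithm, which the abstract does not describe. It assumes one column (here a hypothetical `state` column) is mostly determined by another (a hypothetical `city` column), so the dependent column can be stored as a small value mapping plus a list of exceptions. The function names `compress_dependent_column` and `decompress_dependent_column` are invented for this example.

```python
from collections import Counter, defaultdict


def compress_dependent_column(source, target):
    """Encode `target` as a source-value mapping plus per-row exceptions.

    Illustrative assumption: `target` is largely a function of `source`.
    """
    if len(source) != len(target):
        raise ValueError("columns must have the same length")

    # Count which target values co-occur with each source value.
    counts = defaultdict(Counter)
    for s, t in zip(source, target):
        counts[s][t] += 1

    # Keep the most frequent target value for each source value.
    mapping = {s: c.most_common(1)[0][0] for s, c in counts.items()}

    # Rows the mapping predicts incorrectly are stored explicitly by row index.
    exceptions = {
        i: t
        for i, (s, t) in enumerate(zip(source, target))
        if mapping[s] != t
    }
    return mapping, exceptions


def decompress_dependent_column(source, mapping, exceptions):
    """Rebuild the dependent column from the source column and the encoding."""
    return [exceptions.get(i, mapping[s]) for i, s in enumerate(source)]


# Hypothetical example data: `state` is almost fully determined by `city`.
city = ["Austin", "Dallas", "Austin", "Portland", "Portland", "Portland"]
state = ["TX", "TX", "TX", "OR", "OR", "ME"]

mapping, exceptions = compress_dependent_column(city, state)
restored = decompress_dependent_column(city, mapping, exceptions)
```

In this sketch the six-row `state` column is replaced by a three-entry mapping and one exception, and decompression is lossless. A real system would additionally have to decide which column pairs are worth encoding this way and how the encoding interacts with the storage format.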