Big Data Joe Benchmarks: Hive on MapReduce and Hive on Apache Tez
I decided to run some tests of my own comparing Hive on MapReduce and Hive on Apache Tez. I also tried using the ORC (Optimized Row Columnar) file format and enabling Vectorization.
ORC File Format?
The Optimized Row Columnar (ORC) file format provides a highly efficient way to store Hive data. It was designed to overcome limitations of the other Hive file formats. Using ORC files improves performance when Hive is reading, writing, and processing data.
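As a rough illustration of how a text-backed table might be copied into ORC, the sketch below assembles a `CREATE TABLE ... STORED AS ORC AS SELECT` statement. The table names `hvac` and `hvac_orc` are assumptions (this post only names the CSV files), and Python is used purely to build the HiveQL string; it does not connect to Hive.

```python
def orc_copy_statement(source_table, orc_table):
    """Build a HiveQL CTAS statement that copies a table into ORC storage."""
    return (
        f"CREATE TABLE {orc_table} STORED AS ORC "
        f"AS SELECT * FROM {source_table};"
    )


# Hypothetical table names; the DDL used for the benchmark is not shown here.
print(orc_copy_statement("hvac", "hvac_orc"))
```

The resulting statement could then be pasted into the Hive shell, after which queries would be pointed at the ORC copy instead of the original table.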
Vectorization?
When the Vectorization feature is enabled, Hive processes rows in batches of roughly 1,000 (1,024 by default) instead of one row at a time. This can make processing up to 3X faster while using less CPU time, which improves cluster utilization. (Reducing latency through extensive container use and reuse is a Tez feature rather than a Vectorization one.) At the time of writing, Vectorization works only on Hive tables stored in the ORC file format.
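The sketch below shows one way the session settings for each benchmark configuration could be written down. `hive.execution.engine` and `hive.vectorized.execution.enabled` are standard Hive properties, but which settings were actually used for these runs is not shown in this post, so the mapping and the configuration names are assumptions. Python only generates the `SET` lines here.

```python
# Hive session properties per benchmark configuration.
# The property names are real Hive settings; the grouping is an assumption.
CONFIGS = {
    "mapreduce": {"hive.execution.engine": "mr"},
    "tez": {"hive.execution.engine": "tez"},
    "tez_vectorized": {
        "hive.execution.engine": "tez",
        "hive.vectorized.execution.enabled": "true",
    },
}


def session_preamble(config_name):
    """Return the SET statements to run before the query for one configuration."""
    settings = CONFIGS[config_name]
    return [f"SET {key}={value};" for key, value in settings.items()]


for name in CONFIGS:
    print(name)
    for statement in session_preamble(name):
        print("  " + statement)
```

ORC does not appear in this mapping because it is a property of how the table is stored, not a session setting.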
Host Configuration:
This was run in a single-node sandbox environment (Hortonworks HDP 2.1) with 2 cores and 8 GB of RAM.
Datasets:
HVAC.csv: 7 columns, 8000 rows
Building.csv: 5 columns, 20 rows
Query Comparison with Hive on MapReduce and Apache Tez:
MapReduce:
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 35.032 seconds, Fetched: 8000 row(s)
Apache Tez - 1st Run:
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 24.484 seconds, Fetched: 8000 row(s)
Apache Tez - 2nd Run:
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 11.547 seconds, Fetched: 8000 row(s)
Explanation:
On the 1st run, Apache Tez is roughly 30% faster than MapReduce (24.484 seconds versus 35.032 seconds); on the 2nd run it is roughly 67% faster (11.547 seconds). The extra gain on the 2nd run comes from the way Apache Tez optimizes queries and reuses processes that are already spun up within the cluster.
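The improvement figures can be recomputed directly from the "Time taken" lines above. This is a minimal sketch; the function and constant names are mine, and "improvement" is taken to mean the reduction in wall-clock time relative to the MapReduce run.

```python
def improvement_pct(baseline_s, candidate_s):
    """Percentage reduction in run time relative to the baseline."""
    return (baseline_s - candidate_s) / baseline_s * 100.0


# Timings in seconds, copied from the output above.
MAPREDUCE_S = 35.032
TEZ_FIRST_S = 24.484
TEZ_SECOND_S = 11.547

print(f"Tez 1st run: {improvement_pct(MAPREDUCE_S, TEZ_FIRST_S):.1f}% faster")
print(f"Tez 2nd run: {improvement_pct(MAPREDUCE_S, TEZ_SECOND_S):.1f}% faster")
```

With these timings the script reports 30.1% for the 1st run and 67.0% for the 2nd run.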
Query Comparison with Hive on MapReduce and Apache Tez utilizing the ORC File Format and Vectorization:
MapReduce with ORC:
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 34.647 seconds, Fetched: 8000 row(s)
Apache Tez with ORC - 1st Run:
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 23.232 seconds, Fetched: 8000 row(s)
Apache Tez with ORC - 2nd Run:
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 12.055 seconds, Fetched: 8000 row(s)
Apache Tez with ORC and Vectorization - 1st Run:
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 22.127 seconds, Fetched: 8000 row(s)
Apache Tez with ORC and Vectorization - 2nd Run:
6/19/13 4:33:07 65 63 7 23 20 Argentina ACMAX22 19 M20
6/20/13 5:33:07 66 66 9 21 3 Brazil JDNS77 28 M3
Time taken: 15.882 seconds, Fetched: 8000 row(s)
Explanation:
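The results are mixed. On the 1st (ramp-up) run, Tez was about 5% faster with ORC and about 10% faster with ORC and Vectorization than with the plain text table, and MapReduce with ORC was about 1% faster. On the 2nd run, however, Tez was about 4% slower with ORC and about 38% slower with ORC and Vectorization. The most likely explanation is that the ORC file format and Vectorization pay off mainly on large datasets, where they increase performance and decrease overall cluster utilization, and an 8000-row table is too small to show it. I will attempt to show evidence of that in a later blog post.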
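To make the comparison concrete, the sketch below sets each ORC and Vectorization timing against the matching plain-text run. The dictionary keys and the helper name are my own labels; the numbers are the "Time taken" values reported above.

```python
# Timings in seconds, copied from the results above.
PLAIN = {"mapreduce": 35.032, "tez_run1": 24.484, "tez_run2": 11.547}
ORC = {"mapreduce": 34.647, "tez_run1": 23.232, "tez_run2": 12.055}
ORC_VECTORIZED = {"tez_run1": 22.127, "tez_run2": 15.882}


def delta_pct(plain_s, variant_s):
    """Positive means the variant was faster than the plain-text run."""
    return (plain_s - variant_s) / plain_s * 100.0


for label, variant in (("ORC", ORC), ("ORC + Vectorization", ORC_VECTORIZED)):
    for run, seconds in variant.items():
        print(f"{label:<20} {run:<10} {delta_pct(PLAIN[run], seconds):+.1f}%")
```

A positive value means the variant beat the plain-text run and a negative value means it was slower. Each configuration was timed only once or twice on a 2-core sandbox, so differences of a few percent may well be noise.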
Summary:
Apache Tez has been evolving very quickly and is helping turn Hive into a real tool for ad-hoc queries. My results are very promising and show significant improvements over traditional Hive on MapReduce. I look forward to repeating these tests in a real environment and on larger datasets to see what Apache Tez is really capable of.