Apache Tez synopsis from a top engineer in the Big Data space ...

Oct 21, 2013

I’ll keep this anonymous … but here’s some good insight.

“Tez is a project led by Hortonworks. In Hadoop batch processing, a lot of jobs are interdependent. For example, in a lot of data flow architectures job2 waits on job1 to complete and reads job1’s output files, job3 waits on job2 to complete and reads job2’s output files, and so on. The only solution before Tez was Oozie (a workflow manager). I used to manage the Oozie development group at XXXX, and we open sourced it to Apache under my leadership. Oozie, however, sits outside of the Hadoop compute engine and queries the JobTracker in Hadoop 1.0 or the ResourceManager in Hadoop 2.0 to figure out whether a job has finished.

Tez skips this cost by providing a lot of that functionality within a Hadoop execution layer. The aim is to enable Pig, Hive, or other batch jobs to leverage interdependency and to create a common execution layer. Tez is a promising idea. I was leading the XXXX-Hortonworks interaction when the Tez idea was being discussed. I expect Tez to mature over the next 1-2 years. Optimizing file dependency at the execution layer level, rather than outside of Hadoop (as Oozie does), provides significant time savings. About 9 months ago Hortonworks announced that they were able to run about 5% of Hive queries a lot faster by using the Tez execution layer. I am not sure how far they have come since then.”
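
To make the polling cost in the quote concrete, here is a rough toy model (mine, not the engineer's). It does not use any Oozie or Tez API. The job names follow the job1/job2/job3 example above, while the runtimes, the poll interval, and both function names are made-up assumptions. It only models the delay between a job finishing and an outside scheduler noticing, which is one part of the overhead described.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Job chain from the quote: job2 reads job1's output, job3 reads job2's.
DEPENDENCIES = {"job1": set(), "job2": {"job1"}, "job3": {"job2"}}
# Hypothetical runtimes in seconds, chosen only for illustration.
JOB_SECONDS = {"job1": 7, "job2": 4, "job3": 5}


def run_with_external_poller(poll_interval):
    """Simulate a workflow manager outside the engine that polls for completion.

    The manager checks job status only at multiples of poll_interval, so the
    next job cannot start until the first poll at or after the finish time.
    """
    clock = 0
    for job in TopologicalSorter(DEPENDENCIES).static_order():
        finished_at = clock + JOB_SECONDS[job]
        polls_needed = -(-finished_at // poll_interval)  # ceiling division
        clock = polls_needed * poll_interval
    return clock


def run_inside_engine():
    """Simulate an execution layer that knows the dependency graph itself.

    Each job starts as soon as the job it depends on finishes.
    """
    clock = 0
    for job in TopologicalSorter(DEPENDENCIES).static_order():
        clock += JOB_SECONDS[job]
    return clock


if __name__ == "__main__":
    print("external poller, 10s interval:", run_with_external_poller(10))
    print("dependency-aware engine:", run_inside_engine())
```

With these invented numbers, the polled chain spends much of its time waiting for the next status check, while the dependency-aware run takes only the sum of the job runtimes. Real clusters differ in many ways, so treat this as an illustration of the argument and not as a benchmark.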
