Opinion: Native HDFS on EMC Isilon
From my viewpoint, the main issue is that Hadoop's design emphasizes "High Reliability", not just "High Availability". If Isilon acts as a middleman for HDFS and makes Hadoop's NameNode believe it has written 3x replicas when only one copy actually exists on the Isilon cluster, then it potentially undermines the overall reliability of Hadoop for the following reasons:
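To make that concrete, here's a rough sketch (assuming the standard hadoop-client API and a made-up path /data/example.txt) of asking the NameNode what replication factor and block hosts it believes a file has. On native HDFS, each reported host is an independent DataNode with its own disks; in the scenario above, the reported replicas wouldn't map to separate physical copies:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicaReport {
        public static void main(String[] args) throws Exception {
            // Placeholder path -- substitute a real file on your cluster.
            Path file = new Path(args.length > 0 ? args[0] : "/data/example.txt");

            // Picks up core-site.xml / hdfs-site.xml from the classpath.
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(file);

            // Replication factor the NameNode *claims* for this file.
            System.out.println("Reported replication: " + status.getReplication());

            // Hosts the client is told it can read each block from.
            for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
                System.out.println("Block @ offset " + block.getOffset()
                        + " -> hosts: " + String.join(", ", block.getHosts()));
            }
            fs.close();
        }
    }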
Part of what makes Hadoop reliable is the actual existence of three separate, independent copies of each block. If a file is written only once to the Isilon cluster and something goes wrong, such as corruption, that single copy is now corrupt. When a job runs against the file and Hadoop detects the corruption, it will try to access another copy it thinks exists but which isn't actually there, and the job fails.
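As I understand the native HDFS read path, that failover between replicas happens inside the client library, which is exactly what you lose when the extra copies aren't real. A minimal read sketch (again assuming the stock hadoop-client API and a placeholder path), with the relevant behavior described in the comments:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ReadWithFailover {
        public static void main(String[] args) throws Exception {
            Path file = new Path(args.length > 0 ? args[0] : "/data/example.txt"); // placeholder

            FileSystem fs = FileSystem.get(new Configuration());
            // On native HDFS, a checksum mismatch on one DataNode makes the client
            // library report the corrupt replica to the NameNode and quietly retry
            // the read against another replica; the application never notices.
            // If every "replica" is really the same single physical copy, there is
            // nothing left to retry, and the read -- and the task running it -- fails.
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false);
            }
            fs.close();
        }
    }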
This would also be an issue if the Isilon cluster were to disappear, since it is a single point of failure, whereas Hadoop has deliberately moved away from SPOFs with its shared-nothing approach.
Consider speculative execution, where a task runs slower than average due to hardware and Hadoop launches a duplicate copy of the task on another node to see which attempt finishes first. Isilon works fine if the task is hampered by compute limitations, but if the disk is the culprit, it reduces Hadoop's options. In a traditional shared-nothing Hadoop cluster, the duplicate attempt can pull the data from a genuinely different set of disks that may be faster; with Isilon, the storage is shared, so the duplicate attempt runs against the same set of disks and effectively adds load to the storage that is already slow.
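For reference, speculative execution is just a job-level setting; a minimal sketch of turning it on (assuming the standard MapReduce API, with the rest of the job setup elided):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpeculativeJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Launch duplicate attempts for straggling tasks and keep whichever
            // attempt finishes first (these are the stock MapReduce settings).
            conf.setBoolean("mapreduce.map.speculative", true);
            conf.setBoolean("mapreduce.reduce.speculative", true);

            Job job = Job.getInstance(conf, "speculative-example");
            // ... set mapper, reducer, input and output paths as usual, then:
            // job.waitForCompletion(true);

            // The duplicate attempt only helps if it lands on different hardware;
            // when all attempts read from the same shared storage, a slow disk
            // stays slow no matter which node runs the compute.
        }
    }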
That's what I can think of for now. I'm not saying Isilon isn't a good solution, but from what I can see, it will work for certain types of use cases (e.g., as a Hadoop data repository).