Question about Hadoop: "Because there are 3 copies of every block, does the map run in 3 places?"
Here’s a transcript of the audio from the talk “Introduction to Apache Hadoop Part 1”, given by Sarah Sproehnle at the Strata Conference 2012.
“So the answer is no. The map is going to run on one, except in the case of the redundant job, if there’s slow hardware. But assuming everything is running normally, it will run on one of them. And which one gets picked is based on the scheduling algorithm, this heuristic … there’s a scheduler that’s looking at who’s busy, who’s not busy and those kinds of things.”
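To make the idea concrete, here is a minimal sketch of the heuristic she describes, not Hadoop’s actual JobTracker code: among the (typically three) nodes holding replicas of the block, prefer the least busy one that still has a free task slot. All class and field names here are illustrative assumptions.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

public class ReplicaScheduler {

    // Hypothetical view of a worker node's load; names are made up for this sketch.
    static class NodeStatus {
        final String host;
        final int runningTasks;
        final int maxSlots;

        NodeStatus(String host, int runningTasks, int maxSlots) {
            this.host = host;
            this.runningTasks = runningTasks;
            this.maxSlots = maxSlots;
        }

        boolean hasFreeSlot() {
            return runningTasks < maxSlots;
        }
    }

    // Pick one of the replica holders: the least loaded node with a free slot.
    // If all of them are busy, the real scheduler would fall back to a
    // rack-local or remote node; here we simply report "no choice".
    static Optional<NodeStatus> pickNode(List<NodeStatus> replicaHolders) {
        return replicaHolders.stream()
                .filter(NodeStatus::hasFreeSlot)
                .min(Comparator.comparingInt(n -> n.runningTasks));
    }

    public static void main(String[] args) {
        List<NodeStatus> holders = List.of(
                new NodeStatus("node-a", 4, 4),   // full: no free slots
                new NodeStatus("node-b", 1, 4),   // lightly loaded
                new NodeStatus("node-c", 3, 4));  // busy, but one slot free

        pickNode(holders).ifPresent(n ->
                System.out.println("Run the map task on " + n.host));
        // Prints: Run the map task on node-b
    }
}
```

The point of the replica list is data locality: the task runs where one copy of the data already sits, so only one of the three replicas is actually processed.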
She then explains the “redundant job” that kicks in when there is slow hardware (what Hadoop calls speculative execution):
“Now if a node appears to be slower than the others … this can happen … it’s actually a property of modern hardware … the way disks still have a lot of spinning parts. If a hard drive is failing, it doesn’t usually just go kaput; it fails slowly and it starts being very slow. So, meanwhile, if all of you were my Hadoop cluster and one of you has a drive that’s going bad, then the one that’s slow isn’t getting its map tasks done; it’s not reading its data as fast as everybody else. So I’m being held up … everyone else is finishing … but that slow one isn’t. I don’t want to have to wait on the slow one. In fact, maybe it doesn’t finish, right? So, the scheduler will keep track of the average time that the tasks completed in, and if one of these is particularly slower than average, it will start another copy of it. It doesn’t want to stop what was going on, because it doesn’t know why it’s slow. It could be bad hardware, but it could be something about that portion of the data; maybe 1/100th of your data is just a lot more costly to process, it just takes longer. Well, if that’s the case, having someone else do it is still going to be slow. But if it’s the case that your hard drive is going bad, then it would help to run it someplace else. So, Hadoop will actually start the task redundantly on another node and let them race. It will say: you are particularly slow, but I’ll start the same task on another copy of the data, and then I’ll see who finishes first and I’ll take the results of the winner. So it’s just a neat feature. Like many features of Hadoop, it’s tunable …”
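Here is a toy sketch of that speculative-execution heuristic, again not Hadoop’s real implementation: track how long each running task has taken, compare it against the average completion time of the finished ones, and launch a second copy of any clear straggler. The slowdown threshold and all identifiers are assumptions made for illustration.

```java
import java.util.List;

public class SpeculationSketch {

    // Hypothetical record of a task attempt; field names are illustrative.
    static class TaskAttempt {
        final String taskId;
        final long elapsedMillis;   // how long this attempt has run so far
        final boolean finished;

        TaskAttempt(String taskId, long elapsedMillis, boolean finished) {
            this.taskId = taskId;
            this.elapsedMillis = elapsedMillis;
            this.finished = finished;
        }
    }

    // Launch a duplicate attempt when a running task has already taken
    // noticeably longer than the average of the finished ones. The original
    // attempt keeps running; whichever copy finishes first wins, and the
    // loser's output is discarded.
    static void maybeSpeculate(List<TaskAttempt> attempts, double slowdownFactor) {
        double avgFinished = attempts.stream()
                .filter(a -> a.finished)
                .mapToLong(a -> a.elapsedMillis)
                .average()
                .orElse(Double.MAX_VALUE);   // nothing finished yet: never speculate

        attempts.stream()
                .filter(a -> !a.finished)
                .filter(a -> a.elapsedMillis > slowdownFactor * avgFinished)
                .forEach(a -> System.out.println(
                        "Task " + a.taskId + " is a straggler; "
                        + "starting a speculative copy on another replica holder"));
    }

    public static void main(String[] args) {
        List<TaskAttempt> attempts = List.of(
                new TaskAttempt("map-001", 60_000, true),
                new TaskAttempt("map-002", 65_000, true),
                new TaskAttempt("map-003", 190_000, false));  // straggler

        maybeSpeculate(attempts, 1.5);
        // Prints that map-003 gets a speculative copy.
    }
}
```

This is also the “tunable” part she mentions at the end: speculative execution can be switched on or off per job, via the `mapred.map.tasks.speculative.execution` and `mapred.reduce.tasks.speculative.execution` properties in the MRv1 releases current at the time of the talk (renamed `mapreduce.map.speculative` and `mapreduce.reduce.speculative` in MRv2).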