The Evolution of Big Data Joe
Wow, a lot has changed since my last post back in September 2015, both in Big Data and personally. I am no longer on the dark side (aka Sales/Consulting) :-) .. I am now at Western Digital as the Global Sr. Director, Big Data Platform and Analytics, and I am loving it! I have an amazing team spread across the world and we are doing some awesome things.
My perspective on Big Data has changed a lot since moving to a company to fully implement its Big Data vision. Some of the things I’ve learned are ..
Hadoop is not Big Data.
I fell into the same trap as many others, thinking Hadoop could do it all when it came to “Big Data” .. which isn’t the case. Hadoop does enable you to “Think Big,” but there is an ecosystem of platform technologies that has to be part of your Big Data Platform.
Massively Parallel Processing (MPP) is needed.
MPP (AWS Redshift, Teradata, Netezza, etc.) is essential to providing a last mile for relational data transformed in Hadoop. You shouldn’t put all of your data there, since it is pricier than storing your data in Hadoop; think of it as a transformation repository for Gigabytes to Terabytes of data and/or months’ worth of ad-hoc, performant data. Keep the Petabytes and years’ worth of data in Hadoop. I know anyone who has read anything about Big Data will bring up the MPP-like technologies sitting on top of Hadoop (Cloudera Impala, Presto, Apache HAWQ) .. but they still aren’t truly MPP. While they are catching up, they still don’t have the same core features that MPP technologies have had for years, like something as fundamental as full SQL compliance.
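To make that tiering a little more concrete, here is a minimal sketch of the hand-off, assuming hypothetical paths, table names, and columns: the heavy lifting happens in Hadoop with Spark, and only the aggregated “last mile” subset gets staged for a bulk load into an MPP store like Redshift.

```python
# Minimal sketch of the Hadoop -> MPP "last mile" hand-off.
# All paths, bucket names, and columns below are hypothetical examples.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("mpp-last-mile").getOrCreate()

# Petabyte-scale raw history stays in Hadoop.
raw = spark.read.parquet("hdfs:///data/raw/events")

# Boil it down to the months of ad-hoc, performant data the business queries daily.
last_mile = (
    raw.filter(F.col("event_date") >= "2016-01-01")
       .groupBy("event_date", "product_line")
       .agg(F.count("*").alias("event_count"),
            F.sum("units").alias("total_units"))
)

# Stage the much smaller aggregate for the MPP bulk load
# (e.g., a scheduled Redshift COPY from this prefix would follow).
last_mile.write.mode("overwrite").csv("s3a://my-staging-bucket/last_mile/", header=True)
```

From there, your MPP platform’s bulk loader pulls the staged files into the relational tier, so analysts get full SQL and fast ad-hoc response without paying MPP prices for the Petabytes left in Hadoop.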
NoSQL (Not Only SQL) should be considered.
For data that doesn’t need a relational model, like key-value pairs, and will be accessed by front-end applications that make sense of the data themselves, NoSQL technologies should be considered. There are a lot of big players in this space (MongoDB, Cassandra, Riak); picking the right one depends on your use case. Right tool, right fit.
Some hardcore Hadoop’ers will bring up HBase for this type of workload. If it’s a small implementation, then go ahead and give it a try. Personally, I have found that it’s better to separate the NoSQL workload from the Hadoop cluster. The NoSQL players out there do a better job of resource management.
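As a rough illustration of the key-value fit (using MongoDB here purely as an example, with made-up database, collection, and field names), the front-end just reads and writes documents by key and makes sense of the payload itself:

```python
# Rough key-value sketch with MongoDB; database, collection, and fields are invented.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
profiles = client["appdata"]["user_profiles"]

# The front-end writes a document keyed by user_id; no relational model required.
profiles.replace_one(
    {"user_id": "u-12345"},
    {"user_id": "u-12345",
     "preferences": {"theme": "dark", "notifications": True},
     "last_seen": "2016-03-01T12:00:00Z"},
    upsert=True,
)

# And reads it straight back by the same key.
doc = profiles.find_one({"user_id": "u-12345"})
print(doc["preferences"])
```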
Here’s a good overview of NoSQL technologies: https://kkovacs.eu/cassandra-vs-mongodb-vs-couchdb-vs-redis (It’s not always up to date, but it gives you a great understanding of their strengths and weaknesses.)
Data Science means nothing without context; get the talent you need by training internally.
There are a ton of individuals on LinkedIn calling themselves a “Data Scientist”. (Nearly 27,000 as I write this, and that’s only 1st and 2nd level connections for me.) The problem with this is that “Data Scientist” is not a general title. A position like Systems Engineer/Administrator can jump between different business verticals and become a valuable individual contributor quickly. A Data Scientist can’t do the same .. they need the context of the data. A Data Scientist coming from retail or social media will have a very long ramp-up period before being valuable to a manufacturing company. They need to be intimate with the data and understand the story the business data is trying to tell.
I have found that it’s much better to train exceptional internal talent that understands the story of the data and has the hunger to learn the skills necessary. You don’t necessarily want them to become a “Data Scientist” but a “Data Hacker”. A “Data Hacker” is someone who combines the context of the data with the skills necessary to turn their theories into something tangible. Give this “Data Hacker” access to the data with available tools like Scala/Spark and Python/R on Hadoop, and magic can happen.
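To show what that can look like in practice, here is a small, hypothetical sketch of a “Data Hacker” testing a theory with PySpark against data already sitting in Hadoop/Hive (the table and column names are invented for illustration):

```python
# Hypothetical "Data Hacker" session: testing a theory about which factory
# tool is driving scrap, using data already in Hadoop/Hive.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("data-hacker-sandbox")
         .enableHiveSupport()
         .getOrCreate())

runs = spark.table("manufacturing.process_runs")  # invented table name

# Theory: certain tools are producing a higher scrap rate this quarter.
scrap_by_tool = (
    runs.filter(F.col("run_date") >= "2016-01-01")
        .groupBy("tool_id")
        .agg(F.avg(F.col("scrap_flag").cast("double")).alias("scrap_rate"),
             F.count("*").alias("runs"))
        .orderBy(F.desc("scrap_rate"))
)

scrap_by_tool.show(20)
```

The point isn’t the code; it’s that someone who already knows what the business data means can go from theory to evidence in an afternoon.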
A specialized and dedicated IT team should be assigned to Big Data/Clustered Technologies.
The saying “It Takes A Village” definitely applies when it comes to Big Data and Clustered Technologies. Typically, companies will throw Hadoop, MPP, and NoSQL into a general IT group that is used to supporting legacy RDBMS technologies. Then Big Data/Clustered technologies get lumped into the same place where data is stored and SQL queries are run against it for reporting purposes.
If you want to truly get the benefits out of your investment, you need a team that can help the business move from a “raw data > information” model to a “raw data > information > insights > impact” model. Help the business get the value out of the data so it’s not just about “reactive reporting” but about moving to “proactive tuning”.
Also, having this dedicated team helps the business build those valuable “Data Hackers” I mentioned above. Combining a business subject matter expert with a developer from IT can expedite the process of taking an idea and putting it into production. Pairing those two types, combined with training and knowledge transfer, can be a great way to start your internal training program.
A specialized and dedicated IT Big Data/Clustered technologies team has talent from all spectrums of IT: Development, Operations, and Support. The team should also be cross-functional in a “DevOps”-type model. By having everyone invested in the product you are trying to deliver to the business, quality goes up exponentially. Eliminating developers just throwing code over the wall to Operations and giving everyone ownership is critical.
Hadoop can exist in the Cloud successfully.
4+ years ago, I would never have even considered a cloud-based Hadoop deployment; too much latency, too much sharing, not enough horsepower, too many compromises. 3 years ago, I would have warned against it, but mostly because it was too expensive. Today, I wouldn’t deploy it any other way; agility, scalability, and cost are in line with on-premises deployments, and it’s easier to support with a globally distributed team. Something to note: this is still using the cloud service providers as an IaaS (Infrastructure as a Service) provider. There is still a lot we can do to optimize our environment to utilize cloud SaaS (Software as a Service) services, but I am still waiting for a lot of that to shake out before placing any long-term bets.
There is a lot more I have learned and changed my stance on, but I can’t put it in writing. It’s top secret, magic sauce kind of things. :-)
The thing I love most about Big Data is there is so much going on and the landscape is still growing and forming after all this time.
To all my Big Data folks, what are some of the things that you have changed your mindset on over the years? Also, feel free to disagree with my points .. I love a spirited debate! :-)
Please feel free to leave comments and questions below.
Thanks for reading.
-Big Data Joe