About Hadoop, Spark, and the Cloud
The Hadoop ecosystem is composed of many different tools: Ambari, HBase, Hive, Sqoop, Pig, ZooKeeper, Oozie, Flume, etc.
But one tool is more well-known than any other: Spark.
When somebody speaks about Hadoop, 99% of the time they are actually talking about Spark.
Spark is really the “heart” of the Hadoop ecosystem.
Spark is mostly used as a batch ETL tool.
The main selling point of Spark is its speed: it is supposed to execute fast (at least, that is what the marketing material on the Spark website claims).
The two YouTube videos below explain how one machine equipped with Anatella can be faster than a Spark cluster composed of more than 300 machines/nodes.
Why is Spark so slow? This is because of a mathematical law named “Amdahl’s Law”.
You’ll find here two YouTube videos that explain:
* Amdahl’s Law and the “incompressible time” of distributed computation engines.
* why you shouldn’t use Spark for ETL processes.
* why it’s better to avoid using “cloud solutions” (Amazon, Azure) for “data science” projects.
(subtitles in English and French are available).
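Amdahl’s Law itself is easy to state: if a fraction s of a job is inherently serial (cannot be parallelized), then the maximum speedup on N machines is 1 / (s + (1 - s) / N), which can never exceed 1/s no matter how many nodes are added. The short sketch below illustrates this; the 5% serial fraction is a hypothetical value chosen for illustration, not a measurement of Spark:

```python
def amdahl_speedup(n_machines: int, serial_fraction: float) -> float:
    """Speedup on n_machines when serial_fraction of the work
    cannot be parallelized (Amdahl's Law)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_machines)

# Even with only 5% serial work (a hypothetical value), 300 machines
# deliver far less than a 300x speedup, and the speedup is capped at
# 1 / 0.05 = 20x regardless of cluster size.
for n in (1, 10, 100, 300):
    print(n, round(amdahl_speedup(n, 0.05), 2))
```

This is the core of the argument made in the videos: past a certain point, adding nodes to a distributed engine yields almost no additional speedup, while the fixed coordination overhead keeps being paid.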
The presentation used in the two videos:
A quick one-page executive summary about the two videos:
A white paper that summarizes the findings explained in the two videos:
To see the video from Mr. Frédéric Pierucci:
The GitHub repository with the Anatella graphs and the Scala code used in the video: