When you first heard about Spark, you probably did a quick Google search and found that Apache Spark runs programs up to 100 times faster than Hadoop MapReduce in memory, or 10 times faster on disk. This direct comparison with Hadoop may have made you wonder whether Spark replaces Hadoop. And if you recently spent time and effort learning Hadoop, you are probably wondering whether you invested in an obsolete technology. To give you some relief right away: Spark does not replace Hadoop, and if you are a Hadoop developer your job is not in danger.
Our goal in this post is not to steer you toward, or favor, one technology over the other. Our goal is to look at both Spark and Hadoop objectively and understand where each tool applies, so that when you are asked to make tool choices and design decisions in your next project at work, you can do so with confidence.
We have a free course on Spark named Spark Starter Kit. Click here to enroll and start learning Spark right away.
Spark vs. Hadoop – Storage, Computation & Speed, Resource Management
We are going to compare Hadoop and Spark on four fundamental aspects: storage, computation, computational speed and resource management. Before we do that, a little introduction to Spark. Spark's tagline on its website is "Lightning-fast cluster computing", and just below it you will find a one-line statement explaining what Spark is: "Apache Spark is a fast and general engine for large-scale data processing."
Spark is an execution engine which can do fast computations on big datasets. Let's say you want to calculate the average volume by stock symbol in our stocks dataset. You would write a few lines of code, probably in Scala or Python (or even Java if you like), and Spark would spin up tasks on different machines in your cluster, perform the computation and get you the result fast. So far, very similar to a Hadoop MapReduce execution, correct?
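For illustration, such a job might look like the minimal PySpark sketch below. The file path and the column names (symbol, volume) are assumptions made up for this example, not part of the original dataset description.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("avg-volume-by-symbol").getOrCreate()

# Hypothetical stocks dataset with "symbol" and "volume" columns;
# the path and schema are assumptions for illustration only.
stocks = spark.read.csv("hdfs:///data/stocks.csv", header=True, inferSchema=True)

# Spark splits this aggregation into tasks that run on different machines.
avg_volume = stocks.groupBy("symbol").agg(F.avg("volume").alias("avg_volume"))
avg_volume.show()
```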
That is correct, but the main differentiator is speed. When it comes to computation, Spark is faster than Hadoop. How is it faster? Great question. If you look at the Apache Spark website again, you will find that Apache Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing. The next time you see a Spark developer, ask him or her how Spark performs computation faster; you will most likely hear "in-memory computation", and you may be surprised to have terms like DAG and caching thrown at you. The sad truth is that many Spark developers don't clearly understand how Spark achieves faster execution. But you don't have to worry, we won't leave you in the dark.
Let's compare Spark and Hadoop side by side. Before we start with the comparison, let's recap what Hadoop is all about. Hadoop has 2 core components: HDFS and MapReduce. HDFS provides a reliable and scalable storage solution for big datasets, while MapReduce is a programming model which helps with big data computations.
Spark vs. Hadoop – Storage
Let's start the comparison with storage. When you look at Spark's tagline and its one-line description on Spark's website, you will find no mention of storage. That is because Spark does not offer a storage solution for your big datasets.
So where does Spark load data from for computation, and where does it store the results afterwards?
For large datasets, Spark simply leverages existing distributed filesystems like Hadoop's HDFS, cloud storage solutions like Amazon S3, or even big data databases like HBase and Cassandra. Why doesn't Spark offer a storage solution? Because that is not Spark's focus. Spark focuses on fast computation, and that is its strength, so it leverages existing solutions like HDFS, S3 and HBase for storage. This does not mean Hadoop cannot leverage existing storage solutions like S3 or HBase; it most definitely can. But the point to understand is that Hadoop comes with its own storage solution, HDFS, whereas Spark doesn't.
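As a quick illustration of how Spark simply points at existing storage, here is a hedged sketch of reading the same kind of dataset from HDFS and from S3. The paths are hypothetical, and the S3 read assumes the hadoop-aws connector and credentials are already configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-sources").getOrCreate()

# Hypothetical locations: Spark reads directly from whichever storage
# system holds the data instead of providing storage of its own.
from_hdfs = spark.read.csv("hdfs://namenode:8020/data/stocks.csv", header=True)

# Requires the hadoop-aws connector and S3 credentials to be configured.
from_s3 = spark.read.csv("s3a://my-bucket/data/stocks.csv", header=True)
```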
Spark vs. Hadoop – Computation
Next, let's focus on MapReduce. Once I heard someone say that Spark's job executions are fast because Spark does not use the MapReduce model. This is simply not true. As you know, MapReduce is a programming model developed by Google to facilitate distributed computation over large datasets, and Hadoop offers an open source implementation of MapReduce. If you have gone through the MapReduce chapter in any of the Hadoop In Real World courses, you will know that MapReduce is made up of 3 phases: Map, Reduce and Shuffle.
Let's now look at Apache Spark's homepage again. There you can find a four-line code snippet written in Python to compute a word count on a file, and in it you can clearly see the use of both map and reduce functions. So the next time someone says Spark does not use MapReduce concepts, point them back to Spark's homepage and the map and reduce functions in use. The conclusion is that Spark uses the MapReduce programming model, that is, the Map, Reduce and Shuffle phases, during computation. So the concepts you learnt with MapReduce are still relevant, not obsolete.
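For reference, a word count along the lines of the snippet on Spark's homepage might look like the sketch below; the input and output paths are placeholders, not from the original. flatMap and map play the role of the Map phase, and reduceByKey shuffles the pairs by key before reducing them.

```python
from pyspark import SparkContext

sc = SparkContext(appName="word-count")

# Placeholder input path. flatMap/map act as the Map phase and
# reduceByKey shuffles the (word, count) pairs before reducing them.
counts = (sc.textFile("hdfs:///data/input.txt")
            .flatMap(lambda line: line.split(" "))
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("hdfs:///data/word-count-output")
```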
Spark's goal is not to find an alternative to MapReduce, but to replace Hadoop's implementation of MapReduce with its own implementation, which is much faster and more efficient.
Spark vs. Hadoop – Speed
Spark's home page proudly claims to be 100 times faster than Hadoop, with an impressive graph to support it. This is where we need to pay close attention. Look at the text below the graph; it reads "Logistic regression in Hadoop and Spark". Logistic regression is a statistical model often used in machine learning to relate a binary variable to a set of parameters. If that did not make much sense, here is an example: with logistic regression, given a set of parameters such as age, gender, smoking time and number of heart attacks, you can predict the probability of a person's death within 10 years.
Again, logistic regression relates a binary variable to a set of parameters. In our example the binary variable is being alive or dead; it is binary because there are only 2 possible values, alive or dead. The parameters in our example are age, gender, smoking time and so on, and with them logistic regression helps us predict the probability of being alive or dead in 10 years.
Logistic regression is often used in machine learning, and it involves executing a set of instructions on a dataset to get an output. The same set of instructions is then executed on that output, producing another output, and then on that output again, and the cycle goes on. This type of processing is very common in machine learning and is called iterative machine learning, or simply iterative processing. Logistic regression is a good example of an iterative algorithm. In the Hadoop Developer In Real World course, we saw how to calculate PageRank on a set of Wikipedia pages; those who took the course and went through the PageRank lessons will remember that we executed the same job multiple times. PageRank is another example of iterative processing.
Spark is extremely fast compared to Hadoop when we deal with iterative processing. How? One of the main reasons is that Spark keeps and operates on data in memory across iterations. There are other important reasons as well, which we will cover one by one in upcoming posts; it would be overwhelming to cover every one of them here. But what about use cases which are not similar to logistic regression? What if all you want to do is calculate the average volume by stock symbol in a stocks dataset? In such traditional use cases Spark will still be faster than Hadoop, but not by a factor of 100. It is safer to assume Spark is, on average, about 10 times faster than Hadoop, because not all use cases resemble logistic regression.
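To make the in-memory point concrete, here is a minimal sketch (not taken from any course material) of an iterative job: a toy single-variable regression where the same computation runs repeatedly over a dataset that cache() keeps in memory after the first pass. The input path and the model are assumptions for illustration.

```python
from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Hypothetical input: each line is "x,y". cache() keeps the parsed dataset
# in memory after the first pass, so later iterations reuse it instead of
# re-reading it from disk.
points = (sc.textFile("hdfs:///data/points.csv")
            .map(lambda line: tuple(float(v) for v in line.split(",")))
            .cache())

n = points.count()
w = 0.0
for i in range(10):
    # The same computation runs over the cached data in every iteration
    # (gradient of squared error for the toy model y ~ w * x).
    gradient = points.map(lambda p: 2 * (w * p[0] - p[1]) * p[0]).sum() / n
    w -= 0.1 * gradient
```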
Given that Spark excels at iterative processing, which is an essential part of machine learning, Spark is an ideal tool of choice for machine learning. Spark comes with an inbuilt machine learning library named MLlib, and this extensive support for machine learning makes Spark an ideal tool of choice for data scientists.
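Here is a minimal, hedged sketch of what using MLlib can look like, with a tiny made-up training set standing in for real data.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# A tiny made-up training set: (features, label) rows standing in for real data.
training_df = spark.createDataFrame(
    [(Vectors.dense([45.0, 0.0, 2.0]), 1.0),
     (Vectors.dense([30.0, 1.0, 0.0]), 0.0)],
    ["features", "label"])

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
model = lr.fit(training_df)
model.transform(training_df).show()
```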
So far we have compared Spark and Hadoop from 3 angles – Storage, Computation and Speed.
Spark vs. Hadoop – Resource Management
Let's now talk about resource management. In Hadoop, when you want to run mappers or reducers, you need cluster resources like nodes, CPU and memory to execute them. Hadoop uses YARN for resource management, and applications in Hadoop negotiate with YARN to get the resources they need for execution.
When you execute a Spark job, say to compute a word count on a file, Spark needs to request cluster resources like memory and CPU to execute tasks on multiple nodes in the cluster. This means Spark has to negotiate with a resource manager like YARN to get the resources it needs to execute the job. However, Spark comes with an inbuilt resource manager that can perform the same function, so we don't strictly need YARN; Spark provides a resource manager out of the box. There is an interesting option for resource management, though, which we will discuss in a bit.
Spark – Modes of Execution
So, as the illustration shows, Spark does not offer a storage solution. Spark uses MapReduce concepts like Map, Reduce and Shuffle, and it aims to replace Hadoop's implementation of MapReduce with a much faster and more efficient execution engine. Spark also has an out-of-the-box solution for resource management. Spark's strength is execution, not storage, which means Spark is not designed to replace distributed storage solutions like HDFS or S3, nor does it aim to replace NoSQL databases like HBase, Cassandra, etc.
With all the information we now have, let's ask a simple question: do we need a Hadoop cluster to run Spark? The short answer is no.
You can run Spark completely standalone, meaning without a Hadoop cluster, but Spark still has to get its data from a filesystem like HDFS or S3. Spark can also read files from the local filesystem, but that is not ideal, because the data would have to be available on every node in the cluster, since Spark can run computational tasks on any node. So if you have a 1 TB dataset and decide to read it from the local filesystem, you need to copy it to every node in your Spark cluster, which clearly results in a lot of wasted space. The local filesystem is therefore not ideal for storing data, and Spark has to leverage other existing storage systems, like HDFS from another Hadoop cluster or S3. This means that irrespective of which storage solution you use, when you execute a Spark job against a dataset, the dataset has to be brought over the network from storage.
What if you have an existing Hadoop cluster? You can still run Spark as a separate cluster and use the HDFS in your Hadoop cluster for storage, with Spark getting the data from HDFS. There are a couple of downsides to having one cluster for Hadoop and a separate cluster for Spark. The first obvious downside is the maintenance and cost overhead of supporting 2 separate clusters. The second downside is that every time you refer to a dataset in HDFS, which lives in a separate cluster, the data has to be copied from the Hadoop cluster to the Spark cluster before you can execute anything on it. This means there is no data locality, and we know network bandwidth is a scarce resource.
For these reasons, if you already have a Hadoop cluster it makes perfect sense to run Spark on the same cluster as Hadoop. Spark will then execute its jobs on the same nodes where the data is stored, which avoids bringing data over the network from another cluster or a service like S3. As a bonus, Spark can leverage YARN in your Hadoop cluster to manage resources instead of using Spark's out-of-the-box resource manager. Yes, that is correct: you can choose to use YARN instead of Spark's standalone resource manager, and Spark allows that. If you already have a Hadoop infrastructure, this setup makes a lot of sense.
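As a sketch of that choice, the same application code can target either resource manager just by changing the master setting; in practice the master is usually supplied through spark-submit, and running against YARN assumes the Hadoop configuration is visible to Spark.

```python
from pyspark.sql import SparkSession

# The same application can run under different resource managers; only the
# master setting changes. "yarn" hands resource negotiation to the Hadoop
# cluster's YARN (assuming its configuration is visible to Spark), while
# something like "spark://master-host:7077" would use Spark's standalone manager.
spark = (SparkSession.builder
         .appName("spark-on-yarn-sketch")
         .master("yarn")
         .getOrCreate())
```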
Is Spark a replacement for Hadoop?
Spark is intended to enhance the Hadoop stack, not replace it. Spark's core competency is speed, and hence it makes sense to use Spark more and more for computational needs. This also makes Spark a great tool for data analysts who currently use tools like Pig or Hive; they can use Spark to get faster execution times and results.
Should we convert all the existing MapReduce jobs in our Hadoop cluster to Spark jobs? Here is what we would suggest: if you are creating new jobs, definitely look to implement them in Spark. If you have a lot of existing MapReduce jobs, it is probably not worth the effort to convert them all; look at the ones which are slow and target those for migration to Spark.
You read all the way to the end and this means you will love our free course on Spark. Click here to enroll and start learning Spark right away.
Summary
To summarize: the power of Spark lies in its computational speed, its execution engine is very efficient compared to Hadoop's MapReduce implementation, and Spark is designed to work well on top of an existing Hadoop cluster, leveraging YARN for resource management and HDFS for storage. We hope this post gave you a good idea of Spark and offered some key insights into how it compares with Hadoop. Keep this in mind: Spark is intended to enhance, not replace, the Hadoop stack. So don't plan to decommission your existing Hadoop cluster just yet.