The popular open source big data processing framework Apache Spark has become one of the most talked-about pieces of technology in recent years.
The popularity of the framework, which is designed around speed and ease of use, has seen the likes of IBM, Microsoft, and others align their own analytics portfolios around the technology.
Building on the Hadoop MapReduce model, Spark extends it to support more types of computation, including interactive queries and stream processing.
Spark can be deployed in three different ways: as a standalone deployment, on Hadoop YARN, or as Spark in MapReduce (SIMR).
In a standalone deployment, Spark sits on top of the Hadoop Distributed File System (HDFS), with space explicitly allocated for HDFS. In this model, Spark and HDFS run side by side to cover all Spark jobs on a cluster.
Running on YARN means that Spark runs without any pre-installation or root access required, while Spark in MapReduce allows a user to start Spark and use its shell without any admin access.
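As a sketch of what these modes look like in practice, the following `spark-submit` invocations target a standalone master and a YARN cluster respectively. The host name, port, and application file are placeholders, not values from this article:

```shell
# Submit an application to a standalone Spark master
# (master-host:7077 and my_app.py are placeholder values)
spark-submit --master spark://master-host:7077 my_app.py

# Submit the same application to Hadoop YARN, running the driver
# inside the cluster; no separate Spark standalone cluster is needed
spark-submit --master yarn --deploy-mode cluster my_app.py
```

In both cases the application code stays the same; only the `--master` setting changes, which is what makes the deployment modes largely interchangeable from the developer's point of view.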
The in-memory processing engine offers development APIs in Scala, Java, Python, and R, and is designed to let data workers run machine learning algorithms that require fast iterative access to datasets.
At the core of Spark is the Resilient Distributed Dataset (RDD), its primary data abstraction: a resilient, distributed collection of records.
RDDs are collections of elements, partitioned across the nodes of the cluster, that can be operated on in parallel.
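To make the RDD idea concrete without a Spark cluster, here is a toy pure-Python model of a partitioned, lazily evaluated dataset. The `ToyRDD` and `parallelize` names are invented for illustration; this is a sketch of the concept, not Spark's actual implementation, though the shape mirrors PySpark code such as `sc.parallelize(range(10)).map(lambda x: x * x).collect()`:

```python
from concurrent.futures import ThreadPoolExecutor

class ToyRDD:
    """Toy model of an RDD: data split into partitions, with lazy
    transformations that only run when an action is called."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per "node"
        self.ops = ops                # recorded transformations, not yet run

    def map(self, fn):
        # Transformations are lazy: just record the operation.
        return ToyRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self.partitions, self.ops + (("filter", fn),))

    def _run_partition(self, part):
        # Apply the recorded pipeline to one partition's elements.
        for kind, fn in self.ops:
            if kind == "map":
                part = [fn(x) for x in part]
            else:  # filter
                part = [x for x in part if fn(x)]
        return part

    def collect(self):
        # The action: evaluate every partition's pipeline in parallel,
        # then gather the results back into one local list.
        with ThreadPoolExecutor() as pool:
            results = pool.map(self._run_partition, self.partitions)
        return [x for part in results for x in part]

def parallelize(data, num_partitions=4):
    """Split a local collection into partitions, echoing the role of
    SparkContext.parallelize in the real API."""
    data = list(data)
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return ToyRDD([data[i:i + size] for i in range(0, len(data), size)])

rdd = parallelize(range(10), num_partitions=3)
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

The key property the sketch preserves is that `map` and `filter` build up a plan without touching the data; only `collect` triggers computation, which is what lets Spark optimize and distribute work across a cluster.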
Apache Spark can be downloaded from the Apache Software Foundation site, which lists numerous Spark releases and package types so that users can find the right version for their purposes.
Due to its popularity, Spark is also widely available through a number of vendors across the Hadoop ecosystem, such as Cloudera, Databricks, and Hortonworks.