June 13, 2016 (updated 22 Sep 2016, 12:13pm)

Apache Spark tutorial

Find out a little bit more about this open source big data tool.

By James Nunns

The open source big data processing framework Apache Spark has become one of the most talked-about pieces of technology in recent years.

The popularity of the framework, which is designed around speed and ease of use, has seen the likes of IBM, Microsoft, and others align their own analytics portfolios around the technology.

Spark builds on the MapReduce model popularised by Hadoop and extends it to support more types of computation, including interactive queries and stream processing.

Spark can be deployed in three ways: as a standalone deployment, on Hadoop YARN, or as Spark in MapReduce (SIMR).

In a standalone deployment, Spark sits on top of the Hadoop Distributed File System (HDFS), with space allocated explicitly for HDFS. In this model Spark and HDFS run side by side to cover all Spark jobs on a cluster.

Running on YARN means Spark needs no pre-installation or root access, while Spark in MapReduce lets a user start Spark and use its shell without administrative access.

Spark is an in-memory processing engine that offers development APIs in Scala, Java, Python, and R. It is designed to let data workers run machine learning algorithms that require fast, iterative access to datasets.
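
To illustrate why in-memory access matters for iterative algorithms, here is a minimal sketch in plain Python. It does not use Spark's API. `CountingSource` and `iterate_mean` are hypothetical names invented for this example, and the "storage" is just a list that counts how often it is read. In PySpark the comparable step would be calling `cache()` on an RDD before iterating over it.

```python
class CountingSource:
    """Hypothetical stand-in for slow storage that counts how often it is read."""

    def __init__(self, values):
        self._values = list(values)
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._values)


def iterate_mean(load, steps, rate=0.5):
    """Approach the dataset mean iteratively, calling load() once per step."""
    estimate = 0.0
    for _ in range(steps):
        data = load()
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= rate * gradient
    return estimate


source = CountingSource([2.0, 4.0, 6.0])

# Without caching: every iteration goes back to storage.
uncached = iterate_mean(source.load, steps=20)
reads_without_cache = source.reads

# With caching: load once, then iterate over the in-memory copy.
source.reads = 0
in_memory = source.load()
cached = iterate_mean(lambda: in_memory, steps=20)
reads_with_cache = source.reads
```

Both runs arrive at the same estimate, but the uncached run reads the source once per iteration while the cached run reads it once in total.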

At the core of Spark is the Resilient Distributed Dataset (RDD), its primary data abstraction: a resilient, distributed collection of records.

RDDs are collections of elements partitioned across the nodes of the cluster so that they can be operated on in parallel.
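
The idea of partitioning a collection and operating on the partitions in parallel can be sketched in plain Python. This is an illustration only, not Spark's implementation or API: `partition` and `parallel_map_reduce` are hypothetical helpers, and threads on one machine stand in for the nodes of a cluster. In PySpark, the comparable calls would be `SparkContext.parallelize` followed by `map` and `reduce` on the resulting RDD.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(records, num_partitions):
    """Split records into num_partitions roughly equal, contiguous chunks."""
    size, extra = divmod(len(records), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def parallel_map_reduce(chunks, map_fn, reduce_fn):
    """Map and reduce each non-empty chunk in its own worker, then merge."""
    non_empty = [chunk for chunk in chunks if chunk]

    def process(chunk):
        # Each worker handles its own partition independently.
        return reduce(reduce_fn, (map_fn(x) for x in chunk))

    with ThreadPoolExecutor(max_workers=len(non_empty)) as pool:
        partials = list(pool.map(process, non_empty))
    # Combine the per-partition results into a single value.
    return reduce(reduce_fn, partials)


# Sum of squares of 1..10, computed across three partitions.
chunks = partition(list(range(1, 11)), 3)
total = parallel_map_reduce(chunks, lambda x: x * x, lambda a, b: a + b)
```

Because each partition is reduced on its own before the partial results are merged, the combining function has to be associative, which addition is here.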

Apache Spark can be downloaded from the Apache Software Foundation site, which lists the available Spark releases and package types so that users can find the right version for their purposes.

Due to its popularity, Spark is also available through a number of vendors across the Hadoop ecosystem, such as Cloudera, Databricks, and Hortonworks.
