Apache Spark is set for a major revision, with version 2.0 of the software now revealed.
Speaking at the Spark Summit East in New York, Matei Zaharia, creator of Spark, said that the new version would arrive in April or May this year, and assured the audience that the majority of APIs would not change.
Three new features were highlighted for the upcoming version: Tungsten Phase 2, which will bring Spark closer to bare metal; Structured Streaming, a real-time engine built on SQL/DataFrames; and the unification of Datasets and DataFrames.
Key improvements will come to data streaming with Spark Streaming, an area of the open source technology that has grown in popularity because of the increasing amount of web and mobile data being analysed by organisations.
Data streaming is already present in the Hadoop world with Apache Storm, but Spark is the technology that has received the most attention, with companies such as IBM developing around it.
Its popularity stems from allowing analysts and developers to work with up-to-date data, which makes the results of their work more accurate and timely.
The streaming updates matter for emerging distributed processing technologies that are based on Lambda architectures. A Lambda architecture runs offline batch processing pipelines alongside real-time processing pipelines for data analytics.
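The Lambda idea described above can be sketched in a few lines of plain Python. This is an illustration only, not Spark code: the names batch_view, SpeedLayer and query, and the page-view events, are hypothetical and chosen for the example. A batch layer recomputes counts from the historical log, a speed layer keeps running counts for recent events, and a query merges the two.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute page-view counts from the full historical log."""
    return Counter(event["page"] for event in events)


class SpeedLayer:
    """Speed layer: incremental counts for events the batch run has not covered yet."""

    def __init__(self):
        self.counts = Counter()

    def ingest(self, event):
        self.counts[event["page"]] += 1


def query(page, batch, speed):
    """Serving step: merge the batch view with the real-time view."""
    return batch[page] + speed.counts[page]


# Hypothetical data: three historical events, two that arrived afterwards.
historical = [{"page": "home"}, {"page": "home"}, {"page": "pricing"}]
batch = batch_view(historical)

speed = SpeedLayer()
for event in [{"page": "home"}, {"page": "docs"}]:
    speed.ingest(event)

print(query("home", batch, speed))
```

The cost of this design is that the same counting logic has to be maintained in two places, once for the batch path and once for the real-time path.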
In the end it all boils down to reducing the time to action.
Spark’s popularity has made it a viable alternative to MapReduce, the original data processing engine for big data analytics. Although MapReduce is likely to remain in use for the time being, Spark has become the most popular technology in Hadoop.
Updates to Spark will add a high-level API on top of the Spark SQL engine, which aims to make it easier to develop applications that deal with event timing. Structured Streaming will support both batch and real-time analytics. The 2.0 update will focus particularly on applications that use ETL jobs.
According to the Spark developers, the new version will deliver speed improvements of five to ten times.
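To illustrate what working with event time means, here is a minimal sketch in plain Python. It does not use the Spark API; WINDOW_SECONDS, window_start and update_counts are hypothetical names, and the events are made-up (event_time, payload) pairs. Events are counted in fixed windows according to the time they occurred rather than the time they arrived, and the same function serves both a one-shot batch run and an incremental run over small batches.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window size for this example


def window_start(event_time):
    """Return the start of the fixed window that contains event_time."""
    return event_time - (event_time % WINDOW_SECONDS)


def update_counts(counts, events):
    """Fold events into per-window counts keyed by event time, not arrival order."""
    for event_time, _payload in events:
        counts[window_start(event_time)] += 1
    return counts


# Events arrive out of order; the first field is the event time in seconds.
all_events = [(5, "a"), (61, "b"), (30, "c"), (119, "d"), (59, "e")]

# Batch style: process everything in one pass.
batch_counts = update_counts(defaultdict(int), all_events)

# Streaming style: process the same events as two small batches.
stream_counts = defaultdict(int)
for micro_batch in (all_events[:2], all_events[2:]):
    update_counts(stream_counts, micro_batch)

print(dict(batch_counts) == dict(stream_counts))
```

Because the counts are keyed by event time, the batch and incremental runs agree even though the events arrive out of order.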
Alongside the latest version of Spark, Databricks, which was founded by the creators of Apache Spark, said that it is releasing a beta version of its Community Edition, a free version of its cloud-based big data platform.
The service is designed to give users access to a micro-cluster, along with cluster management and a notebook environment. The idea is that developers will be able to use it to learn Spark without needing to set up and run their own cluster environment.
Additionally, Databricks revealed Dashboards, a visual reporting application for Spark clusters that can be used to provide reports and queries.
Dashboards is an alternative view of a Databricks notebook, aimed at end users who want to see different views of their data. Importantly, the software can be used without any Spark knowledge or access to critical code.