How Hadoop is putting a Spark into enterprise big data

A quick glance at any of the major analyst firms’ outlooks for big data will show you that the market is growing and much of the attention is on Hadoop.

The number of enterprises with big data workloads in production jumped by nearly 30% between 2014 and 2015, and this trend is expected to continue.

Hadoop’s diverse community has benefited from the growing importance of the big data market, with companies like Cloudera and Hortonworks growing at a rapid pace as their technologies gain traction across industries.

It shouldn’t come as a great surprise that the technology is gaining a great deal of attention. It has been synonymous with big data since its creation in 2005 by Doug Cutting, who was working at Yahoo and is now chief architect at Cloudera, and Mike Cafarella, who is a professor of computer science at the University of Michigan.

A large community has helped support the growth of Hadoop, and technologies such as Apache Spark are making it an appealing option for businesses going down the big data path.

Some of the lures of Hadoop include its ability to deliver large-scale data analytics at a lower cost, and at greater speed, than many analytics platforms.

Barclays Bank is one notable adopter of the technology, having moved away from an Oracle database.

In an interview with CBR, Barclays Head of Information Peter Simon said that it had been taking about six weeks to process data across its small business customers; with Hadoop, that has been reduced to about 21 minutes.

Questions have previously been asked about use cases for Hadoop in the enterprise, with the technology seeing plenty of proofs of concept but fewer large-scale deployments.

As more examples like that at Barclays emerge, though, expect to see more enterprises follow suit.

Hadoop appeals to a wide range of industries for use cases such as data discovery, where you can use Hadoop as a sandbox to look for new patterns and ideas for new products. ETL offload is another popular use case, where you move high-scale data transformation workloads to Hadoop, similar to the work undertaken by Barclays.

One of the challenges that the technology has faced for years is its complexity. For starters, Hadoop is not one thing.

Hadoop is a Java-based programming framework that supports the processing of large data sets in a distributed computing environment.

The software framework can be used for storing data and running applications on clusters of commodity hardware.

So Hadoop allows you to store very large data files, but the real value proposition is getting value from that data, which is where technologies like MapReduce come in.

MapReduce moves the processing software to the data rather than the data to the software; moving the data can be an extremely long process, depending on how much data you actually have.
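To make the programming model concrete, here is a minimal sketch of the map, shuffle and reduce steps as a word count. It is plain single-process Python, not the Hadoop API, and the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are illustrative assumptions rather than anything taken from Hadoop itself.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, the step a framework performs
    between the map and reduce stages."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Collapse all values for one key into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run the three phases in sequence over an iterable of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

On a real cluster the map step would run on the machines that already hold each block of the input, which is the "move the processing to the data" idea; this sketch only shows the shape of the computation.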

While MapReduce is the incumbent technology, the Apache Spark technology is quickly surpassing it in terms of popularity.

According to research by Syncsort, which polled over 250 data architects, IT managers, developers and data scientists, with 66% from organisations with revenues over $100 million, Apache Spark is gaining the most interest.

The research found that 70% are most interested in Spark, while MapReduce is of interest to 55%. However, MapReduce will likely remain the most prevalent compute framework in production, although Spark deployments are likely to increase.

Spark’s popularity has seen major vendors such as IBM, Cloudera and Hortonworks all develop around it and contribute to its growth into the number one choice.

Andy Leaver, VP International, Hortonworks, told CBR why Spark is becoming the most popular framework: "Developers and data scientists have turned to Spark for predominantly data-science-like processing workloads."

One driver of the popularity is that Hadoop vendors including Hortonworks have backed Spark as an additional method of processing data stored in HDFS.

Integration with the YARN framework means that Spark workloads can also run alongside other batch and interactive processing workloads without competing for resources in a cluster.

Like MapReduce, Spark is an open source big data processing framework, but it is a lot faster than MapReduce, by some estimates up to 100 times faster. This is due to the fact that Spark processes data in memory.
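One way to picture the in-memory difference is a job made of several passes over the same data. The sketch below is plain Python, not Spark or Hadoop code, and its function names are hypothetical. It contrasts a chain of passes that hands data from one step to the next through a file with a chain that keeps the working set in memory, and counts the disk reads each approach needs. It makes no claim about real-world speed-ups.

```python
import json
import os
import tempfile


def run_passes_via_disk(values, passes):
    """Each pass reads its input from disk and writes its output back,
    the way a chain of batch jobs hands data from one job to the next."""
    disk_reads = 0
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "data.json")
        with open(path, "w") as handle:
            json.dump(values, handle)
        for _ in range(passes):
            with open(path) as handle:
                current = json.load(handle)
            disk_reads += 1
            current = [value * 2 for value in current]
            with open(path, "w") as handle:
                json.dump(current, handle)
        # One final read to collect the result of the last pass.
        with open(path) as handle:
            result = json.load(handle)
        disk_reads += 1
    return result, disk_reads


def run_passes_in_memory(values, passes):
    """Every pass works on the list already held in memory."""
    current = list(values)
    for _ in range(passes):
        current = [value * 2 for value in current]
    return current, 0


if __name__ == "__main__":
    print(run_passes_via_disk([1, 2, 3], passes=3))
    print(run_passes_in_memory([1, 2, 3], passes=3))
```

Both functions return the same result; only the number of round trips to storage differs, and that gap grows with the number of passes, which is why iterative workloads such as machine learning tend to favour the in-memory approach.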

Tendü Yogurtçu, GM, Syncsort’s Big Data business, told CBR: "The main difference is that MapReduce is very much designed for the batch type of workload."

Spark also offers comfortable APIs for Java, Scala and Python, and it is designed to be easier to program than MapReduce. It really made its name by being very good at predictive analytics and machine learning.

So while the writing may not be on the wall for MapReduce quite yet, it is certainly being surpassed in many ways.

Leaver said: "Over time more and more workloads will inevitably move from MapReduce to Spark. Chances are however that MapReduce will still have a role for certain types of processing much like mainframes still have a role in the world of desktop and laptop computers."

The question for the technology will be how many platforms are supported in the Hadoop ecosystem. Like OpenStack, it benefits greatly from its open nature, in which many different options are supported.

However, with so many choices comes complexity and that doesn’t always make it easy to adopt, so hard choices need to be made as to what becomes the standard.

On the price front, Hadoop has the benefit of being open source, but it will clearly still require money to be spent on machines and staff. The cost factor, though, is a major selling point for many people. Currently around 63% feel that Hadoop will help them increase business and IT agility, while 55% expect to increase operational efficiency and reduce costs.

The research suggests that these benefits will see workloads from mainframes and enterprise data warehouses being offloaded to Hadoop.

Add to these use cases the appeal of Hadoop as a way to innovate using data from social media and IoT in order to apply predictive analytics and you build a package of use cases for one piece of technology, rather than numerous ones that could lead to complexity.

Despite these benefits, though, there are downsides to the technology, such as its complexity and the need to keep up with new tools and skills.

To solve these problems there are a number of companies working in the Hadoop community. Hortonworks and Cloudera have already been mentioned, while IBM is also a big player, along with Datameer, Dell, Pentaho and MapR.

The list could go on, but you have to stop somewhere. The point is that some of the biggest companies and brightest people are working on the technology to perfect it for ease of use while maintaining its enterprise scale.

In addition to this, universities now have courses available to teach core Hadoop concepts and capabilities. Combined with the work being done in training by both vendors and systems integrators, this means the skills gap is narrowing.

Leaver said: "Taken in combination we are seeing the skills gap begin to close. The pool of people with Hadoop training or hands-on Hadoop skills is growing."

The different vendors will of course have their own strengths and weaknesses but the core technology remains the same so it should be simple enough to find one that really suits your needs.

For example, MapR allows Hadoop to be accessed via the Network File System for faster data management and system administration. Cloudera, meanwhile, offers unified batch processing, interactive SQL, interactive search and role-based access controls.

As the big data market has grown, so has the attention it has received from regulators; you just have to look at the data governance work being done by the EU. Combined with this is the security element: with so much data being held, making sure it is secure is of paramount importance.

Hadoop also has a role to play here due to data lakes being a common use case for the technology. Technology that protects the lake includes Apache Atlas and Apache Ranger for governance and security respectively.

The reason for mentioning this is that Hadoop and its community are aiming to make it the go-to choice for enterprises that are on a big data journey. Without the elements mentioned above, Hadoop can’t be considered a viable option, but with them it is increasingly becoming the best choice.
