Hadoop’s progression from a large-scale, batch-oriented analytics tool to an ecosystem full of vendors, applications, tools and services has coincided with the rise of the big data market.

While Hadoop has become almost synonymous with the market in which it operates, it is not the only option. Hadoop is well suited to very large-scale data analysis, which is one reason why companies such as Barclays, Facebook and eBay use it.

Although it has found success, Hadoop has its critics, who argue that it is poorly suited to smaller jobs and overly complex.

CBR identifies five Hadoop alternatives that may better suit your business needs.

 

1. Pachyderm

Pachyderm, put simply, is designed to let users store and analyse data using containers.

The company has built an open-source platform that uses containers to run big data analytics jobs. One of the benefits is that users don’t have to know anything about how MapReduce works, nor do they have to write any Java, the language most of Hadoop is written in.

Pachyderm hopes this makes the platform far more accessible and easier to use than Hadoop, and therefore more appealing to developers.


With containers growing significantly in popularity over the past couple of years, Pachyderm is in a good position to capitalise on the increased interest in the area.

The software is available on GitHub; users simply have to implement an HTTP server that fits inside a Docker container. As the company puts it: "if you can fit it in a Docker container, Pachyderm will distribute it over petabytes of data for you."
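
As a rough illustration, the "job" can be as small as the HTTP server sketched below, packaged into a Docker image. The POST handler and the line-counting logic are hypothetical choices made for this example, not Pachyderm's actual job contract, which is defined in its own documentation.

    # Hypothetical sketch: a tiny HTTP "job" that could be packaged into a Docker
    # image. The handler and the line-count logic are illustrative only.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    class JobHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            # Read the chunk of data sent to this worker and count its lines.
            length = int(self.headers.get("Content-Length", 0))
            body = self.rfile.read(length)
            result = str(body.count(b"\n")).encode()

            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(result)

    if __name__ == "__main__":
        # Listen on all interfaces so the container can receive work.
        HTTPServer(("0.0.0.0", 8080), JobHandler).serve_forever()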

 

2. Apache Spark

What can be said about Apache Spark that hasn’t been said already? The general-purpose compute engine, typically run over Hadoop data, is increasingly seen as the future of Hadoop, given its popularity, its speed and the wide range of applications it supports.

However, while it is typically associated with Hadoop deployments, Spark can work with a number of different data stores, such as Apache Cassandra and Amazon S3, and does not have to rely on Hadoop.

Spark can even run with no dependence on Hadoop at all, operating as an independent analytics tool.
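
For instance, a PySpark job can run entirely in local mode against a plain file, with no Hadoop cluster involved. In the sketch below the file name and column are placeholders:

    # Minimal PySpark sketch running in local mode, with no Hadoop cluster.
    # "events.csv" and its "country" column are made-up names for illustration.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")           # use local cores, no cluster manager
             .appName("standalone-spark")
             .getOrCreate())

    df = spark.read.csv("events.csv", header=True, inferSchema=True)
    df.groupBy("country").count().show()   # simple aggregation over the file

    spark.stop()

Pointing the same session at Amazon S3 or Cassandra is then a question of adding the relevant connector packages and changing the input path or format, not of switching engines.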

Spark’s flexibility has helped make it one of the hottest topics in the world of big data, and with companies such as IBM aligning their analytics strategies around it, the future looks bright.

 

3. Google BigQuery

Google seemingly has its fingers in every pie, and as the inspiration for Hadoop’s creation (the project grew out of Google’s MapReduce and Google File System papers), it is no surprise that the company has an effective alternative of its own.

The fully managed platform for large-scale analytics lets users work in SQL without having to worry about managing infrastructure or a database.

The RESTful web service is designed to enable interactive analysis of huge datasets, working in conjunction with Google Cloud Storage.
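
Queries can be issued through the REST API directly or through a client library. The sketch below uses the google-cloud-bigquery Python client against one of Google's public sample datasets; it assumes the library is installed and credentials are already configured, and is meant as an illustration rather than a setup guide.

    # Illustrative sketch using the google-cloud-bigquery Python client.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up the default project and credentials

    sql = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """

    # The service plans, runs and scales the query; only results come back.
    for row in client.query(sql).result():
        print(row.corpus, row.total_words)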


Users may be wary that a cloud-based service could introduce latency when dealing with large amounts of data, but given the reach of Google’s infrastructure it is unlikely that data will ever have to travel far, so latency shouldn’t be a big issue.

Other key benefits include its ability to work with MapReduce, as well as Google’s proactive approach to adding new features and generally improving the offering.

 

4. Presto

Presto, an open-source distributed SQL query engine designed to run interactive analytic queries against data of all sizes, was created by Facebook in 2012 when the company needed an interactive system optimised for low query latency.

Presto can query a number of data stores concurrently, something neither Spark nor Hadoop was built to do. This is made possible by connectors that provide interfaces for metadata, data locations and data access.

The benefit of this is that users don’t have to move data around from place to place in order to analyse it.

Like Spark, Presto is capable of offering real-time analytics, something that is in increasing demand from enterprises.

Presto supports standard ANSI SQL, including complex queries, aggregations, joins and window functions. Fans of Java will be happy to hear that this is the language the engine is implemented in.
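
The sketch below shows both points at once: a single query that joins tables living in two different catalogs and uses a window function, issued through the presto-python-client (prestodb) package. The coordinator address, catalogs, schemas and table names are placeholders invented for this example.

    # Hypothetical sketch using the presto-python-client (prestodb) package.
    # Coordinator address, catalogs, schemas and tables are placeholders.
    import prestodb

    conn = prestodb.dbapi.connect(
        host="presto.example.com",
        port=8080,
        user="analyst",
        catalog="hive",    # default catalog; the query can still reference others
        schema="web",
    )

    sql = """
        SELECT o.region,
               o.order_id,
               o.total,
               rank() OVER (PARTITION BY o.region ORDER BY o.total DESC) AS region_rank
        FROM hive.web.orders AS o
        JOIN mysql.crm.customers AS c ON o.customer_id = c.id
    """

    cur = conn.cursor()
    cur.execute(sql)  # Presto reads from both stores and federates the join itself
    for region, order_id, total, region_rank in cur.fetchall():
        print(region, order_id, total, region_rank)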

 

5. Hydra

Developed by the social bookmarking service AddThis, which was recently acquired by Oracle, Hydra is a distributed task processing system that is available under the Apache license.

It delivers real-time analytics to its users and was developed out of AddThis’s need for a scalable, distributed processing system.

Having decided that Hadoop wasn’t a viable option at the time, AddThis created Hydra to handle both streaming and batch operations through its tree-based structure.

This tree-based structure means that Hydra can store and process data across clusters that may have thousands of nodes.
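
To give a flavour of the idea, the sketch below is a minimal, hypothetical illustration of tree-style aggregation in Python. It is not Hydra's actual API or job format; it only shows why a tree turns queries into a walk down pre-aggregated nodes rather than a scan of raw records.

    # Hypothetical illustration of tree-style aggregation, not Hydra's real API.
    from collections import defaultdict

    def make_node():
        return {"hits": 0, "children": defaultdict(make_node)}

    root = make_node()

    def add_event(tree, path):
        """Insert one event under a path such as ['2016-03-01', 'US', 'share']."""
        node = tree
        node["hits"] += 1
        for part in path:
            node = node["children"][part]
            node["hits"] += 1

    # A small stream of (date, country, action) events.
    events = [
        ("2016-03-01", "US", "share"),
        ("2016-03-01", "US", "click"),
        ("2016-03-01", "UK", "share"),
    ]
    for date, country, action in events:
        add_event(root, [date, country, action])

    # Counts at every level are already aggregated as events arrive.
    print(root["children"]["2016-03-01"]["children"]["US"]["hits"])  # prints 2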

Hydra features a Linux-based file system alongside a job/client management component that automatically allocates new jobs to the cluster and rebalances existing ones. It can also automatically replicate data and handle node failures.