Michael Kopp

Over recent years, the term ‘Big Data’ has found its way into the common vernacular of the IT press.

Unsurprisingly, it has become a key focus of budget discussions within enterprise organisations as companies consider how to unlock the value of their data. Big Data projects are now moving from the experimental stage to delivering real returns on investment, with tools such as Hadoop and Cassandra forming an integral part of organisations’ enterprise-wide analytics platforms.

However, Big Data is a big investment, both in terms of money and time. The faster companies are able to glean insights from their data to support their business decisions, the more valuable those insights are. With this in mind, how well a company’s Big Data tools perform and the speed at which they can deliver information are critical. Companies need to take a best-practice approach to Big Data performance to eliminate the risks and costs associated with poor performance, availability and scalability.

As its name suggests, ‘Big Data’ is defined in part by the sheer size of the datasets being created in today’s enterprise, as well as the velocity at which that data is created. For organisations looking to better understand their customers, reduce operational costs, or gain a competitive edge through better-informed decision making, this influx of data can provide the answers: the challenge is in how to access and interpret it. Until recently this was impossible, but the new wave of Big Data solutions is changing that, allowing petabytes of information to be analysed in hours instead of months.

This is genuinely transformative technology that can change the way enterprise organisations operate. However, Big Data technologies also bring with them new performance risks and challenges. These must be addressed and managed; otherwise, the benefits of being able to churn through more data, more quickly, won’t materialise, and it won’t be long before end-users start to complain.

Predicting performance disasters with Hadoop and Cassandra

These new and unique performance challenges revolve around NoSQL databases such as Cassandra, HBase and MongoDB, and large-scale processing environments such as Hadoop. Most large organisations create gigabytes of data on a minute-by-minute basis and, as such, are looking to Hadoop MapReduce to help automate the analysis of complex queries on these extremely large datasets. This scale of processing and analysis was previously impossible using traditional analytics tools, which are unable to cope with the volume of data.
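
To ground the discussion, the canonical word-count example below shows the general shape of a MapReduce job written against Hadoop’s Java API: a mapper emits partial results, a reducer aggregates them across the cluster. It is a minimal sketch rather than a production analytics job, and the WordCount class name and the input/output paths passed on the command line are purely illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Minimal MapReduce job: counts occurrences of each word across a large input set.
    public class WordCount {

      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1) for every token
          }
        }
      }

      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();           // aggregate the partial counts for this word
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // combine locally to cut shuffle traffic
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }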

However, Hadoop does not run in isolation: MapReduce jobs always run within a Hadoop cluster, and to run efficiently and smoothly they depend on the availability of an underlying infrastructure of servers and virtual machines. The data structure layer on top of the raw data, along with the data and information flowing through the system, also affects both the distribution and the performance of MapReduce jobs. Any performance problem within the Hadoop environment will therefore not only slow the analysis, potentially breaching service level agreements (SLAs) if results are delayed, but also put additional pressure on an organisation’s hardware, driving up capex and opex as the tool runs inefficiently and consumes extra power.

Essentially, Hadoop and the MapReduce jobs it runs have many moving parts, and their distributed nature adds a layer of complexity that can leave IT administrators blind to what is happening inside the application. Given the extreme complexity and scale of Big Data applications, trying to locate and remediate performance bottlenecks and hot spots manually is impossible; automating application performance management (APM) is the only practical way to combat performance issues in these environments, because of the sheer volume of data that must be analysed to detect problems.
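
As a small illustration of the kind of automated check such tooling performs, the sketch below reads Hadoop’s built-in job counters after a run and flags two common symptoms of inefficiency. The JobHealthCheck class and its thresholds are assumptions made for the example, not a prescription; a real APM solution would track these signals continuously rather than per job.

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.JobCounter;
    import org.apache.hadoop.mapreduce.TaskCounter;

    // Illustrative post-run check: flag jobs whose built-in counters hint at common
    // inefficiencies (excessive spilling, poor data locality). Thresholds are arbitrary.
    public class JobHealthCheck {

      public static void report(Job job) throws Exception {
        Counters counters = job.getCounters();

        long mapOutputRecords = counters.findCounter(TaskCounter.MAP_OUTPUT_RECORDS).getValue();
        long spilledRecords   = counters.findCounter(TaskCounter.SPILLED_RECORDS).getValue();
        long launchedMaps     = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        long dataLocalMaps    = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();

        // Records spilled to disk far more often than they are produced suggest the
        // in-memory sort buffer is too small for the map output.
        if (mapOutputRecords > 0 && spilledRecords > 2 * mapOutputRecords) {
          System.out.println("WARN: heavy spilling (" + spilledRecords + " spilled vs "
              + mapOutputRecords + " map output records)");
        }

        // Many non-local map tasks mean input data is being pulled across the network.
        if (launchedMaps > 0 && dataLocalMaps < launchedMaps / 2) {
          System.out.println("WARN: only " + dataLocalMaps + " of " + launchedMaps
              + " map tasks were data-local");
        }
      }
    }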

An inability to manage the performance of Big Data applications effectively will, in many cases, mean a failure to meet SLAs, which inevitably leads to financial loss in some shape or form. Simply throwing more hardware at the problem might provide a short-term fix, but the ongoing associated costs make this a very inefficient solution. As with any enterprise application, it is important to have a high-level overview of the systems and how they are operating, so that bottlenecks and issues can be flagged early on. It is equally important to have tools that can drill down into the application code so that these problems can be resolved efficiently and accurately, without administrators having to disrupt other applications by churning through irrelevant log files.

A similar approach is needed for NoSQL databases. Cassandra databases scale horizontally and allow very low-latency requests, so they are typically used for applications where real-time insight is the name of the game. However, Cassandra depends on the native applications built around it, which means it is only as fast as its parts. To benefit from Cassandra’s speed, companies need to take into account the performance of every component across the application delivery chain; end-to-end visibility of the entire service delivery chain and all transaction processes is therefore crucial to spotting and addressing problems in the NoSQL environment.
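
The sketch below illustrates the point from the application’s side: it times a single read through the DataStax Java driver, so the figure it reports covers the driver, the network and the cluster together, i.e. the latency the application actually experiences. The contact point, datacenter name and the shop.orders_by_customer table are placeholders for whatever the real deployment uses.

    import java.net.InetSocketAddress;
    import java.time.Duration;
    import java.time.Instant;

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.ResultSet;

    // Illustrative client-side latency probe: measures one read end to end,
    // as seen by the calling application rather than by the database alone.
    public class CassandraLatencyProbe {

      public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
            .addContactPoint(new InetSocketAddress("127.0.0.1", 9042))
            .withLocalDatacenter("datacenter1")
            .build()) {

          Instant start = Instant.now();
          ResultSet rs = session.execute(
              "SELECT order_id, total FROM shop.orders_by_customer WHERE customer_id = 42");
          long rows = rs.all().size();
          Duration elapsed = Duration.between(start, Instant.now());

          // The elapsed time includes the driver, the network and the cluster,
          // which is the whole delivery chain the end-user ultimately waits on.
          System.out.println(rows + " rows in " + elapsed.toMillis() + " ms");
        }
      }
    }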

Bearing in mind the unique performance challenges in managing Big Data environments, a successful implementation relies on a number of key considerations:

  • Large Hadoop environments are hard to manage and operate. Without automation of deployment, operations, monitoring and root-cause analysis, they quickly become unmanageable. Organisations must have a monitoring solution in place that proactively informs IT departments of any infrastructure or software issue that could affect their operation, and that can pinpoint the root cause of any performance problem in a timely fashion.
  • The easiest way to identify new performance issues is to detect and analyse change. Enterprises should therefore adopt a lifecycle and 24/7 production approach to APM, enabling them to notice changes in data and compute distribution over time. Given the sheer scale of applications in Hadoop environments, it is imperative to automate this process, as it is impossible to replicate a production system for testing. This approach to APM also allows IT departments to immediately pinpoint any negative knock-on effects introduced by a new software release.
  • Throwing more and more hardware at the problem is unlikely to remedy performance issues in Big Data applications. Although cheaper hardware can be used for Hadoop, it is still an additional cost, and additional infrastructure also increases operational drag: every node that is added makes traditional log-based analysis more complicated. Instead, IT departments should implement an APM solution that lets them understand and optimise MapReduce jobs at their core, reducing the time and resources needed to run them (see the sketch after this list).
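
By way of example, the sketch below shows two of the standard Hadoop configuration knobs that such an optimisation exercise might end with: compressing intermediate map output and enlarging the in-memory sort buffer. The values chosen are illustrative assumptions and would need to be validated against measurements from the real job.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.SnappyCodec;
    import org.apache.hadoop.mapreduce.Job;

    // Illustrative tuning for a shuffle-heavy MapReduce job. The settings shown are
    // examples only; the right values depend on measurements from the actual workload.
    public class TunedJobConfig {

      public static Job newTunedJob() throws Exception {
        Configuration conf = new Configuration();

        // Compress intermediate map output to cut shuffle traffic across the network.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);

        // A larger in-memory sort buffer reduces spills of intermediate data to disk.
        conf.setInt("mapreduce.task.io.sort.mb", 256);

        return Job.getInstance(conf, "tuned analytics job");
      }
    }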

In short, for Big Data solutions to perform and deliver on the promises made by vendors, a new approach to application performance management is needed: one that goes beyond log-file analysis and point tools. Companies must not get caught in the trap of thinking their traditional approaches to managing application performance will work.

Instead, they should seek out new approaches that can cope specifically with the architecture of dynamic, elastic Big Data environments. With this new approach, enterprises can make highly optimised Big Data implementations much easier to achieve, leaving them free to maximise their ROI from the interpretation and delivery of Big Data insights.