April 17, 2017

5 Reasons Your Data Lake Is Failing – And What You Can Do About It

Maurizio Colleluori looks at the five major reasons behind data lake failure, pinpointing what businesses need to do to get back on the path to success.

By James Nunns

The reality is that data lakes are failing to meet the time-to-market demands of analytics-driven innovation, and it is safe to say that in many companies data lakes are widely perceived as expensive and ineffective. So why is this happening?

In this article, we look at some of the common culprits that turn data lakes into data swamps, and offer some experience-based advice to help companies avoid data lake disasters.


Problem #1: Lack of Real Experience

Many data lake programmes suffer from a lack of real experience, with entire teams or departments exploring and testing Hadoop technologies for the first time. These teams are often disoriented by the very different paradigms and approaches, and by the novelty of tools and frameworks that have very little in common with well-known traditional technology stacks.

As a result, these programmes move very slowly: the implementation becomes complex and difficult, the business objectives quickly become obsolete, and the original excitement slowly fades away. At this point, many stakeholders begin to wonder whether the big data solution will ever take off and achieve the original goals.


What Can Companies Do About It?

In the endeavour to achieve data lake success, working with thought leaders is key. As with all emerging technologies, it is important to start with confidence and get in touch with the experts: the pioneers who have already accumulated a good amount of knowledge and expertise, who have failed in their first attempts and learned from those lessons to identify best practices and successful solutions.


Problem #2: Lack of Engineering Skills

Simply put, most data lakes today suffer from poor design and implementation. The shortage of software engineering talent, combined with the lack of Hadoop experience, is one of the root causes, as it leads to the hiring of inexperienced data engineers. At the same time, it is hard to identify the right skills for a good data engineer. Mastery of technologies like Spark, Kafka and HBase is certainly valuable, but it may not be enough when it comes to building a complex, well-engineered data lake and delivering a production-grade platform.

A lack of engineering skills often leads to data lake implementations with poor architectural design, poor integration, poor scalability and poor testability – all of which can lead to a level of instability that unfortunately only a full rewrite can fix.


What Can Companies Do About It?

Hiring solid software engineers is the best way to minimise this risk, or to recover from it quickly, so companies should invest in talented software engineers and train them on Hadoop technologies where necessary. It is much easier and faster to skill up a bright software engineer than to turn a Hadoop “certified” professional with no software engineering background into a good software engineer. Companies can also largely reduce the need for Hadoop engineers, and accelerate their programmes, by investing in a data lake management platform.


Problem #3: An Immature Operating Model

Particularly in the initial phase, the typical separation between IT and business can be a big obstacle. Data scientists tend to fall into business silos, while data engineers fall into IT silos. Yet a successful analytic solution relies completely on close collaboration between data scientists and data engineers.

Data scientists must use the tools made available by IT, while data engineers need to productionise and operationalise what the data scientists implement. Without an operating model that brings these two roles together, most prototyped analytic algorithms and models struggle to reach production.
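One lightweight way to bring the two roles together is a shared contract: data scientists implement prototypes against a small agreed interface, and data engineers can then schedule, monitor and serve any implementation without touching its internals. The Python sketch below is purely illustrative; the interface, the ChurnModel and its column names are assumptions, not a standard API.

```python
# Minimal sketch of a data-science/engineering contract.
# All names here (ScoringModel, ChurnModel, monthly_usage) are
# hypothetical, chosen only to illustrate the idea.
from abc import ABC, abstractmethod

import pandas as pd


class ScoringModel(ABC):
    """Interface agreed between data scientists and data engineers."""

    @abstractmethod
    def train(self, features: pd.DataFrame) -> None: ...

    @abstractmethod
    def score(self, features: pd.DataFrame) -> pd.Series: ...


class ChurnModel(ScoringModel):
    """Stand-in for a prototype a data scientist might hand over."""

    def train(self, features: pd.DataFrame) -> None:
        # A real model would be fitted here; a simple threshold
        # stands in for the trained state.
        self.mean_usage = features["monthly_usage"].mean()

    def score(self, features: pd.DataFrame) -> pd.Series:
        # Naive rule standing in for real predictions.
        return (features["monthly_usage"] < self.mean_usage).astype(float)


# Engineers can operationalise anything that honours the contract:
df = pd.DataFrame({"monthly_usage": [10.0, 55.0, 20.0]})
model = ChurnModel()
model.train(df)
print(model.score(df))
```

Because the production pipeline depends only on the interface, a retrained or entirely different model can be swapped in without rewriting the operational code.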

What Can Companies Do About It?

Companies need a robust operating model and governance structure; both are key components of any big data engine and ecosystem. Companies must strategically shape the organisational structure and operating model so that they positively support the implementation of the analytics solution. By facilitating constructive collaboration and close feedback loops, IT and business are brought closer together through the various phases of productionising use cases. Ensuring this model is deployed effectively within the business is a key success factor for best-in-class data-driven organisations.


Problem #4: Poor Data Governance

By definition, data governance “is a set of processes that ensures that important data assets are formally managed throughout the enterprise. Data governance ensures that data can be trusted and that people can be made accountable for any adverse event that happens because of low data quality”.

Poor governance is the reason for many failures. During the initial phase of a data lake implementation, there is often not enough focus on how to organise and control data. Given that data will be accessed by multiple users through several applications, governance is essential to building trusted solutions in which data quality and accountability both play an important role, especially in production systems.


What Can Companies Do About It?

Governance by design should be the practice. Phrases like “let’s first move all the data in, then we will decide what to do” should be banned. They reflect the bad habit of treating Hadoop as a simple storage system on top of which additional control mechanisms can easily be applied at a later stage. Governance can quickly become a complex task and should instead be analysed and implemented from the very beginning. Data should be moved into the data lake in a controlled and planned way; in this sense, the data lake is better thought of as a data reservoir.
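To make this concrete, governance by design can be as simple as refusing to land any data set that has no registered owner, schema and classification. The sketch below is a hedged illustration in Python; the catalogue, its fields and the dataset names are assumptions, not a reference to any real product.

```python
# Illustrative "governance by design" gate: no registration, no ingest.
# All names and structures here are hypothetical.
from dataclasses import dataclass


@dataclass
class DatasetRegistration:
    name: str
    owner: str           # the accountable person or team
    schema: dict         # column name -> expected type
    classification: str  # e.g. "public" or "confidential"


class LakeCatalog:
    def __init__(self) -> None:
        self._registry: dict[str, DatasetRegistration] = {}

    def register(self, reg: DatasetRegistration) -> None:
        self._registry[reg.name] = reg

    def authorise_ingest(self, dataset_name: str) -> DatasetRegistration:
        # Refuse any ingest that was not registered first.
        if dataset_name not in self._registry:
            raise PermissionError(
                f"'{dataset_name}' has no owner, schema or classification "
                "on record, so it must not be landed in the lake."
            )
        return self._registry[dataset_name]


catalog = LakeCatalog()
catalog.register(DatasetRegistration(
    name="sales_orders",
    owner="finance-data-team",
    schema={"order_id": "string", "amount": "double"},
    classification="confidential",
))
catalog.authorise_ingest("sales_orders")    # permitted
# catalog.authorise_ingest("random_dump")   # would raise PermissionError
```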


Problem #5: Missing Foundational Capabilities

There is a general tendency to underestimate the complexity of data lake solutions from a technical and engineering perspective. Every data lake should expose a good number of technical capabilities: self-service data ingest, data preparation, data profiling, data classification, data governance, data lineage, metadata management, global search and security. Hadoop distributions do make available several components that provide the tools and mechanisms to implement these capabilities, but they do not provide a complete implementation of the solution.


What Can Companies Do About It?

It is very important to build most of these foundational capabilities before ingestion starts, and this is a type of investment that is very often missing. Data should be moved into the lake in a considered way: while flowing in, it should be cleansed, validated, profiled, indexed, secured and correctly tracked by extensive metadata. Is this already happening in your data lake implementation? Are those capabilities standardised and available to all your data sources? Building them at a later stage can be a nightmare and lead to a lot of complex refactoring.
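The sketch below shows what “cleanse, validate, profile and track while flowing in” might look like as a single controlled ingest step. It uses PySpark given the article’s Hadoop context, but the paths, column names and metadata store are illustrative assumptions, not a prescribed design.

```python
# Hedged sketch of a controlled ingest: validation, profiling and
# metadata capture happen as the data lands, not as an afterthought.
# Paths and column names are hypothetical.
import json
from datetime import datetime, timezone

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("controlled-ingest").getOrCreate()

raw = spark.read.option("header", True).csv("/landing/sales_orders.csv")

# Cleanse and validate: keep only rows that meet basic expectations.
valid = raw.filter(
    F.col("order_id").isNotNull() & (F.col("amount").cast("double") >= 0)
)
rejected = raw.subtract(valid)  # quarantined rather than silently dropped

# Profile: record simple statistics alongside the data.
profile = {
    "dataset": "sales_orders",
    "ingested_at": datetime.now(timezone.utc).isoformat(),
    "row_count": valid.count(),
    "rejected_count": rejected.count(),
    "columns": valid.columns,
}

# Land validated data and its metadata together, so lineage and
# quality information exist from day one.
valid.write.mode("append").parquet("/lake/sales_orders/")
with open("/metadata/sales_orders_profile.json", "w") as fh:
    json.dump(profile, fh)
```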

Companies that face these challenges and apply proactive solutions are crossing the chasm and beginning to deliver data lakes at a fraction of the usual cost. These organisations are achieving their vision of user-centric data lakes that make high-quality data quickly and easily available, speeding up the big data journey.
