In today’s data-rich society, we are presented with the luxury of storing all data, both old and new, in a data repository and going with the flow. Until recent years, the process of storing and sorting data was limited to following the ETL (extract, transform, load) design philosophy, which led to transforming and summarising diverse data sets in order to populate data marts and data warehouses.
This process limited the storage of excessive data, as each attribute or entity in the data warehouse or data mart was carefully thought through, justified, and its usage clearly articulated; the definitive value of stored data was determined.
The Convenience
However, the new concept of a data lake allows vast amounts of diverse, raw data to be collected and stored under the presumption that, in the future, it will be needed to solve problems and deliver answers to questions we have not thought about yet; the perceived value of data. Once needed, the data lake will have the ability to organise the required data, know where it came from, and define its value.
This "just-in-case" design paradigm allows for greater speed and flexibility, which is the driving force behind this enterprise-wide data management platform. Concentrating on data ingestion and harmonisation, data lakes aim to store dispersed data quickly, at a low cost and without any constraints. This new concept addresses and attempts to solve two key issues in the data-management space, one old and one new.
The old problem it tries to solve is information storage and, with that, the cost of managing data. In contrast to traditional data-management processes that housed multiple independently managed collections of data, data lakes allow for the co-locating of these sources, which increases information use and sharing while cutting costs through server and licence reduction.
The new problem it attempts to solve is information access: ensuring that the numerous sources that result from all the stored raw data can be easily accessed once their potential for broader usage across the enterprise is identified. Essentially, it provides faster capabilities to find answers to questions we haven’t thought of yet, creating data capabilities to go after unknowns.
A data lake allows a business to store all its data, both structured and unstructured, in the lake and then provides a basis for users to apply their own thought and analysis approach, using whatever technology is best suited to the task, to create data analysis views specific to a business use case.
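This “apply your own structure at read time” idea can be sketched in a few lines of Python. The sample events, field names, and the page-view aggregation below are illustrative assumptions, not part of any particular product: raw, mixed-shape records land in the lake untouched, and one consumer imposes only the structure its use case needs.

```python
import json
from collections import Counter

# Raw, untyped events as they might land in the lake (illustrative sample).
# Note that records of different shapes co-exist side by side.
raw_events = [
    '{"type": "click", "page": "/home", "user": "a"}',
    '{"type": "click", "page": "/pricing", "user": "b"}',
    '{"type": "audio_upload", "file": "call-01.wav"}',
    '{"type": "click", "page": "/home", "user": "c"}',
]

def page_view_counts(events):
    """One consumer's view: count clicks per page, applying structure
    only at read time and ignoring event shapes this use case ignores."""
    counts = Counter()
    for line in events:
        event = json.loads(line)  # structure imposed here, not at ingestion
        if event.get("type") == "click":
            counts[event["page"]] += 1
    return dict(counts)

print(page_view_counts(raw_events))  # → {'/home': 2, '/pricing': 1}
```

Another team could run a completely different view (say, cataloguing the audio uploads) over the same untouched raw events, which is the flexibility the paragraph above describes.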
This leads to one of the many benefits a data lake brings to the data-management table: its ability to provide users immediate access to all kinds of data, minimising the dependency on the IT and Enterprise Data Warehouse teams, which in turn gives users the flexibility to shape the data however they want to meet their requirements.
A further benefit is that a data lake significantly reduces data movement, as all data simply streams into the lake and stays there. The data is also not limited to being relational or transactional, as the data lake can contain any type of data, whether it be clickstream, machine-generated, external, social media, or even audio, video, or text.
The data lake empowers business users, and in essence creates a "data democracy" within the enterprise, as it speeds up delivery, enables business users to test numerous hypotheses quickly, and ushers in new types of data and technologies that lower the cost of data processing while improving performance. With today’s big data technologies, organisations now have an economically attractive option to bring any and all data into a single, scalable infrastructure model.
The Chaos
With more and more data being added to the lake, while multiple active users simultaneously access it to create their own localised views, there’s a risk that the data lake may turn into a big data landfill. With hundreds or thousands of users, large volumes of data are acquired and/or created at the same time, and with likely little understanding of who else is using a given dataset and why, there’s an immediate challenge in managing and governing the data lake.
The current design philosophy around the data lake places less emphasis on automated metadata management, governance, lineage and traceability. In addition, since the data lake gives its users a data democracy, there will be more and more ad-hoc usage of data.
Hence it is important to develop views around usage patterns, so that when new users draw on the same data sources in the lake, they get an understanding of who has used these data sets before them, the purposes they were used for, a recommendation on the quality and relevance of the data, and whether previous users recommend the data sets for further analysis; in essence, creating a user feedback and rating mechanism.
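Such a feedback-and-rating mechanism could be sketched as below; this is a minimal in-memory illustration, and every class, field, and data set name here is a hypothetical assumption rather than an existing tool’s API.

```python
from dataclasses import dataclass, field
from statistics import mean

@dataclass
class UsageRecord:
    user: str        # who used the data set
    purpose: str     # what it was used for
    rating: int      # 1-5 quality/relevance score
    recommend: bool  # would they recommend it for further analysis?

@dataclass
class DataSetProfile:
    name: str
    records: list = field(default_factory=list)

    def log_usage(self, user, purpose, rating, recommend):
        self.records.append(UsageRecord(user, purpose, rating, recommend))

    def summary(self):
        """What a new user would see before touching the data set."""
        return {
            "previous_users": [r.user for r in self.records],
            "purposes": [r.purpose for r in self.records],
            "avg_rating": round(mean(r.rating for r in self.records), 1),
            "recommended_by": sum(r.recommend for r in self.records),
        }

# Hypothetical usage history accumulated on one raw data set.
clicks = DataSetProfile("web_clickstream_raw")
clicks.log_usage("analyst_a", "churn analysis", 4, True)
clicks.log_usage("analyst_b", "campaign attribution", 3, True)
print(clicks.summary())
```

In a real lake this catalogue would live alongside the data, be populated automatically from access logs where possible, and feed the discovery tools new users search with.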
Unless such functionality and best practices are built into the design and usage of the data lake, the chaos will become a hindrance to broader acceptance and usage of the data lake as an enterprise data asset.