Data Ingestion: Just Because You Can Doesn't Mean You Should

You shouldn’t start a data ingestion process just because you can, writes Computer Business Review’s Conor Reynolds.

Streaming data into your organisation (in batches or in real time) can have huge benefits, but only if you know why you want it.

Herein lies the first major pitfall that needs to be side-stepped when beginning a data ingestion process: the “because I can” approach.

(One risk is the “42” outcome: Deep Thought’s answer to the question “what is the meaning of life, the universe, and everything?” The answer, right as it may be, was meaningless because the question was never properly defined.)

Dr Leila Powell, senior data scientist at Panaseer, told Computer Business Review: “Typically… people ingest lots of data, invest lots of time, with no clear plan of how to get value from it. This results in a very comprehensive, but very low value data lake/swamp.”

Cleanliness is Next to Godliness

In many cases data is also incomplete, incorrectly formatted or littered with mistakes made when it was entered by hand.

As Tim Kaschinske, senior product manager at BridgeHead Software, puts it: “Typically, older systems allow users to enter data by hand and this often causes problems when ingesting it into a new system that expects data to be in a specific format and quality. Misspellings and differing date formats, for example, crop up time and time again, so it’s crucial to categorise data in order to ensure it’s well defined, resulting in a smooth ingestion.”
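
To make the date-format problem concrete, here is a minimal Python sketch of one way to bring hand-entered dates into a single format before ingestion. The list of formats is hypothetical and would need to be drawn up from the actual legacy source; the sketch also assumes day-first dates, so an ambiguous value such as 07/03/2019 is read as 7 March.

```python
from datetime import datetime

# Hypothetical formats observed in a legacy system; a real source needs its own list.
# Day-first order is an assumption, not something the data itself can confirm.
KNOWN_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%d %B %Y", "%d-%b-%y"]


def normalise_date(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue  # try the next candidate format
    return None  # leave unparseable values for manual review


if __name__ == "__main__":
    for value in ["2019-03-07", " 07/03/2019 ", "7 March 2019", "sometime in March"]:
        print(repr(value), "->", normalise_date(value))
```

Returning None rather than guessing keeps questionable values out of the new system until someone has looked at them.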

He adds: “[In] healthcare settings, data identifiers such as first name, last name, DOB and sex are commonly used to ensure that historic data belongs to the right patient. For healthcare organisations, the accuracy of data cleansing and reconciliation is arguably more critical than in other sectors due to the potential grave consequences arising from errors.”
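
As a rough illustration of that reconciliation step, the Python sketch below matches historic records to patients on the four identifiers mentioned. The field names are hypothetical, dates are assumed to be already normalised, and exact matching is a deliberate simplification: anything that does not match exactly one patient is set aside for review rather than guessed at.

```python
def match_key(record):
    """Build a comparison key from the four identifiers (field names are hypothetical)."""
    return (
        record["first_name"].strip().casefold(),
        record["last_name"].strip().casefold(),
        record["dob"],  # assumed already normalised to YYYY-MM-DD
        record["sex"].strip().upper(),
    )


def reconcile(legacy_records, patients):
    """Pair each legacy record with exactly one patient, or flag it for review."""
    index = {}
    for patient in patients:
        index.setdefault(match_key(patient), []).append(patient)

    matched, needs_review = [], []
    for record in legacy_records:
        candidates = index.get(match_key(record), [])
        if len(candidates) == 1:
            matched.append((record, candidates[0]))
        else:
            # Zero or several candidates: too risky to assign automatically.
            needs_review.append(record)
    return matched, needs_review
```

A misspelt surname, for example, ends up in the review list instead of being attached to the wrong patient.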

A Strong API Approach May be the Future

Currently, most approaches to data ingestion consist of pulling data from silos and moving it to data warehouses. This whole process is slow and mistakes can easily be made, especially if the data is being pulled from a host of sources. For instance, a retail-centric enterprise will have data coming in from its POS system, online sales, mobile app, warehouse metrics and employee data, and as IoT usage grows the list will only get longer.

Paul Crerand, director of solution consulting at MuleSoft, tells Computer Business Review: “There must be a mechanism in place to enable insights to be constantly refreshed with the latest data from multiple systems across an organisation, whether those be running on legacy infrastructure or in the cloud.

“This requires organisations to think about data as being across domains, rather than particular systems; each domain can pull data from one or more internal systems, as well as third-party aggregators, such as social media feeds.”

He believes the key to enabling a domain-driven approach to data ingestion is the modern API.

“A major benefit of implementing an API strategy is the access it grants across cloud, on-premises and hybrid environments, as well as the ability to connect to any system, data source or device. By taking an API-led approach and connecting together their applications, data and devices, organisations will organically grow their application networks and be in a far stronger position to derive actionable insights in real time and gain a more comprehensive view of their data.”

Data Ingestion in Real Time or in Batches?

Real-time or batches? The best approach to data ingestion depends, once again, very much on the use case. Dr Powell believes that the timescales involved can give a clearer view of what the right question is.

Considering timescales makes you ask: when do you need to review the data, and when do you need to use it? “If you’re creating a weather forecast, this [real-time] ingest frequency may well be appropriate,” Dr Powell notes.

If you are tackling data from legacy systems or silos, moving it in batches may be not just the right fit but the only option, unless you want an incomplete, incorrect digital mess.

Michael Noll, a technologist at Confluent, told us: “The devil that lurks in the detail is that legacy datasets often come in a range of divergent formats, which are often not well documented.

“These need to be transformed into a single lingua franca that is easier for different systems to consume. Stream Processors like KSQL are a useful solution to this type of problem. A query can be defined using a familiar SQL dialect and applied to transform datasets as they move through the organization, standardising them and protecting consuming applications from their original legacy form.”
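
The underlying idea can be illustrated without Kafka or KSQL. The Python sketch below is not KSQL and is not taken from Confluent; it is a plain generator that converts records from two hypothetical legacy shapes into one common format as they pass through, so that downstream consumers only ever see the standardised form. All field names are invented for the example.

```python
from decimal import Decimal


def to_common_format(events):
    """Yield each event in one common shape, whatever legacy shape it arrived in."""
    for event in events:
        if "cust_id" in event:
            # Hypothetical legacy shape A: numeric id, amount in pounds as a string.
            yield {
                "customer_id": str(event["cust_id"]),
                "amount_pence": int(Decimal(event["amt"]) * 100),
            }
        elif "CustomerNumber" in event:
            # Hypothetical legacy shape B: string id, amount already in pence.
            yield {
                "customer_id": str(event["CustomerNumber"]),
                "amount_pence": int(event["TotalPence"]),
            }
        # Unrecognised shapes are skipped here; a real pipeline would route
        # them somewhere visible instead of dropping them silently.


if __name__ == "__main__":
    stream = [
        {"cust_id": 17, "amt": "12.50"},
        {"CustomerNumber": "A-9", "TotalPence": "300"},
    ]
    for record in to_common_format(stream):
        print(record)
```

Because the function is a generator, records are transformed one at a time as they arrive rather than after the whole dataset has been collected.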

Taking data in batches gives teams time to clean and format it before integrating it into a new system. Ingesting data is also time- and compute-heavy: moving data in batches lets firms put the strain on systems during off-peak times, so that it doesn’t interfere with core business operations. (Lambda architecture, a big data management approach, works as a hybrid of real-time and batch data ingestion.)
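
For readers unfamiliar with the hybrid, the toy Python sketch below shows the general shape of a Lambda-style design: every event is kept in an append-only log, a periodic batch job recomputes a view from that log, a speed layer counts whatever has arrived since, and queries combine the two. The class and method names are hypothetical, and everything runs in one process purely for illustration; in practice the layers are separate systems.

```python
from collections import Counter


class LambdaCounts:
    """Toy event counter with a batch view and a real-time (speed) view."""

    def __init__(self):
        self.master_log = []         # append-only record of every event
        self.batch_view = Counter()  # recomputed from the log by the batch job
        self.speed_view = Counter()  # events seen since the last batch run

    def ingest(self, key):
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # Typically scheduled off-peak: rebuild the view from the full log,
        # then discard the real-time increments it now covers.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # Serving step: batch result plus anything newer than the last batch.
        return self.batch_view[key] + self.speed_view[key]
```

Queries stay up to date between batch runs, while the heavy recomputation can still be pushed to quiet hours.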

Real-time data ingestion in today’s world is largely where it’s at: organisations and firms want data insights that can be actioned almost instantaneously.

Traditional businesses are being disrupted by data-driven competitors crunching data to give users a smoother, faster experience.

“When today’s customers click buttons they expect things to happen there and then,” Michael Noll notes.

“This move to real-time data has driven hundreds of thousands of enterprises to put technologies like Apache Kafka at the core of their systems to collect, store, and process this real-time data. When you order a taxi, watch a blockbuster movie, or make a payment to buy some online goods, it is more than likely that Kafka will be making that interaction happen. None of these systems would be able to provide the same customer experience if they ran on 24-hour batch windows.”

Fine if you know exactly why you want that data. Enterprise strategy, then data management tactics, the experts largely agree…