The fevered hype around big data may have been a curse: an albatross around its neck that, for many, set lofty ambitions it could never live up to.

Shawn Dolley, Industry Leader, Health & Life Science at Cloudera.
The hype has now died down, but that doesn’t mean work in the area has stopped. Big data increasingly influences our daily lives in the background; many will never know it is hard at work improving our transport systems, our healthcare and a myriad of other areas.
It is thanks to big data, and to the technological advances made across the many fields the buzzword encapsulates, that scientists are getting ever closer to answering some of the most difficult questions ever asked.
Healthcare is the perfect example of where big data is being put to good use.
CBR’s James Nunns spoke to Shawn Dolley, Industry Leader, Health & Life Science at Cloudera about the challenges, the projects being undertaken and where the UK is excelling.
JN: The big obstacle in healthcare is around making sense of unstructured data, and putting it to work. Can you talk me through your perspective on the route forward?
SD: “There’s a huge amount of patient case data stored as narrative text (dictated and transcribed, or directly entered into the system by care providers). In some systems, even lab and medication records are only available as part of the physician’s notes, largely because the nature of conversation between the provider and patient tends to be personalised and might not lend itself well to a drop-down or multiple choice type of data capture.
“The volume of unstructured clinical data is set to grow over the next decade as more and more mobile transcription solutions are rolled out along with speech to text translation. Add to this the fact that any clinical organisation that’s been able to access clinical notes from their partners in the clinical continuum may now, for the first time, choose to keep and accumulate this content since they can actually collect and mine it—adding to the total size of this rich corpus.
“In addition to doctors’ notes, there is an influx of other types of unstructured data, driven by the decreasing cost of storage and compute. This includes medical images, long strings of proteins or genetic nucleotides, social media forum chat, and other new types of data collected via smartphones.
“While search alone against clinical notes has some value, today’s best-in-class organisations use specific purpose-based solutions built on big data projects and principles. Some look for a purpose-built solution with curated ontologies and business-specific metrics and workflow; others want a portfolio of software and support so they can build their own.
“What Cloudera finds most often is that health providers, health plan insurers, biotechs and health data processing companies are looking for technology partners who give them flexibility for the future.
“The truth is, we don’t know what the next healthcare data stream might be. Today we have researchers using Spark to figure out how to collect and mine video of every neuron of a zebrafish’s brain in real-time.
“Tomorrow, we won’t know what the unstructured data will be—the best we can do is give support to the best open-source, open-standard technologies, make sure they integrate with each other, and continue working with the organisations on the bleeding edge so we can provide excellent products for those on the leading edge.”
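To make the idea of mining free-text clinical notes concrete, here is a toy sketch that pulls medication mentions out of narrative text with a hand-written pattern. It is not any vendor’s actual pipeline: the note text and the drug lexicon are invented, and real systems rely on curated ontologies and clinical NLP rather than a regular expression.

```python
import re

# Invented mini-lexicon; real systems use curated drug ontologies.
DRUG_LEXICON = {"metformin", "lisinopril", "atorvastatin"}

# Matches patterns like "metformin 500 mg" in free text.
DOSE_PATTERN = re.compile(
    r"\b(?P<drug>[a-z]+)\s+(?P<dose>\d+)\s*(?P<unit>mg|mcg)\b",
    re.IGNORECASE,
)

def extract_medications(note: str) -> list:
    """Return structured medication mentions found in a free-text note."""
    found = []
    for m in DOSE_PATTERN.finditer(note):
        if m.group("drug").lower() in DRUG_LEXICON:
            found.append({
                "drug": m.group("drug").lower(),
                "dose": int(m.group("dose")),
                "unit": m.group("unit").lower(),
            })
    return found

note = ("Patient reports taking Metformin 500 mg twice daily; "
        "BP stable on lisinopril 10 mg.")
meds = extract_medications(note)
print(meds)
```

Even this crude version shows why the corpus is valuable: narrative text that a drop-down could never capture still contains recoverable structure.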
JN: Some of the big genomics topics of this year have included learning more about the Zika virus, and learning from Ebola. What part has big data played in this research work?
SD: “The role of big data in the fight against the Zika virus is around helping us to understand who has it, who will have it, what it is, how it is spreading, what happened, and how to avoid it in the future.
“It can even help us understand how to get vaccines to the right places before they’re needed. Unfortunately, each public health crisis advances our collective expertise in how to collect more data and make better use of big data techniques.
“Today, at the University of Texas at Austin, researchers have used Cloudera big data technology to build a computational pipeline that processes data to identify whether Zika is present in a water sample. This research site has ingested all of the publicly available Zika data sets, and has analysed air travel records to predict the most frequent movements of the virus.
“Big data is being used or can be used in virtually every part of viral epidemiology today. Sometimes this takes the form of the use of Spark as a coding framework for simulation models that might predict future outbreaks.
“Sometimes it takes the form of using Hadoop as the lowest-cost-per-terabyte for an active archive, which is key if we are collecting whole genomes of infected patients as well as viral genetic samples, images, social media outbreak signals and more. These data sets, especially when there is a molecular facet, become very large, and we don’t want to scale up the expense as we scale the volume.
“Sometimes new Apache open-source projects such as Record Service might be considered as a way to enable large scale extramural multi-tenant analytics of highly secure clinical data.”
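As a rough illustration of the simulation models mentioned above, the sketch below steps a textbook SIR (susceptible–infected–recovered) epidemic model in plain Python. The parameters are invented for illustration; in a Spark setting, many such parameter combinations would typically be run in parallel as separate tasks rather than one at a time.

```python
def sir_simulation(population, infected0, beta, gamma, days):
    """Return daily (S, I, R) counts for a deterministic discrete-time
    SIR model: beta is the transmission rate, gamma the recovery rate."""
    s, i, r = float(population - infected0), float(infected0), 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Illustrative run: R0 = beta/gamma = 4, one small city, four months.
history = sir_simulation(population=1_000_000, infected0=10,
                         beta=0.4, gamma=0.1, days=120)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("epidemic peaks on day", peak_day)
```

Sweeping `beta` and `gamma` over plausible ranges, one simulation per worker, is the kind of embarrassingly parallel workload a Spark coding framework handles well.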
JN: Typically this kind of research requires vast amounts of data. Does this mean that only the most common diseases and viruses are likely to be examined with a deep-learning approach?
SD: “Deep learning around growing diseases like Ebola and Zika is already being pioneered, but putting the insights gained to work requires a wealth of data. While the technology available from Cloudera is among the most advanced and powerful, data scientists need breadth and depth of data to draw meaningful insights, and this is where there is room to grow.
“This makes research into the more specialist, less widespread diseases inherently more challenging, but all the more important. We have found that almost always the first step for any researcher, student or clinician is to put as much data as possible, in its native format, into a ‘data lake’.
“Technology’s job is to remove barriers to mashing up this data, enabling professional and by-necessity data scientists alike to relate data sets in their analytics and draw out the insights in the raw data.”
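The “native format, thin metadata” principle behind a data lake can be sketched in a few lines: files of any type are stored unchanged, and only a lightweight index is maintained alongside them. The directory layout and field names below are illustrative, not any particular product’s design.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

def ingest(lake_dir: Path, source_file: Path, source_system: str) -> dict:
    """Copy a raw file into the lake unchanged and append its metadata
    to a line-delimited JSON index."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(source_file.read_bytes()).hexdigest()
    dest = lake_dir / (digest + source_file.suffix)
    shutil.copy2(source_file, dest)  # native format, no transformation
    record = {
        "path": dest.name,
        "source_system": source_system,
        "sha256": digest,
        "bytes": dest.stat().st_size,
    }
    with (lake_dir / "index.jsonl").open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Demo with a throwaway directory standing in for the lake's storage.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "note.txt"
    src.write_text("patient narrative")
    record = ingest(root / "lake", src, source_system="ehr")
    stored = (root / "lake" / record["path"]).read_text()
```

Because nothing is transformed on the way in, later analytics can reinterpret the same raw bytes however a new question requires.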
JN: How far away is a big-data-driven approach to healthcare from becoming the standard approach for all kinds of medical treatments and doctor–patient interactions?
SD: “In many ways we are already pioneering new treatment methods in the most digitally advanced era yet. Compared to 10 years ago, a lot has changed in how we aggregate, monitor and diagnose patient cases.
“Traditional healthcare IT solutions tended to be limited in scope and restricted to a particular source of data. Cerner’s EDH, however, is one example of how the healthcare industry is bringing together data from an almost unlimited number of sources to build a far more complete picture of any patient, condition or trend.

Cerner is using technologies such as Apache Kafka
“Cerner is the largest global provider of electronic health record technology. They happen to be the largest Cloudera healthcare customer, with over 2,000 nodes of Cloudera running against real-time data. Cerner offers a service to customers via their St. John’s Sepsis Agent solution.
“This real-time analytic algorithm constantly watches tens of thousands of patients around the world to see which ones are starting down a pathway toward blood infection (sepsis). Sepsis is the number one cause of death among patients in hospitals.
“By using big data, in this case with a huge ‘V’ for velocity, Cerner has to date saved over 2,800 lives by notifying clinicians, who cannot adequately monitor all their patients themselves, to watch for sepsis. Treating sepsis once it has been spotted early is much easier than the task of collecting, streaming and analysing the data and alerting clinicians.
“However, big data has made it possible for the first time ever to use technology to keep a closer watch on patients than a nurse or doctor or family member ever could.”
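The alerting pattern described above can be illustrated with a minimal sketch: a stream of vital-sign readings is checked as each arrives against simple SIRS-style criteria (two or more abnormal vitals). The thresholds are textbook SIRS values used purely for illustration; Cerner’s actual St. John’s Sepsis Agent algorithm is proprietary and far more sophisticated than this.

```python
def sirs_flags(vitals: dict) -> int:
    """Count how many textbook SIRS criteria a single reading meets."""
    flags = 0
    if vitals["temp_c"] > 38.0 or vitals["temp_c"] < 36.0:
        flags += 1
    if vitals["heart_rate"] > 90:
        flags += 1
    if vitals["resp_rate"] > 20:
        flags += 1
    if vitals["wbc"] > 12_000 or vitals["wbc"] < 4_000:
        flags += 1
    return flags

def monitor(readings):
    """Yield patient IDs to alert on, as readings stream in."""
    for reading in readings:
        if sirs_flags(reading) >= 2:
            yield reading["patient_id"]

# Invented sample stream; in production this would arrive continuously
# via a message bus such as Kafka rather than a Python list.
stream = [
    {"patient_id": "A", "temp_c": 36.8, "heart_rate": 72,
     "resp_rate": 14, "wbc": 7_000},
    {"patient_id": "B", "temp_c": 38.6, "heart_rate": 104,
     "resp_rate": 24, "wbc": 13_500},
]
alerts = list(monitor(stream))
print(alerts)
```

The value of the velocity ‘V’ is visible even here: the check runs per reading as it arrives, rather than in a nightly batch after the window for early intervention has closed.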