The fevered hype around big data may have been a curse: an albatross around its neck that, for many, set lofty ambitions it could never live up to.

Shawn Dolley, Industry Leader, Health & Life Science at Cloudera.
The hype has now died down, but that doesn’t mean work in the area has stopped. Big data increasingly influences our daily lives in the background; many will never know it is hard at work improving our transport systems, our healthcare and a myriad of other areas.
It is thanks to big data, and to the technological advances made across the many fields the buzzword encapsulates, that scientists are getting ever closer to answering some of the most difficult questions ever asked.
Healthcare is the perfect example of where big data is being put to good use.
CBR’s James Nunns spoke to Shawn Dolley, Industry Leader, Health & Life Science at Cloudera about the challenges, the projects being undertaken and where the UK is excelling.
JN: The big obstacle in healthcare is around making sense of unstructured data, and putting it to work. Can you talk me through your perspective on the route forward?
SD: “There’s a huge amount of patient case data stored as narrative text (dictated and transcribed, or directly entered into the system by care providers). In some systems, even lab and medication records are only available as part of the physician’s notes, largely because the nature of conversation between the provider and patient tends to be personalised and might not lend itself well to a drop-down or multiple choice type of data capture.
“The volume of unstructured clinical data is set to grow over the next decade as more and more mobile transcription solutions are rolled out along with speech to text translation. Add to this the fact that any clinical organisation that’s been able to access clinical notes from their partners in the clinical continuum may now, for the first time, choose to keep and accumulate this content since they can actually collect and mine it—adding to the total size of this rich corpus.
“In addition to doctors’ notes, there is an influx of other types of unstructured data, driven by the decreasing cost of storage and compute. This includes medical images, long strings of proteins or genetic nucleotides, social media forum chat, and other new types of data collected via smartphones.
“While search alone against clinical notes has some value, today’s best-in-class organisations use specific purpose-based solutions built on big data projects and principles. Some look for a purpose-built solution with curated ontologies and business-specific metrics and workflow; others want a portfolio of software and support so they can build their own.
“What Cloudera finds most often is that health providers, health plan insurers, biotechs and health data processing companies are looking for technology partners who give them flexibility for the future.
“The truth is, we don’t know what the next healthcare data stream might be. Today we have researchers using Spark to figure out how to collect and mine video of every neuron of a zebrafish’s brain in real-time.
“Tomorrow, we won’t know what the unstructured data will be—the best we can do is give support to the best open-source, open-standard technologies, make sure they integrate with each other, and continue working with the organisations on the bleeding edge so we can provide excellent products for those on the leading edge.”
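To make the idea of mining free-text clinical notes concrete, here is a toy sketch that pulls medication mentions out of narrative text with a hand-written pattern. It is not any vendor’s actual pipeline: the note text and the drug lexicon are invented, and real systems rely on curated ontologies and clinical NLP rather than a regular expression.

```python
import re

# Invented mini-lexicon; real systems use curated drug ontologies.
DRUG_LEXICON = {"metformin", "lisinopril", "atorvastatin"}

# Matches patterns like "metformin 500 mg" in free text.
DOSE_PATTERN = re.compile(
    r"\b(?P<drug>[a-z]+)\s+(?P<dose>\d+)\s*(?P<unit>mg|mcg)\b",
    re.IGNORECASE,
)

def extract_medications(note: str) -> list:
    """Return structured medication mentions found in a free-text note."""
    found = []
    for m in DOSE_PATTERN.finditer(note):
        if m.group("drug").lower() in DRUG_LEXICON:
            found.append({
                "drug": m.group("drug").lower(),
                "dose": int(m.group("dose")),
                "unit": m.group("unit").lower(),
            })
    return found

note = ("Patient reports taking Metformin 500 mg twice daily; "
        "BP stable on lisinopril 10 mg.")
meds = extract_medications(note)
print(meds)
```

Even this crude version shows why the corpus is valuable: narrative text that a drop-down could never capture still contains recoverable structure.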
JN: Some of the big genomics topics of this year have included learning more about the Zika virus, and learning from Ebola. What part has big data played in this research work?
SD: “The role of big data in the fight against the Zika virus is around helping us to understand who has it, who will have it, what it is, how it is spreading, what happened, and how to avoid it in the future.
“It can even help us understand how to get vaccines to the right places before they’re needed. Unfortunately, each public health crisis advances our collective expertise in how to collect more data and make better use of big data techniques.
“Today, at the University of Texas at Austin, researchers have used Cloudera big data technology to build a computational pipeline that processes data to identify whether Zika is present in a water sample. This research site has ingested all of the publicly available Zika data sets, and has analysed air travel records to predict the most frequent movements of the virus.
“Big data is being used or can be used in virtually every part of viral epidemiology today. Sometimes this takes the form of the use of Spark as a coding framework for simulation models that might predict future outbreaks.
“Sometimes it takes the form of using Hadoop as the lowest-cost-per-terabyte for an active archive, which is key if we are collecting whole genomes of infected patients as well as viral genetic samples, images, social media outbreak signals and more. These data sets, especially when there is a molecular facet, become very large, and we don’t want to scale up the expense as we scale the volume.
“Sometimes new Apache open-source projects such as Record Service might be considered as a way to enable large scale extramural multi-tenant analytics of highly secure clinical data.”
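As a rough illustration of the simulation models mentioned above, the sketch below steps a textbook SIR (susceptible–infected–recovered) epidemic model in plain Python. The parameters are invented for illustration; in a Spark setting, many such parameter combinations would typically be run in parallel as separate tasks rather than one at a time.

```python
def sir_simulation(population, infected0, beta, gamma, days):
    """Return daily (S, I, R) counts for a deterministic discrete-time
    SIR model: beta is the transmission rate, gamma the recovery rate."""
    s, i, r = float(population - infected0), float(infected0), 0.0
    history = [(s, i, r)]
    for _ in range(days):
        new_infections = beta * s * i / population
        new_recoveries = gamma * i
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((s, i, r))
    return history

# Illustrative run: R0 = beta/gamma = 4, one small city, four months.
history = sir_simulation(population=1_000_000, infected0=10,
                         beta=0.4, gamma=0.1, days=120)
peak_day = max(range(len(history)), key=lambda d: history[d][1])
print("epidemic peaks on day", peak_day)
```

Sweeping `beta` and `gamma` over plausible ranges, one simulation per worker, is the kind of embarrassingly parallel workload a Spark coding framework handles well.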
JN: Typically this kind of research requires vast amounts of data. Does this mean that only the most common diseases and viruses are likely to be examined with a deep-learning approach?
SD: “Deep learning around growing diseases like Ebola and Zika is already being pioneered, but putting the insights gained to work requires a wealth of data. While the technology available from Cloudera is among the most advanced and powerful, data scientists need breadth and depth of data to draw meaningful insights, and this is where there is room to grow.
“This makes research into the more specialist, less widespread diseases inherently more challenging, but all the more important. We have found that almost always the first step for any researcher, student or clinician is to put as much data as possible, in its native format, into a ‘data lake’.
“Technology’s job is to remove barriers to mashing up this data, enabling professional and by-necessity data scientists alike to relate data sets in their analytics and draw out the insights in the raw data.”
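The “native format, thin metadata” principle behind a data lake can be sketched in a few lines: files of any type are stored unchanged, and only a lightweight index is maintained alongside them. The directory layout and field names below are illustrative, not any particular product’s design.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

def ingest(lake_dir: Path, source_file: Path, source_system: str) -> dict:
    """Copy a raw file into the lake unchanged and append its metadata
    to a line-delimited JSON index."""
    lake_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(source_file.read_bytes()).hexdigest()
    dest = lake_dir / (digest + source_file.suffix)
    shutil.copy2(source_file, dest)  # native format, no transformation
    record = {
        "path": dest.name,
        "source_system": source_system,
        "sha256": digest,
        "bytes": dest.stat().st_size,
    }
    with (lake_dir / "index.jsonl").open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

# Demo with a throwaway directory standing in for the lake's storage.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    src = root / "note.txt"
    src.write_text("patient narrative")
    record = ingest(root / "lake", src, source_system="ehr")
    stored = (root / "lake" / record["path"]).read_text()
```

Because nothing is transformed on the way in, later analytics can reinterpret the same raw bytes however a new question requires.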
JN: How far away is a big-data-driven approach to healthcare from becoming the standard approach for all kinds of medical treatments and doctor–patient interactions?
SD: “In many ways we are already pioneering new treatment methods in the most digitally advanced era yet. Compared to 10 years ago, a lot has changed in how we aggregate, monitor and diagnose patient cases.
“Traditional healthcare IT solutions tended to be limited in scope and restricted to a particular source of data. Cerner’s EDH, however, is one example of how the healthcare industry is bringing together data from an almost unlimited number of sources to build a far more complete picture of any patient, condition or trend.

Cerner is using technologies such as Apache Kafka
“Cerner is the largest global provider of electronic health record technology. They happen to be the largest Cloudera healthcare customer, with over 2,000 nodes of Cloudera running against real-time data. Cerner offers a service to customers via their St. John’s Sepsis Agent solution.
“This real-time analytic algorithm constantly watches tens of thousands of patients around the world to see which ones are starting down a pathway toward blood infection (sepsis). Sepsis is the number one cause of death among patients in hospitals.
“By using big data, in this case with a huge ‘V’ for velocity, Cerner has to date saved over 2,800 lives by notifying clinicians, who cannot adequately monitor all their patients themselves, to watch for sepsis. Treating sepsis once it has been spotted early is much easier than the task of collecting, streaming and analysing the data and alerting clinicians.
“However, big data has made it possible for the first time ever to use technology to keep a closer watch on patients than a nurse or doctor or family member ever could.”
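The alerting pattern described above can be illustrated with a minimal sketch: a stream of vital-sign readings is checked as each arrives against simple SIRS-style criteria (two or more abnormal vitals). The thresholds are textbook SIRS values used purely for illustration; Cerner’s actual St. John’s Sepsis Agent algorithm is proprietary and far more sophisticated than this.

```python
def sirs_flags(vitals: dict) -> int:
    """Count how many textbook SIRS criteria a single reading meets."""
    flags = 0
    if vitals["temp_c"] > 38.0 or vitals["temp_c"] < 36.0:
        flags += 1
    if vitals["heart_rate"] > 90:
        flags += 1
    if vitals["resp_rate"] > 20:
        flags += 1
    if vitals["wbc"] > 12_000 or vitals["wbc"] < 4_000:
        flags += 1
    return flags

def monitor(readings):
    """Yield patient IDs to alert on, as readings stream in."""
    for reading in readings:
        if sirs_flags(reading) >= 2:
            yield reading["patient_id"]

# Invented sample stream; in production this would arrive continuously
# via a message bus such as Kafka rather than a Python list.
stream = [
    {"patient_id": "A", "temp_c": 36.8, "heart_rate": 72,
     "resp_rate": 14, "wbc": 7_000},
    {"patient_id": "B", "temp_c": 38.6, "heart_rate": 104,
     "resp_rate": 24, "wbc": 13_500},
]
alerts = list(monitor(stream))
print(alerts)
```

The value of the velocity ‘V’ is visible even here: the check runs per reading as it arrives, rather than in a nightly batch after the window for early intervention has closed.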