Historically, data science has been performed using tools that evolved over the last 30
years. These tools were designed to work on small sets of data and to perform limited calculations when compared to modern solutions. Over time, however, data scientists became experts in these tools and, as a consequence, the tools became THE way to do data science. But something happened along the way: data didn’t just become big; it became huge!
Imagine if you had tools and machines to surface mine precious stones, but were then suddenly asked to use the same tools to dig the Mponeng gold mine, which is 2.5 miles deep and has 236 miles of tunnels. How would the engineers – who spent years learning how to use their legacy tools – solve the problem? They would split into two camps. One group would insist that the existing, ‘known’ tools were fine if more effort and horsepower were applied, while the other would decide that a whole new set of methods and equipment was required. When this group of innovators went in search of new equipment, however, they would find a fresh obstacle: most legacy mining products haven’t been improved to scale to modern demands. They are simply not fit for purpose from a technical perspective, and so the innovators would have to find new, emerging tools.
Many companies are facing similar issues with data mining. Data is growing exponentially, yet scientists are relying on sampling and aggregated data because they insist on using legacy tools to solve a problem many times bigger than any they have seen before. Because of these limitations, many companies cannot mine at the scale and granularity needed to produce optimal insights.
Those that are open to changes in both methods and technology, however, are reaping the benefits.
Take, for example, a very large UK high street grocery retailer that wanted to reduce their losses on perishable foods. Using their legacy tools, they were able to model only at the level of product category and region, which took about 50 hours in total. With new tools, however, they could interrogate the data down to SKU and store level within 5 hours, a level of granularity that would not even have been possible with the legacy tools. With this level of detail, the company was able to set stock volumes for individual items at each store on a weekly basis. The additional insight saved them millions of pounds per year in reduced losses, and that was only possible because they were able to mine the data at a low level of detail and act on the results.
In another example, a large healthcare company wanted to assess the quality and efficiency of care delivered by doctors, using billions of rows of claims and related data. With legacy tools they had to split the project into 26 individual processing jobs, which took 6 weeks to run. Because of the work involved, they were only able to check the quality of patient care twice a year. With new tools, however, they ran the entire process in 7 minutes and now assess quality of care weekly, significantly reducing their risks and helping them quickly identify cost irregularities. They have since taken the project a step further: when their staff visit physicians, they use mobile devices to pull up individual doctor and practice scores on demand and provide real-time consulting to address issues.
In a final example, a marketing company wanted to micro-target their customers. Some companies give every customer the same offer. Many use basic segmentation, such as age or gender, to tailor offers, while more sophisticated companies use behavioural segmentation, an even more granular approach to targeting. The real value lies in micro-segmentation, which creates many segments, each with a tight grouping of behaviour. To segment at this level, the company needs to mine billions of rows of data and perform a very large number of calculations. Once the analysis is complete, the company will no longer target macro segments with semi-generic offers, but micro segments with very specific offers. They estimate that this fine-grained targeting will generate millions of pounds per year in revenue.
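To make the idea of behavioural micro-segmentation concrete, here is a minimal sketch in Python using scikit-learn. It is illustrative only: the file name, the behavioural features and the choice of mini-batch k-means are assumptions for the example, not the company’s actual method.

```python
# Illustrative micro-segmentation sketch; file, column names and cluster count are assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import MiniBatchKMeans

# Hypothetical per-customer behavioural features aggregated from transaction history.
customers = pd.read_csv("customer_behaviour.csv")
features = customers[["visits_per_week", "avg_basket_value",
                      "pct_promo_purchases", "days_since_last_visit"]]

# Standardise so no single behaviour dominates the distance calculation.
X = StandardScaler().fit_transform(features)

# MiniBatchKMeans scales to very large row counts; a few hundred clusters gives
# many segments with tight groupings of behaviour rather than broad macro segments.
model = MiniBatchKMeans(n_clusters=200, batch_size=10_000, random_state=0)
customers["micro_segment"] = model.fit_predict(X)

# Each micro-segment can now be mapped to a very specific offer.
print(customers.groupby("micro_segment").size().describe())
```

The same pattern scales to billions of rows when the feature table is built and clustered on a parallel platform rather than a single machine.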
In all of these examples, the miners shared two common traits:
- They were willing to break the “we do it this way” culture.
- They were willing to explore new technology solutions.
For them, it was clear that the value of analysing data at low levels of granularity was worth the cost associated with change.
For companies looking to improve their mining technology, there are a few places to start.
In-database technology, especially on parallel platforms such as Hadoop, Teradata and Actian Matrix, is a clear contender for mining very large sets of data. Performance results typically show a 10x or 100x improvement. For example, a large digital market analytics company performs market basket analysis on 486 billion rows of data in 17 minutes using in-database solutions, versus 30 hours using its old mining technology.
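The essence of the in-database approach is that the heavy computation runs where the data lives, so only a small result set ever leaves the platform. The sketch below illustrates the pattern with a simple pair-wise market basket query; the table and column names are assumptions, and sqlite3 merely stands in for whichever parallel SQL engine is actually used.

```python
# Illustrative in-database market basket sketch; the schema is an assumption,
# and sqlite3 stands in for a parallel SQL engine such as those named above.
import sqlite3

conn = sqlite3.connect("transactions.db")

# Co-occurrence counts are computed inside the database, so billions of
# basket rows never need to be pulled into the analysis tool.
sql = """
SELECT a.item_id AS item_a,
       b.item_id AS item_b,
       COUNT(*)  AS baskets_together
FROM   basket_items a
JOIN   basket_items b
       ON a.basket_id = b.basket_id
      AND a.item_id  < b.item_id
GROUP  BY a.item_id, b.item_id
ORDER  BY baskets_together DESC
LIMIT  50
"""

for item_a, item_b, n in conn.execute(sql):
    print(item_a, item_b, n)
```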
For computationally intensive data mining, the use of graphics processing units (GPUs) is on the rise. From a price-to-performance perspective, nothing outperforms GPUs. In one example, a grid computing system using the latest hardware performed 100 million pre-trade risk calculations in 5 minutes; two GPU cards did the same calculations in less than 13 milliseconds.
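To show why this kind of workload suits a GPU, here is a minimal sketch that values a large batch of simulated scenarios in one vectorised pass using CuPy, a NumPy-compatible GPU library. The payoff model, parameters and scenario count are assumptions for illustration, not the firm’s actual pre-trade risk logic.

```python
# Illustrative GPU risk-style calculation; the payoff model and parameters
# are assumptions, not a real pre-trade risk engine.
import math
import cupy as cp  # CuPy mirrors the NumPy API but runs the work on the GPU

n_scenarios = 10_000_000                     # kept modest for memory; scale up as the GPU allows
spot, strike, vol, rate, t = 100.0, 105.0, 0.2, 0.01, 0.5

# Simulate terminal prices and value a simple option payoff entirely on the GPU:
# one vectorised pass replaces millions of per-scenario calculations.
z = cp.random.standard_normal(n_scenarios, dtype=cp.float32)
prices = spot * cp.exp((rate - 0.5 * vol ** 2) * t + vol * (t ** 0.5) * z)
payoffs = cp.maximum(prices - strike, 0.0) * math.exp(-rate * t)

print(float(payoffs.mean()))                 # discounted expected payoff across all scenarios
```

Expressing the whole batch as array operations, rather than looping scenario by scenario, is what lets a couple of GPU cards stand in for a grid of conventional servers.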
The granular analysis of huge amounts of data using new methods and technology produces measurable results. To realise these benefits, companies need to shift from surface mining techniques and tools to those that allow deep exploration.