How many ‘v’s can one come up with to describe Big Data? In a 2001 research report, META Group (now Gartner) analyst Doug Laney described Big Data as being three-dimensional: increasing volume, velocity (speed of data in and out), and variety (range of data types and sources). Since then some have added another ‘v’ for veracity – arguing that Big Data is worthless without the data being accurate – and another for value (data’s value often declines over time, for example). But with all the hype around Big Data from the vendors trying to cash in on the new craze, its definition has become somewhat blurred. So let’s add another ‘v’: vague.
But despite the hype, Big Data is real. Technologies like the open source Apache Hadoop distributed file system are accelerating in popularity. 26% of organisations are already using it, with a further 45% considering it seriously, according to a survey by Hadoop data analysis firm Karmasphere. Another study, by research firm Techaisle among mid-market firms, found 18% are currently investing in Big Data, with an additional 25% planning to do so. Spending on Big Data was estimated to surpass $3.6bn annually by 2016 – and that’s just among mid-market firms.
Real or not, that doesn’t mean there aren’t firms out there making it sound like Big Data can magically, and cheaply, answer all of a business’s data analysis prayers. As data integration firm Informatica’s CEO Sohaib Abbasi put it when I met up with him this month, many of the Big Data players are making "remarkable claims" about their technology.
"If you look at the world of analytics it is a very complicated world," he told me. "There are six ways to do analytics. You’ve got traditional data warehousing from vendors like Teradata. You’ve got built-for-purpose analytic databases from the likes of EMC Pivotal Greenplum. You’ve got in-memory databases such as SAP HANA. There is agile BI from people like QlikTech. You’ve got Hadoop which is of course one of the most hyped technologies. And then you’ve got cloud web services like Amazon Redshift. Each one of those vendors is making some remarkable claims, that they can do it all," he said.
Either way, analytics is only one element in any Big Data project. Companies would be unwise to overlook a fairly obvious point: if you want to analyse data in Hadoop you need to get that data into Hadoop in the first place. Moreover you may want to keep that data relatively in sync with production data, and you may then want to get it out of Hadoop into another system at a later stage.
You may want to bring in data from multiple sources and combine it, and you may want to analyse it from multiple systems too.
We used to call this type of thing ETL – extract, transform and load. I’ll stick my neck out and make a prediction here: it won’t be long before we’re talking about a new acronym in the form of BDI, for Big Data Integration.
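For readers who haven’t come across ETL before, the pattern is simple enough to sketch in a few lines of Python. The file names and field names here are entirely hypothetical; this is just the shape of the extract, transform and load steps, not any vendor’s implementation:

```python
import csv
import json

def extract(csv_path):
    """Extract: read raw rows from a source system (here, a CSV export)."""
    with open(csv_path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Transform: clean and reshape each row before loading."""
    cleaned = []
    for row in rows:
        cleaned.append({
            "customer_id": row["id"].strip(),
            "spend_gbp": round(float(row["spend"]), 2),
        })
    return cleaned

def load(records, target_path):
    """Load: write newline-delimited JSON, a common Hadoop-friendly format."""
    with open(target_path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
```

In a real Big Data project each of these three steps balloons in complexity – dozens of sources, schema drift, late-arriving data – which is precisely the gap the integration vendors below are chasing.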
Integration companies like Informatica have realised there’s a big opportunity in Big Data. As its CEO says, "In this world there is growing demand for Informatica because we play a role. In fact one of the CIOs I met with said the world of analytics is a world of insanity. The only sanity is Informatica. It’s his way of saying that our neutrality reassures him. Because regardless of the ways they analyse data they can rely on Informatica to help them ensure they have trustworthy data, accurate data, timely data and holistic data."
Informatica is the gorilla in the data integration space but there are plenty of competitors. I also met the CEO and president of Actian, Steve Shine, this month. The company formerly known as Ingres has also realised that you can’t attempt Big Data without Big Data integration. He told me how the company has just launched two brand new platforms: the Actian DataCloud and ParAccel Big Data Analytics Platform, which it says will help customers tackle the challenges of what it calls the ‘Age of Data’.
The two platforms are claimed to cover the various data management and integration steps needed for effective Big Data analytics, from data connection through data preparation all the way to automated action.
"Companies can predict the future, prescribe the next best action, prevent damage to their business and discover hidden risks and opportunities," Shine said. "Business users are able to run sophisticated analytics and cooperative processing without deep data science skills."
Skills are a particularly thorny issue in the Big Data space – there just aren’t enough Hadoop experts to go around. Likewise, few will have the skills to extract, transform and load data into Hadoop clusters using manual coding. No surprise, then, that open source integration vendor Talend has a product called Talend Open Studio for Big Data.
It says it greatly simplifies the process of working with Hadoop: "With Talend’s open source big data integration software, you can move data into HDFS or Hive, perform operations on it, and extract it without having to do any coding. In the Eclipse-based Talend GUI, you simply drag, drop, and configure graphical components representing a wide array of Hadoop-related operations, and the Talend software automatically generates the corresponding code."
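To see what that drag-and-drop tooling is replacing, here is a hedged sketch of the manual route: shelling out from Python to the standard `hadoop fs -put` and `hive -e` command-line tools. The table and file names are made up, and the `dry_run` flag simply returns the command for inspection rather than requiring a live cluster:

```python
import subprocess

def hdfs_put(local_path, hdfs_dir, dry_run=True):
    """Copy a local file into HDFS via the standard `hadoop fs -put` CLI.

    The -f flag overwrites any existing file at the destination.
    """
    cmd = ["hadoop", "fs", "-put", "-f", local_path, hdfs_dir]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

def hive_load(table, hdfs_path, dry_run=True):
    """Register the uploaded file with a Hive table via LOAD DATA INPATH."""
    hql = f"LOAD DATA INPATH '{hdfs_path}' INTO TABLE {table}"
    cmd = ["hive", "-e", hql]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd
```

Even this toy version hints at the appeal of a GUI that generates the plumbing for you: error handling, retries and schema management are all still missing here.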
I also spoke to the recently installed CEO of Syncsort, Lonne Jaffe, about Big Data integration. While the company has been around for four decades, mostly doing mainframe data integration, sorting and protection, its new tagline is ‘Integrating Big Data… Smarter’. Jaffe told me that nearly all of the company’s Big Data integration customers have an ETL challenge: they are essentially moving data into Hadoop and then moving it out again for further analysis. Many want to be able to perform Big Data analytics on data that’s in the mainframe environment, and it just so happens Syncsort knows more than most about mainframe integration. You might call it a happy coincidence.
"We can get your mainframe data into Hadoop, but not just that we can bring in all the important metadata, and things like COBOL Copybooks," Jaffe told me. "We can pull mainframe data into your Hadoop cluster without even installing any software on your mainframe – and that’s something no one else can do, not even Informatica."
These aren’t the only companies talking about Big Data integration. From niche players like Composite Software right up to the giants like IBM, Big Data integration is gaining speed. Informatica, meanwhile, has even changed its company tagline from ‘The Data Integration Company’ to ‘Put potential to work’.
Its CEO, Abbasi, told me how the company’s latest product launch is right in the Big Data integration sweet spot. "What we did is launch the industry’s first virtual data machine, called Vibe," he said. "The value of Vibe is map once, deploy anywhere – so whatever you do using Informatica, you could map it once and deploy it on traditional middleware, deploy it to run on a database, or on Hadoop, or to run in the cloud, and you do it just once. There are no new skills required if developers are already using Informatica."
As proof points, Facebook uses Informatica for Hadoop, and all sorts of conventional businesses like Western Union are using it too.
"Informatica provides an abstraction between where the data resides and how it gets processed. Vibe provides near-universal connectivity from things like the mainframe right through to the internet of things," Abbasi said. "You map it once and underneath the covers we’ll deploy it anywhere you specify. It’s a unique differentiator that we can offer today."
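Informatica hasn’t published Vibe’s internals, but the ‘map once, deploy anywhere’ idea can be illustrated with a toy sketch: a mapping defined once as a declarative spec, then executed by two different ‘engines’. All the names and the spec format below are invented purely for illustration – this is the pattern, not Vibe itself:

```python
# A mapping defined ONCE, as data rather than engine-specific code.
mapping = {
    "source": "orders",
    "select": ["customer_id", "amount"],
    "filter": ("amount", ">", 100),
}

def deploy_in_memory(spec, rows):
    """Run the mapping directly on Python dicts (a 'local engine')."""
    col, op, val = spec["filter"]
    assert op == ">"  # this toy example supports a single operator
    return [{c: r[c] for c in spec["select"]} for r in rows if r[col] > val]

def deploy_as_sql(spec):
    """Compile the same mapping to a SQL string for a database engine."""
    col, op, val = spec["filter"]
    cols = ", ".join(spec["select"])
    return f"SELECT {cols} FROM {spec['source']} WHERE {col} {op} {val}"
```

The point is that the mapping itself never changes; each backend interprets it independently, which is the separation between "where the data resides and how it gets processed" that Abbasi describes.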
Just as cloud didn’t remove the challenges around on-premise data integration and management, Big Data is, if anything, creating more integration challenges of its own. So while technologies are certainly emerging that can help, the cost and complexity of the data integration portion of any Big Data project shouldn’t be underestimated. Big Data has more than enough ‘v’s already. It’s time more focus was put on the ‘i’ of Big Data: integration.