It may not be the best-known software company in the world, but half the world’s mainframes run Syncsort. The firm has been around for four decades, and now, backed by private equity, it sees a burgeoning opportunity to use its experience to get mainframe data into Hadoop for big data projects. CBR spoke with CEO Lonne Jaffe about big data integration.
Tell us a bit about your background prior to joining Syncsort.
I was with IBM for 13 years in a variety of operating roles. In my most recent role I was the vice president in the software group, leading a big part of the software mergers and acquisitions function, so I sourced and led more than half of IBM’s 2011 software acquisitions. I left in 2011 to go work for CA where I was running strategy for the company, both organic and inorganic strategy, so we did a couple of acquisitions and I managed the $3bn a year in organic spend and technical strategy work that we were doing.
The owners of Syncsort, a few large private equity companies, recently took the step of separating the business, which has a little more than $100m in revenue, into two pieces. So there is the data protection company, which is about 30% of the revenues, and the remaining piece, which I’m CEO of, is the data integration business. That is the mainframe software portfolio as well as this exciting new big data and Hadoop opportunity that we have.
Is the plan to sell the data protection business off altogether?
It’s for it to be a separate operating business. That could take a variety of forms… but with both businesses the goal is to have additional investment, so we’re not looking to do a specific investment structure in the near term, but it opens up all options there.
The major motivation is to be able to double down on the data integration business and invest more to grow organically the big data start-up that we have. The Hadoop product we have was launched only a couple of weeks ago – high performance acceleration which makes Hadoop easier to use and run a lot faster. The investors are looking to use the company as an anchor asset around which to do follow-up acquisitions.
My first impressions are that the technical team here is quite extraordinary, building industrial-scale data processing and taking that expertise to the world of big data. I was at a Hadoop summit a few weeks ago and it was really impressive to see the extent to which the Syncsort team was being embraced by the open source community. Cloudera is making us one of only two tier one partners. We’re particularly excited about our mainframe-to-Hadoop technology.
There’s a lot of hype around big data. What does it mean to you? Is it about the three v’s (volume, velocity and variety)?
The big data framework that we think about is 4 v’s – we add a v for value. If you think about the early adopters of big data, companies like Facebook and LinkedIn and Amazon, they’re consumer internet properties. Many of the concerns they had were solved by hiring armies of computer science PhDs to build custom software.
But as it’s crossing over and the technology is getting adopted by what I’ll call more grown-up companies like financial services companies or healthcare companies or the government, there is a whole class of challenges that the consumer internet companies didn’t have to deal with. These include compliance challenges and the need to procure enterprise grade software that makes Hadoop easier to use and can access the important repositories that hold your most valuable data, such as all of the data stored in mission critical mainframe repositories.
The history of Syncsort goes back to mainframe ETL…
The early product was a high performance sorting product on the mainframe. Sorting is a big part of what people are doing; if you look at the Hadoop MapReduce paradigm there’s the MapSort and ReduceMerge phases and they are both very sort-intensive. So a lot of the challenges that people had to deal with in the mainframe world, which were optimisation down to the I/O, memory and network throughput levels, are re-emerging in the world of Hadoop. Our experience from the mainframe, and also the lightweight architecture of the product, are very well suited to Hadoop.
It’s able to sit there resident on every node in the Hadoop cluster and speed up the processing of data tremendously while at the same time making it a lot easier to use, because you can create all the MapReduce jobs with a simple GUI instead of having to use tools like Pig and Hive and write that kind of code.
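To illustrate why the MapReduce paradigm is described above as sort-intensive, here is a minimal Python sketch of a word count in which each map task sorts its output by key and the reduce side merges the sorted runs. It is not Hadoop code and not Syncsort's engine; the function names `map_phase` and `reduce_phase` and the sample input are assumptions made for illustration only.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def map_phase(lines):
    """Emit (word, 1) pairs for one input split, then sort them by key."""
    pairs = [(word.lower(), 1) for line in lines for word in line.split()]
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_runs):
    """Merge the already-sorted runs from all map tasks, then sum per key."""
    merged = heapq.merge(*sorted_runs, key=itemgetter(0))
    return {
        key: sum(count for _, count in group)
        for key, group in groupby(merged, key=itemgetter(0))
    }


# Two hypothetical input splits, each handled by its own map task.
splits = [["big data big"], ["data integration"]]
runs = [map_phase(split) for split in splits]
counts = reduce_phase(runs)
```

Both phases spend most of their work ordering records, which is where sort and merge performance starts to dominate as data volumes grow.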
Is there a trade-off between ease-of-use and performance?
Typically there is a trade-off where you either do custom coding for better performance but it’s hard to maintain, or you use a GUI and end up doing code generation which in the world of ETL [extract, transform and load] is very inefficient. The nice thing about our architecture is that it’s a tool-based environment so it’s very productive but at the same time it’s much higher performance even than custom code – dramatically higher performance than Pig or Hive.
Do you have any benchmarks to back that up?
Yes, we’ve been doing performance benchmarks like TeraSort on the Hadoop cluster that we have in-house. We also have a dozen or so proofs of concept running at large banks and large telecommunications companies – they have industrial-scale Hadoop clusters and they want to find the technology that will perform better. We’re seeing orders of magnitude improvement over code generation techniques. In many cases it’s as high as 10x, and certainly at least 3x. And depending on the workload, we see a 50% to 100% improvement in performance over custom coding techniques.
Another problem with custom coding in the world of Hadoop is that you’re typically writing Java code, which means spinning up a Java virtual machine, and that is relatively inefficient. Even very good programmers are not able to write Java code that performs better than a C engine that’s been optimised over a very long period of time down to the chip, I/O and throughput level, and that can run natively on every node in the cluster.
We can get your mainframe data into Hadoop, but not just that, we can bring in all the important metadata, and things like COBOL Copybooks. We can pull mainframe data into your Hadoop cluster without even installing any software on your mainframe – and that’s something no one else can do, not even Informatica.
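As a rough illustration of why the copybook metadata matters, the Python sketch below decodes one fixed-width mainframe record. It assumes a hypothetical two-field layout (a 10-byte `PIC X(10)` name followed by a 4-byte `PIC S9(5)V99 COMP-3` balance) and assumes the text is encoded in EBCDIC code page 037; it does not show how Syncsort's product works.

```python
from decimal import Decimal


def unpack_comp3(raw: bytes, scale: int) -> Decimal:
    """Decode a packed-decimal (COMP-3) field.

    Each byte holds two 4-bit digits; the final nibble is the sign
    (0xD means negative, anything else is treated as positive here).
    """
    nibbles = []
    for byte in raw:
        nibbles.extend((byte >> 4, byte & 0x0F))
    sign = nibbles.pop()
    value = int("".join(str(digit) for digit in nibbles))
    if sign == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


def parse_record(record: bytes) -> dict:
    """Split a record using the assumed (hypothetical) copybook layout."""
    name = record[0:10].decode("cp037").rstrip()      # PIC X(10), EBCDIC text
    balance = unpack_comp3(record[10:14], scale=2)    # PIC S9(5)V99 COMP-3
    return {"name": name, "balance": balance}


# A sample record built by hand: EBCDIC name plus packed value +12345.67.
record = "ACME CORP ".encode("cp037") + bytes([0x12, 0x34, 0x56, 0x7C])
parsed = parse_record(record)
```

Without the layout from the copybook, the same 14 bytes are just opaque binary, which is why bringing the metadata across with the data is useful.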
You have said you will look at inorganic growth. Which areas might you look at?
There are a lot of new functions that exist in the world of Hadoop that weren’t really relevant if you were Facebook or LinkedIn where you’re dealing with Tweets and status updates. Whereas if you are storing sensitive financial data, or patient data, or terrorism data on low-end commodity servers connected to the internet, there’s a whole array of interesting functions, like encryption, masking and data level protection, that become increasingly relevant. Those are some of the areas that we’re looking at inorganically now that we have this acquisition mission.
Have you started acquisition talks with anyone already?
Actually we have. That was one of my motivations for coming to Europe. We have more talks next week so we’re very active. That’s one of my top priorities right now. We want things that won’t complicate our ability to partner – we don’t want to alienate any partners – and it also needs to be ‘near adjacent’ and provide us with something that will continue to be high value for many years, not something that’s going to be commoditised by ‘ankle-biter’ competitors in the next couple of months.
I read one Gartner blog that suggested lots of Hadoop projects lack clear direction. What are you seeing?
I’ve been seeing both – folks who have built Hadoop clusters and are then looking for what the use cases are, and others that are targeting specific use cases. Some are offloading capacity from a more expensive data management platform, others are starting Hadoop to do something that they just couldn’t do before because they didn’t have the ability to process the vast scale of data.
70% of what people are doing with Hadoop clusters today is ETL; then after ETL a whole bunch of use cases including ones like management and security software – gathering up lots of data and then gaining insights from it. That’s where you’re seeing companies like Splunk and Sumo Logic, which is a cloud-based version of Splunk, being so successful. They’ve basically built big data infrastructures and then skinned those to address the management software domain.
Why are companies using Hadoop for ETL?
A lot of companies using Hadoop for ETL don’t even think of themselves as using it in that way. The nature of ETL is that it basically involves taking data from somewhere, processing it and then loading it somewhere. For some people the ultimate destination for their data, to do analytics on it, is Hadoop. A number of products are coming online that allow you to query your Hadoop repository directly. Even in those cases you need to do a lot of processing of the data before it lands in the Hadoop distributed file system. What you’re doing in the transform stage there is merging and sorting, and aggregations.
Sometimes you’re using Hadoop to extract data from somewhere, let’s say your operational databases, then do some processing on it, say using MapReduce – because it’s a high-performance, low-cost platform – and then go and load that to another platform like Teradata where you might do your ultimate analytics.
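The extract, transform and load pattern described above can be sketched in a few lines of Python. This is only a toy, single-process illustration: the in-memory lists stand in for an operational database and a target warehouse, and the names `extract`, `transform`, `load` and the sample rows are assumptions rather than any product's API.

```python
from collections import defaultdict


def extract(source_rows):
    """Pull raw (account, amount) rows from the source system."""
    return list(source_rows)


def transform(rows):
    """Aggregate amounts per account, then sort the result by account id."""
    totals = defaultdict(float)
    for account, amount in rows:
        totals[account] += amount
    return sorted(totals.items())


def load(rows, target):
    """Append the processed rows to the target store."""
    target.extend(rows)
    return target


# Hypothetical operational data and an empty target standing in for a warehouse.
operational_rows = [("acct-2", 10.0), ("acct-1", 5.0), ("acct-2", 2.5)]
warehouse = []
load(transform(extract(operational_rows)), warehouse)
```

In the scenario described, the transform step is the part that would run as distributed jobs on the Hadoop cluster, with the load step writing to HDFS or to another platform.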
And why do the processing in Hadoop and not just in Teradata?
It’s expensive. Trying to do ELT – extract the data, load the data and then transform the data – works only with datasets that are not even modestly large. With big data the cost starts to become astronomical, and when you have to choose between spending an extra $5m to add Teradata capacity to handle your increasing data workloads, or spending the same amount of money building a Hadoop cluster which also allows you to do a whole array of additional things, that financial dynamic becomes a really compelling ROI to build Hadoop. It also allows you as an organisation to create skills and do a whole array of other things, not just what’s possible today but all the things that are coming online every month. It’s a really exciting time.