You founded scalable recommender system Myrrix in 2012 before it was acquired by Cloudera this year. How did that come about?
I’m an engineer by trade. I worked with Google from about 2004 to 2008, so I got used to MapReduce, got used to the big data thing before it was a big thing. Along the way, my background’s also computer science and machine learning, so I started working on this open source project that eventually became Apache Mahout, which is a large-scale machine-learning project in the Hadoop ecosystem.
It kind of reached its peak about two years ago. It’s hard to make that get-up project go any further. But the demand wasn’t going away and certainly the market started to catch up.
Hadoop was only one piece of the puzzle. There’s little point if you build models and don’t do anything with them, so we figured why wait, let’s just get started. So we spun out, started a small company that commercialised a sort of rewrite of part of Mahout, adding real-time capabilities on top of it as a nicely packaged product. And sure enough there’s a demand for that, and it’s starting to be something that resonates with customers that understand Hadoop and now want to take it to the next level.
Happily that was also just about exactly what Cloudera was wanting to get into a bit now, so it made a lot of sense to join forces.
Did they approach you?
I would say so, although for a long time I had known who Jeff Hammerbacher (Cloudera’s chief scientist) is, and I know Josh Wills, the other data scientist on the west coast. So it’s a small world. I knew who they were and they knew who I was for quite a while.
You must’ve been pretty successful in that first company though?
In the grand scheme of things, it was pretty early, pretty small. I think it was the right place at the right time. But it didn’t get very far.
What have been the differences between working for a smaller start-up company and the slightly more established Cloudera, which was founded in 2008?
Cloudera’s a couple of hundred people now, but it’s still not very big. It’s certainly bigger than working on your own and then with one other person, it’s still quite different. I think it’s a happy medium; I’m used to working in larger companies, like Google. When I joined Google it probably had 3,000 employees and when I left it was probably 20,000. So this is still small by my standards.
So what have been the main differences for you? Has it been easy to adjust?
Yeah, I think it has. Part of the reason that’s true is even Cloudera is not totally sure how to think about data science. What that means, what the market wants etc. So there’s a couple of us in the company doing data science and it seems to be positive, but it’s a bit of a blank canvas. It’s a small enough company and it’s growing and people are happy to let you go off and explore this new patch of market called data science and figure something out.
So what exactly is your job and what are the challenges that you’re facing?
Director of data science, so in practice what that has meant in the past is really an advisory role. Cloudera sells support and training around Hadoop. We don’t really get into building applications for customers, but we will provide some professional services. So a customer comes in and says ‘we’re a customer, we got all the data and now what?’ We can go in and selectively advise about how to do that. There’s a couple of us, so we can’t really talk as much as we’d like with every single customer. We’d love to get stuck in with some interesting problems to solve but there’s just not the time.
So I think the concern has turned to trying to build software, build infrastructure that could be repeatedly used by customers to accomplish these things, so we’ve turned a little more internally towards software development.
Big data has been around for a long time, but the term itself is one of these new buzzwords that people didn’t use that many years ago. Do you think your clients really understand what it is or just think of it as a buzzword now?
Big data was our term, there’s always been big data and we’ve always had a bit more data than usual to deal with. I guess what isn’t new is the sheer volume. Things have always gotten bigger, but the internet and the amount of data you can collect if you want has really gone through the roof. The real difference now is, I think, that before, data science wasn’t really treated as something that interacted with your operations. Scientists would come in, sample some data, go off to the lab and tell you something. Now people are engaging and they expect that all the stuff is operational. That’s something I’m not sure people have their head around just yet. But that’s actually pretty new.
So they’re coming to market, they’ve got this data, and they ask can I use SaaS to operationalise learning on this data? Not really, it’s a bit harder. That’s not what those things were for and that’s actually not a good solution yet. So that’s the gap we’re trying to address.
What kind of problems are companies dealing with as far as big data is concerned?
I think there are some obvious use cases you’d expect. There are commerce companies that want to segment their customers, they want to target customers, so that’s common. I can think of three other customers in finance that are trying to find anomalous data, usually for fraud protection. At least one of our guys is deeply involved in bio-tech and medical applications, which is something I actually don’t know a lot about. So it’s a little all over the place. I think part of our mission is to help people converge.
So when customers approach you and they tell you about the challenges they’re facing, what is it you guys do for them?
Right now most of our business is just the selling and support and training on Cloudera. If someone says that they’re interested in data science, we’ll typically get involved just to advise at first. There’s actually not that many customers that are really ready to operationalise learning. That’s their goal. They’re dreaming big but they’re really still in their early stages. But there are a handful that are truly ready.
Generally speaking, what kind of top tips could you give companies about managing their data in the context of data science and learning?
Now, often what we see is specific problems: here’s the data and here’s the problems we want to solve. But if I were trying to come up with some general principles for people who were interested in collecting data, I think probably the over-arching advice is to start collecting data now. In a year, if you have your learning system ready to go, you don’t want to start learning with zero there. You want to have learned on data for as long and as fast as you can. So start now, start getting all your data out of systems that are hard to access and into something like Hadoop.
That’s the key prep work I would recommend to companies if they’re interested in starting out down this route. Once you have the data marshalled up and you’ve got people who understand it, the rest isn’t actually that hard. The hard part really is getting the data out. There’s a bit of difficulty in building the system and things like that, but getting the data out is really the challenge.
So what are the plans for Cloudera in the next year or so?
At the least, we keep doing what we do. For data science, we’re certainly gonna keep doing what we do, which is giving advice, supporting our solutions architects as they help customers. I’d like to get to the point when we can release software. It would be great if we could put together the kinds of things you saw with Myrrix, which we have internally, really do that properly, really be able to offer that to partners and customers who want to build on top of it. So they’re not trying to do it all from scratch. Certainly nothing imminent but I would like to have something more tangible to offer, like software, to our customers.
What are the main target markets that you’re working with?
For Cloudera, there’s no particular vertical focus. The demand for data science is coming in through finance and bio-tech – retail as well. I think that’s going to continue to be the case. These are the people with money on the line, resources to invest in data storage and processing, and time to build these projects.
What geographical markets are you primarily dealing in?
Cloudera’s headquarters is based in Palo Alto, California. There are offices in a lot of places, such as San Francisco – it’s a US company for sure. But our biggest office in Europe is in London, in Tech City. There are about 20 of us now, mostly sales.
So are you getting more business in Europe now?
Yes, definitely. Mostly because up until a couple of months ago there wasn’t nearly as much presence here, that’s only happened quite recently. It’s probably true that there’s a little bit of a lag between Europe and the US, so Cloudera has figured out the market in the US and now it’s quite naturally the time to copy that into Europe.