Paperspace COO Daniel Kobran on AI, GPUs, ONNX and Containers

AI is extremely powerful, but it’s just too complex for the average developer: infrastructure bottlenecks, the lack of a common deep learning framework and more all conspire to make it highly specialised terrain.

Brooklyn, New York-based Paperspace is one of a growing number of startups making a play for an arena largely dominated by major cloud providers. Its aim is to abstract away infrastructure and support many different processing architectures and environments (CPUs, TPUs, GPUs, pre-emptible instance types, the big clouds), with an emphasis on good user experience.

Computer Business Review fired some questions at Paperspace co-founder Daniel Kobran, whose company provides an Infrastructure-as-a-Service offering for performance-intensive design, visualisation, and AI apps.

You’re in a competitive marketplace. What’s your offering?

We created Paperspace to abstract the complexities of GPU compute so developers could overcome these bottlenecks and focus on development.

GPUs have unlocked new possibilities and workflows for the developer community, but these new applications do not necessarily fit well with the web services/devops model. Researchers and developers were spending most of their time managing infrastructure instead of developing deep learning models. Our main product offering, Gradient°, aims to put Facebook-grade AI tooling in the hands of every developer.

How often do you find enterprises promising AI-ready data don’t actually have it ready/available?

Access to clean, usually structured data is one of the first steps to building out a true AI pipeline. We see many companies using public datasets, retraining/refitting existing models on their own (smaller) dataset, or even going through data brokers to buy the datasets that they need. Most companies today actually collect a lot of data, but the process of making this data usable, in the sense that an AI algorithm can operate on it, is a huge challenge for even very technically sophisticated companies.

What are the major challenges around developing AI / ML applications?

The challenge we’re addressing is not unlike the early days of Web development. It’s easy to find compute today given the ready availability of GPU farms on Amazon and TPU installations on Google Cloud. The missing component is software. This is a serious challenge that many people are encountering.

We’ve witnessed first-hand how difficult it can be to develop a single AI model and then deploy/embed it into an existing business process.

For example, a junior AI engineer can write 15 lines of code in Facebook’s PyTorch ML framework and gain access to cutting-edge research. But to pull it all together with the other code needed, you end up “home-rolling” it yourself, finding multiple open-source tools and then hacking them all together.

What are developers looking for in AI / ML development tools today?

Ease of use rises to the top of the list for most developers. Second is a common framework, which is still being defined and remains a major challenge today.

The number of tools available is rising exponentially and they don’t always play well together. PyTorch was not even on the radar when we first got into the machine learning universe and then, out of nowhere, it kind of blew up. NVIDIA GPUs dominate today, but it’s inevitable that newer architectures will come out, whether from NVIDIA, Intel, or Graphcore. It has been described as a “Cambrian explosion” of tools and we believe that description is still accurate today.

How easy is it for developers today to get started developing AI / ML applications?

Easy to get started but not easy to make production-ready. Getting started with deep learning today entails setting up infrastructure like Kubernetes, a job queuing system and tools to version data and models, among other painstaking tasks.

Even just getting Jupyter up and running and making sure you are not getting charged hundreds of dollars for GPU instances can be a challenge.

The problem is not access to hardware; that’s plentiful. The problem is the software and common frameworks to tie it all together.

What role do the big clouds – AWS, MSFT, Google – play here? Can you compete?

AWS, Google and MSFT all have one thing in common: they are powerful platforms that add complexity to an already complex process. AWS has SageMaker; MSFT has Azure Studio; and Google has AutoML. All three platforms claim to be easier to use, but are really designed for developers at large corporations with ample resources (massive DevOps and software teams), not for the average developer. Not only that, but these platforms lack the on-premise and multi-cloud support enterprises need today.

AI researchers and experts in areas like stats, TensorFlow, GANs, etc., can spend 95 percent of their time managing infrastructure instead of developing deep learning models. This is ludicrous considering AI research is one of the fastest-growing professions. As mentioned, just to get started with deep learning today, there’s a huge number of painstaking tasks required. This is fine for large companies, but not feasible for small shops or independent developers who are looking to build ML pipelines.

Why are GPUs so exciting?

GPUs can provide a type of parallel computing that wasn’t readily available in the past. The chips, which provide high bandwidth to address large data sets, are becoming widely adopted for use in training deep learning neural nets. It’s a whole new ballgame for just about everyone, including software developers. Ultimately, we may be entering a golden era of hardware, but it means taking on a lot of infrastructure management.

GPUs are just part of the complexity to come; over the next couple of years, the number of heterogeneous hardware options will increase, which is also a good reason to employ a technology architecture that will abstract the hardware away.

Anyone interested in ML can spend enormous amounts of time building the pieces and keeping them together. We’re committed to finding ways to build out that infrastructure as part of a cloud service.

Among deep learning frameworks, Google’s TensorFlow jumped out to an early lead, but recently PyTorch has emerged and is now right there with TensorFlow. Along the same lines, NVIDIA GPUs and the associated CUDA tooling are in high demand at the moment, but other processor types are currently in the mix, or at least on the drawing board.

What would you say are the top trends in AI / ML development in 2019?

We’re intrigued by trends in machine learning to abstract away from server infrastructure. Something we’re watching very closely is the emergence of ONNX, the industry effort to define a common translation model between different machine learning frameworks. It’s definitely a time of tremendous ferment in the industry.

The appropriate level of abstraction is largely undefined today in terms of the division of resources between GPU and containers and machine learning frameworks and software tools.

We also believe 2019 will be the year AI makes an impact in the enterprise. Things are changing quickly. Companies are popping up left and right tackling core business problems like sales, marketing and more. What was mostly a speculative project even a year ago has become increasingly relevant for companies that need to maintain a competitive edge. We’re also confident that AI Developer will be a specialized role, and one in high demand. AI is becoming a core part of every business, to the point where the AI developer role will be a critical position at every enterprise. These new developers specialize in newer techniques such as deep learning and will sit close to the Data Scientist, Data Engineer and BI teams to help deliver real ROI on big data projects.

“Just to cite one real-world example of how ML is catching on, machine learning (CS 229) is now the most popular course at Stanford”

It’s also clear that 2019 will be a make-or-break year for DevOps – a topic of discussion for years now with little in the way of results. But AI is creating a demand for even more agile development. This is an opportunity for DevOps to shine. Taking models from R&D to production will be the name of the game. DevOps teams that can build quick turnaround for AI teams will prove their value in the org.

In 2019, Kubernetes will also take AI development to new heights. We view containers and AI as a match made in heaven. AI applications are hard to build and consume a lot of resources. Containers are easy to scale, they’re portable across a range of environments — from development to test to production — and, therefore, enable large, monolithic applications to be broken into targeted, easier-to-maintain microservices.