Machine learning (ML) is hard, and it’s messy. It’s hard to move models to production because deployment environments are so diverse; it’s hard to track which parameters, code, and data went into each experiment to produce a model; and in most businesses it’s generally something Talked About more than Done.
As a result, Big Tech™ has been building internal machine learning platforms to manage the ML lifecycle. Facebook, Google and Uber, for example, have built FBLearner Flow, TFX, and Michelangelo respectively to manage data preparation, model training and deployment in contained environments.
Even these, as San Francisco-based Databricks’ Matei Zaharia puts it in a blog post today, are limited: “Typical ML platforms only support a small set of built-in algorithms, or a single ML library, and they are tied to each company’s infrastructure. Users cannot easily leverage new ML libraries, or share their work with a wider community.”
The Keys to the ML Castle?
San Francisco-based data streaming specialist Databricks has a track record of trying to democratise difficult tech (see our recent interview with CEO Ali Ghodsi) and today it’s releasing not just that pleasingly lucid blog post, but the alpha version of a new open source, cloud-agnostic toolkit designed to simplify the ML workflow.
The toolkit, called MLflow, allows organisations to package their code for reproducible runs and execute hundreds of parallel experiments, across any hardware or software platform. It integrates closely with Apache Spark, scikit-learn, TensorFlow and other open source ML frameworks.
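To make the idea of parallel, recorded experiments a little more concrete, here is a minimal Python sketch. It deliberately does not use MLflow’s own API: the `train` and `run_experiments` functions, the parameter names and the scoring rule are all hypothetical stand-ins, and only the standard library is assumed. It simply runs one “experiment” per parameter combination and keeps the parameters alongside the resulting metric, which is the kind of bookkeeping the toolkit is meant to take off your hands.

```python
import itertools
import json
from concurrent.futures import ThreadPoolExecutor


def train(params):
    """Stand-in for a real training job: returns a deterministic fake score."""
    # Hypothetical scoring rule, used only so the example has something to record.
    score = 1.0 / (1.0 + abs(params["learning_rate"] - 0.1) + 0.01 * params["depth"])
    return {"params": params, "metrics": {"score": round(score, 4)}}


def run_experiments(grid, max_workers=4):
    """Run one experiment per parameter combination and collect the records."""
    keys = sorted(grid)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[k] for k in keys))
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so records line up with combos.
        return list(pool.map(train, combos))


# Hypothetical parameter grid: 3 learning rates x 2 depths = 6 runs.
grid = {"learning_rate": [0.01, 0.1, 0.5], "depth": [2, 4]}
records = run_experiments(grid)

# Because every run keeps its parameters, the best one can be traced back.
best = max(records, key=lambda r: r["metrics"]["score"])
print(json.dumps(best, sort_keys=True))
```

Each record carries the parameters that produced it, so the best-scoring run can be identified and repeated later. A real setup would also need to capture the code version and the input data, which this sketch leaves out.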
Releasing the tool today at the Spark + AI Summit in San Francisco, CEO Ali Ghodsi said: “To derive value from AI, enterprises are dependent on their existing data and ability to iteratively do machine learning on massive datasets. Today’s data engineers and data scientists use numerous, disconnected tools to accomplish this, including a zoo of machine learning frameworks.”
He added: “Both organizational and technology silos create friction and slow down projects, becoming an impediment to the highly iterative nature of AI projects. Unified Analytics is the way to increase collaboration between data engineers and data scientists and unify data processing and AI technologies.”
Engineering giant Bechtel is an early customer, the company said, with Bechtel’s principal Big Data architect, Justin Leto, saying it “provides our data scientists with usable data, and keeps our engineers focused on AI solutions in production instead of troubleshooting ops issues.”
As Zaharia puts it: “MLflow is designed to work with any ML library, algorithm, deployment tool or language. It’s built around REST APIs and simple data formats (e.g., a model can be viewed as a lambda function) that can be used from a variety of tools, instead of only providing a small set of built-in functionality. This also makes it easy to add MLflow to your existing ML code so you can benefit from it immediately, and to share code using any ML library that others in your organization can run.”
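Zaharia’s remark that “a model can be viewed as a lambda function” is easier to picture with a small example. The Python sketch below is an illustration of the general idea only, not MLflow’s actual model format or API: `save_model` and `load_model` are hypothetical helpers, and the “simple data format” is assumed here to be plain JSON holding a linear model’s weights.

```python
import json


def save_model(weights, bias):
    """Serialise a linear model to a plain JSON string (a simple data format)."""
    return json.dumps({"weights": weights, "bias": bias})


def load_model(payload):
    """Turn the JSON payload back into an ordinary function: features -> prediction."""
    spec = json.loads(payload)
    weights, bias = spec["weights"], spec["bias"]
    return lambda features: sum(w * x for w, x in zip(weights, features)) + bias


# Any tool that can read JSON and call a function can now use the model.
payload = save_model([2.0, -1.0], 0.5)
predict = load_model(payload)
print(predict([3.0, 1.0]))
```

Because the caller only ever sees a function and a text payload, it does not need to know which library trained the model, which is the property the quote is pointing at.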
Did somebody mention GitHub? Here’s the MLflow repo.