Building a Better Delta Lake with Talend and Databricks

With the introduction of Delta Lake last year, a well-tested pattern of building out the bronze, silver, and gold data architecture approach has proven useful. This session will review how to use Talend Data Fabric to accelerate the development of a Delta Lake using highly productive, scalable, and enterprise-ready data flow tools. Covered in this session are demonstrations of ingesting ‘Bronze’ data, refining ‘Silver’ data tables, and performing Feature Engineering for ‘Gold’ tables.

Watch more Spark + AI sessions here
Try Databricks for free

Video Transcript

– Hi everyone, I’m excited to give you an update on MLflow, the open source machine learning platform.

Simplifying Model Development and Management

So I’ll start today by talking about the need for machine learning platforms and what these software systems are, then I’ll give you an update on the MLflow open source community, where we have some exciting announcements, and finally I’ll talk about what’s coming next for MLflow. I think everyone at this conference is aware that machine learning is transforming all major industries, and it’s involved in day-to-day and even second-to-second decisions in a lot of applications and business processes.

ML is Transforming All Major Industries

But at the same time, despite the huge potential and value of machine learning, building machine learning applications is highly complex and it’s quite a bit more difficult than traditional software engineering.

But Building ML Applications is Complex

So this is for a few reasons. First of all, machine learning development is a continuous, iterative process: the real world keeps changing, and so ML applications and models have to be continuously updated over time. Second, these applications are highly dependent on data, so you need to have very solid and reliable data pipelines both during training and in production. And finally, as a result of this, many different teams and systems have to be involved in these applications, including data engineering, machine learning or data science, and application development teams. And this complexity makes it very challenging to build and operate machine learning applications. Now, as a result of this complexity, there’s actually a whole new category of solutions that’s emerging in the industry called machine learning platforms.

A Solution is Emerging: ML Platforms

These are software platforms to manage the machine learning application lifecycle, all the way from data to experimentation and finally to production. And these have already been very successful in the largest companies that use machine learning. So for example, Google’s TFX platform, Facebook’s FBLearner and Uber’s Michelangelo are all internal platforms that power hundreds of different applications at these companies. But we found that every company using machine learning in a serious way is developing some kind of platform, regardless of their scale. And these platforms have some key benefits. The main one is that they standardize the ML development and management process within each company to make it much simpler. But the way that companies are building them today also has some significant challenges. In particular, today each company is building and maintaining its own platform, and that’s a lot of replicated work. It’s also quite difficult to do, because machine learning technology is changing very quickly. So if you’re developing one of these platforms, you constantly have to add support for new versions of ML libraries, you know, new algorithms, new deployment processes and so on. So at Databricks, when we started looking at this problem of how to provide an ML platform about two years ago, we asked ourselves whether we could take a very different approach. Unlike these existing company-internal platforms, we wanted to create a platform that’s open source and that a community can collaborate on, so that it’s very easy for everyone to have the latest and greatest support in their machine learning platform. And that’s what we did with MLflow, the first open source end-to-end machine learning platform, which we actually launched two years ago at this same conference, the Spark + AI Summit.

An Open Source ML Platform

So MLflow consists of four components that work together to manage the machine learning lifecycle. They include MLflow Tracking for experiment management, MLflow Projects for reproducible runs, MLflow Models to package models and then deploy them in a reliable way to different deployment back ends, and finally the Model Registry, which is the newest component; it’s a centralized hub to share and review models for an organization. And these components all work together, any individual involved in machine learning can use them, and they’re also designed around this open API philosophy where everything is using REST APIs and command line interfaces that are easy to integrate with existing tools. So in particular, this means you can use any programming language with MLflow through these APIs, and you can also use any machine learning library. The community has actually built great integrations with a lot of very popular libraries, but it’s also very easy to integrate MLflow with your own custom processes. So when we launched MLflow, we had no idea how this project would do. You know, would it be possible to design an open source platform that can actually provide value to all the companies that have to build their own custom process, and also, would there be any kind of community forming around it? And we were really blown away by how quickly the community has grown.

MLflow Community Growth

Today, two years later, MLflow is already up to 2 million downloads per month just in Python alone, which is quite a bit more than many individual machine learning libraries, and it’s also up to 200 contributors from 100 different organizations. You know, just for comparison, our team working on it at Databricks is fewer than 20 engineers. We’re also seeing very high usage: just on our own cloud platform, we see over a million experiment runs tracked per week with MLflow and 100,000 models registered that people want to share with their team and review. And these numbers, both usage and downloads, are growing at a rate of four times year over year, which is very fast growth for a project at this stage. Just to give you an example of the breadth of use cases, at this summit we have, you know, a number of talks on MLflow, including the ones that I’m highlighting here. So T-Mobile has been using MLflow to maintain their ad fraud detection platform, where they need to monitor over 200 metrics from different data sources to make sure that the models are not drifting and are giving accurate results. ExxonMobil is using MLflow to deploy and then monitor thousands of models for predictive maintenance across their business. And finally, Virgin Hyperloop One is designing this high-speed Hyperloop transport system, and they’re using MLflow for experiment management to look at the results of thousands of simulations about all aspects of the system, not just machine learning, and they’ve been able to increase, you know, the projected efficiency of the system quite dramatically by running all these experiments. So these are just some example use cases that you’ll see full talks about.

Because the community has been growing so quickly, we also wanted to make sure that it can keep doing that in the future, and I’m really excited to announce at the summit that Databricks has donated MLflow to the Linux Foundation, so there’s a large, you know, nonprofit, vendor-neutral foundation that’s managing the project, and that will make it very easy for a wide range of organizations to continue collaborating on MLflow.

Announcing: MLflow joins the Linux Foundation

And we’re really excited to see how this can lead to even more organizations building on and contributing to MLflow in the next few years. Okay, so that’s been a little bit of background on MLflow and an update on the community. The last thing that I wanna talk about today is what’s next. At Databricks, we’re doubling down on investment in MLflow, and quite a few other companies are contributing heavily to it as well. So in particular, I’m gonna talk about three ongoing efforts: Autologging, Model governance and Model deployment. So let’s start with Autologging. In MLflow it’s easy to record custom information about your training runs.

Updated in 1.8

And one of the tools that we added to do that is these integrations, called Autologging, into a lot of popular machine learning libraries. So you can just write one line of code and MLflow will automatically record a lot of details about the models you trained using that library and the metrics about them: all the parameters in your model, the actual model file itself, so you can then deploy it in different places, all the training curves and so on. We started providing Autologging features last year and we’ve seen, you know, very fast adoption of them, so we’re actually adding quite a few exciting new features. One of the ones that I wanna talk about, which came out recently in MLflow 1.8, is Autologging for Spark data sources, which is our first foray into data versioning with MLflow. So if you’re using Spark to create your features or your training pipeline, you can turn on Spark Autologging and automatically record exactly which data was read. And in addition to that, if you’re using Delta Lake, which has support for table versioning and traveling back in time to see an old version of the data, we also record exactly which version number was used. So what this means is that, if you’re running your job with Spark Autologging, you’ll get this information about what was read and you can then reload the exact same table version later. And this is incredibly useful for debugging your model later on, if you want to see what data went into it, for training new models on the same data and so on. It’s one of the biggest pain points that we’ve seen users have when they manage data and they’re trying to do machine learning. So this is already out, and we’re excited to see what people do with it.

There’s also MLflow Autologging support in quite a few other libraries. There are six libraries that are already supported in the main repository, there’s also a team at Facebook that’s contributing PyTorch Autologging, at Databricks we’re working on scikit-learn, and we expect to see new Autologging integrations with other systems as well. The final new feature I wanted to mention with Autologging is that if you’re a Databricks customer, we’ve integrated Autologging with the cluster management and environment features in Databricks.
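
As a rough illustration of the Spark data source Autologging described above (turned on with `mlflow.spark.autolog()`), the sketch below shows how a recorded path, format and Delta version could be turned back into reader options. It assumes the information is stored as comma-separated `key=value` pairs, one data source per line; that layout, the sample path and the helper names are assumptions made for this example, and paths containing commas are not handled.

```python
from typing import Dict, List


def parse_datasource_info(tag_value: str) -> List[Dict[str, str]]:
    """Parse one data source per line, each a comma-separated list of key=value pairs."""
    sources = []
    for line in tag_value.splitlines():
        line = line.strip()
        if not line:
            continue
        entry = {}
        for field in line.split(","):
            key, _, value = field.partition("=")
            entry[key.strip()] = value.strip()
        sources.append(entry)
    return sources


def reload_options(source: Dict[str, str]) -> Dict[str, str]:
    """Build reader options that pin a Delta table to the recorded version."""
    options = {}
    if source.get("format") == "delta" and "version" in source:
        # Delta Lake time travel reads a fixed table version via `versionAsOf`.
        options["versionAsOf"] = source["version"]
    return options


# Hypothetical logged value; the path and version are made up for illustration.
logged = "path=dbfs:/tmp/features,version=3,format=delta"
source = parse_datasource_info(logged)[0]
print(source["path"], reload_options(source))
# With a live SparkSession, the same table version could then be re-read with:
#   spark.read.format(source["format"]).options(**reload_options(source)).load(source["path"])
```

Keeping the version next to the path is what makes the later reload exact: the path alone would return whatever the table contains today.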

Environment Autologging and Cloning on Databricks

So if you are doing an experiment on Databricks, for example in a notebook or in a job, we will automatically record both the version of the notebook that you used, as an exact snapshot of it, and also the cluster configuration and any libraries that you attached. And then we provide this Clone Run button, so it’s very easy to create a new cluster with exactly the same configuration and libraries as your notebook, create a snapshot of the notebook as it looked back then, and run it again or modify it. So this really shows the power of Autologging: it’s going to be super easy now to get the exact same environment and reproduce and iterate on results. Okay, so that’s the first feature I wanted to talk about, Autologging. The next important set of features is, of course, for once you’ve done a bunch of experiments and you’ve built machine learning models, and we see that one of the biggest pain points for organizations is around model governance, review and deployment. So we’re actually adding a couple of very exciting features to MLflow Models and the Model Registry to help further with this process. The first feature is MLflow Model Schema.

New in 1.9

So we found that one of the most common pain points with deploying models is that, you know, the data that the model was trained on might have different feature names or, you know, different properties than the data that you have in production, or even the data you used in your previous model. And so we’re extending the MLflow model format to include support for Schemas, which store the input and output data types for your models. And whenever you call one of the log model APIs in MLflow, it will now also record the Schema, so it will record, you know, what fields are required as input for your model and also what it produces. And then we’ll use these in the rest of the MLflow tools to check for compatibility. So if you’re deploying a model that needs to work on a different type of data from your previous one, you can just compare the two models and get a warning that they are incompatible with each other. And we think this will be extremely useful, you know, when you plug it through the API into different deployment tools and other custom review tools that you might have. So that’s Model Schemas. The second feature I wanna talk about is extending the Model Registry with custom Tags.
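
The compatibility check described above can be sketched in a few lines. This is a minimal illustration, not MLflow’s own schema format or comparison logic: a schema is reduced to a plain mapping from input name to type name, and the function name and the sample columns are hypothetical.

```python
from typing import Dict, List

# A schema here is just input name -> type name; the real model format is richer.
Schema = Dict[str, str]


def schema_incompatibilities(old: Schema, new: Schema) -> List[str]:
    """List the reasons a caller written for `old` inputs would break on `new`."""
    problems = []
    for name, dtype in new.items():
        if name not in old:
            problems.append(f"new required input '{name}' ({dtype})")
        elif old[name] != dtype:
            problems.append(f"input '{name}' changed type {old[name]} -> {dtype}")
    return problems


v1_inputs = {"text": "string"}
v3_inputs = {"text": "string", "source": "string"}  # hypothetical extra feature

print(schema_incompatibilities(v1_inputs, v1_inputs))  # identical schemas: no problems
print(schema_incompatibilities(v1_inputs, v3_inputs))
```

The check is deliberately one-directional: an input that the new model no longer needs does not break an existing caller, while a new or retyped input does.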


So we found that a lot of organizations have very custom internal processes for validating models. For example, maybe your model has to pass a legal review for GDPR compliance, or maybe it has to pass a performance test so that you can check whether it’s fast enough to deploy on edge devices. So we’re adding these Tags as a mechanism to add your own custom metadata for these models and keep track of their state. This also integrates with the APIs, so you can run automatic CI/CD tools that test your models, add these Tags and check off on them, and make it very easy to check whether your model is ready for deployment. And we think this will help a lot of workflows and a lot of model management get quite a bit easier and more automated. Okay, and the final set of features I wanna talk about is model deployment. MLflow has integrations with a lot of model deployment systems, and it’s very easy to push your model to them. But because we’re seeing so many of them contributed, we actually wanted to make deployments a first-class concept and give you a single API to manage all these types of deployments.
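
A small sketch of how a CI/CD job might use such Tags as a deployment gate. The tag names, the `passed` and `pending` values and the helper function are hypothetical, and a plain dictionary stands in for the Tags stored on a model version.

```python
from typing import Dict, Iterable, List


def missing_checks(tags: Dict[str, str], required: Iterable[str]) -> List[str]:
    """Return the required checks that are not tagged as passed."""
    return [name for name in required if tags.get(name) != "passed"]


# Hypothetical tag names and values for one model version.
REQUIRED_CHECKS = ["gdpr_review", "edge_latency_test"]
version_tags = {"gdpr_review": "passed", "edge_latency_test": "pending"}

blocking = missing_checks(version_tags, REQUIRED_CHECKS)
print("ready for deployment" if not blocking else f"blocked by: {blocking}")

# A CI job would flip the tag once its test succeeds.
version_tags["edge_latency_test"] = "passed"
print(missing_checks(version_tags, REQUIRED_CHECKS))
```

Treating a missing tag the same as a failed one keeps the gate conservative: a model version nobody has reviewed is never considered ready.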

MLflow Deployments API

And so we’re creating this deployment API for managing and creating deployment endpoints that will give you the same commands to deploy to these different kinds of environments, so you don’t have to worry about the individual details. This is already being used to develop two new endpoints for RedisAI and Google Cloud Platform. And we’re also porting a lot of the existing integrations that we have for different deployment systems to go behind this API. So this will give you a simple, uniform way to manage these deployments and push the same model to different serving platforms. And for Databricks customers, we’re also really excited to announce a model serving system integrated directly into Databricks.
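
The idea of one set of commands across many deployment targets can be sketched as a common interface plus a lookup by target name. This illustrates the pattern only and is not the MLflow Deployments API itself: the class names, the target names and the in-memory stand-in are all hypothetical.

```python
from abc import ABC, abstractmethod
from typing import Dict


class DeploymentClient(ABC):
    """One interface that every deployment target implements."""

    @abstractmethod
    def create_deployment(self, name: str, model_uri: str) -> Dict[str, str]:
        ...


class InMemoryClient(DeploymentClient):
    """Stand-in target that only records what it was asked to deploy."""

    def __init__(self, target: str) -> None:
        self.target = target
        self.deployments: Dict[str, str] = {}

    def create_deployment(self, name: str, model_uri: str) -> Dict[str, str]:
        self.deployments[name] = model_uri
        return {"name": name, "target": self.target, "model_uri": model_uri}


# Hypothetical target names mapped to client factories.
_CLIENTS = {
    "target-a": lambda: InMemoryClient("target-a"),
    "target-b": lambda: InMemoryClient("target-b"),
}


def get_client(target: str) -> DeploymentClient:
    """Look up the client for a target so callers never touch target-specific code."""
    try:
        return _CLIENTS[target]()
    except KeyError:
        raise ValueError(f"unknown deployment target: {target}") from None


# The same call deploys the same model to two different targets.
for target in ("target-a", "target-b"):
    print(get_client(target).create_deployment("emoji", "models:/text_to_sentiment/1"))
```

Adding a new serving platform then means registering one more client, while the calling code stays unchanged.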

Model Serving on Databricks

So if you use open source MLflow today, you can already track models, use the Model Registry to manage the different versions of them, and push them to different serving systems. But if you happen to be doing this on Databricks, there is also now a turnkey way to directly deploy models from the registry: as soon as you promote a new model to production, it will automatically be served at a REST endpoint. And so data scientists can do this without involving anyone else at all in upgrading the service, and of course, you can roll back by just querying a different version or moving the version out of production. So it’s a very simple turnkey way to do this, but we think it will make it even easier to productionize your models if you are using our hosted MLflow service. Okay, so that’s an overview of the new features. It’s nice to share about them, but it’s even more exciting to see them in action, and for this purpose, I’d like to invite Sue Ann Hong, staff engineer on the MLflow team at Databricks, to give you a demo.

Demo: Model Management & Serving

– Thanks Matteo. There’s a lot going on in the world these days, so I’ve been reading a ton of Puppy News.

Virus Outbreak

But even Puppy News can be overwhelming, for example, as a new mum, I don’t know how I feel about 14 puppies. So I decided to build an AI backed browser extension to tell me how to feel. Let’s turn this on.

All right, I think emojis make the internet better, and I think I agree with 14 puppies being good news; they’re very cute. See, it’s not always correct, but it looks pretty good.

Today, I wanna tell you how I built this real time machine learning application in just a few minutes and how I can make it even better using MLflow Experiment Tracking, Model Registry and Databricks Model Serving.

First I need a sentiment analysis algorithm that will take text and return an emoji. I’m using VADER here, which is a rule-based algorithm. I’ve constructed the model and I wanna log some information about it to MLflow Tracking for posterity. I’m going to log the algorithm name as a parameter, as well as the number of sentiments, and we’ll also log the test accuracy as a metric.
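
To make the idea of a rule-based text-to-emoji model concrete, here is a deliberately tiny sketch. It is not the algorithm used in the demo: the lexicon, the scores, the single-word negation rule and the choice of three output emoji are all made up for illustration.

```python
import re

# Tiny hand-written lexicon; the words and scores are made up for illustration.
LEXICON = {"good": 1.0, "cute": 1.5, "happy": 1.5, "bad": -1.0, "outbreak": -1.5, "sad": -1.5}
NEGATIONS = {"not", "no", "never"}


def sentiment_score(text: str) -> float:
    """Sum lexicon scores, flipping the sign of the word right after a negation."""
    score = 0.0
    negate = False
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in NEGATIONS:
            negate = True
            continue
        value = LEXICON.get(word, 0.0)
        score += -value if negate else value
        negate = False
    return score


def text_to_emoji(text: str) -> str:
    """Map a headline to one of three emoji based on the sign of its score."""
    score = sentiment_score(text)
    if score > 0:
        return "🙂"
    if score < 0:
        return "🙁"
    return "😐"


for headline in ("14 cute puppies", "Virus outbreak", "Puppies"):
    print(headline, text_to_emoji(headline))
```

In this sketch the algorithm name and the number of sentiments (three) would be natural values to log as parameters, with accuracy on a labeled test set logged as the metric.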

Accuracy is not bad, so we’ll try putting this into production by logging the model to the MLflow run and the Model Registry by specifying this registered model name parameter. We’ll also log the Model Schema, or Signature, which is a new feature that Matteo mentioned earlier. This model takes in a string and outputs another string, which should be the emoji. We’ve created version one of the model text-to-sentiment in the Model Registry. To see all that we’ve logged here, we can go to the Runs sidebar, which lists all of the runs that were logged from this notebook. Clicking on one takes us to the run details page, which has a link to the source notebook snapshot, parameters, metrics and the model that we just logged. To go to the Model Registry, we can click on this link or we can use the Models sidebar and search for it.

We have our first one here and I really, really wanna use it in my application right now. So I’m going to transition its stage to Production.

Now we can go to the new Serving tab and enable real-time model serving behind a REST API interface. So with one click, we’re putting up an endpoint that will call these machine learning models on demand.

Our version one is pending, but once it’s ready, it will be available at this URL. The URL is made up of the workspace domain, the model name, and either the version number or the stage that it’s in. So this production URL for this model is pointing to version one right now. Let me copy this URL, and while the endpoint comes up, I’m gonna go work on my browser extension. So here’s the part of my browser extension code that calls the scoring endpoint. I’m going to make a simple HTTP request, and I can paste in my URL here, where you can see the old model I was using. And when the scoring endpoint returns an emoji, this onLoad callback function will modify the headlines with the emojis.
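
The scoring call described above can be sketched as follows. The demo’s browser extension makes the request from JavaScript; this sketch uses Python’s standard library instead and only prepares the request without sending it. The `/model/.../invocations` path layout, the JSON payload shape, the bearer token header, the workspace domain and the model name are all assumptions made for illustration.

```python
import json
import urllib.request
from urllib.parse import quote


def scoring_url(workspace_domain: str, model_name: str, version_or_stage: str) -> str:
    """Compose the endpoint URL from the three parts named in the demo."""
    # The "/model/.../invocations" layout is an assumption, not taken from the talk.
    return (
        f"https://{workspace_domain}/model/"
        f"{quote(model_name, safe='')}/{quote(version_or_stage, safe='')}/invocations"
    )


def build_request(url: str, token: str, headline: str) -> urllib.request.Request:
    """Prepare (but do not send) a JSON POST carrying one headline."""
    body = json.dumps([{"text": headline}]).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
    )


# Hypothetical workspace, model name and token.
url = scoring_url("workspace.example.com", "text_to_sentiment", "Production")
request = build_request(url, "TOKEN", "14 puppies born")
print(request.get_method(), request.full_url)
# Sending would be: urllib.request.urlopen(request).read()
```

Because the last path segment can be a stage name rather than a version number, a client that targets `Production` does not need to change when a different version is promoted.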

All right, our endpoints are ready now, so let’s go update the browser plugin. So we’ve changed the endpoint in the code, and it still works. So in just a few minutes, we set up a REST API endpoint for model scoring and built an application that changes web content in real time. Not bad. But I promised to make it better, so I’ve asked a friend of mine to build a better model for us.

And it looks like she has been very busy and added two new versions to our model. And the serving system has automatically created endpoints for them already. Now I want to go and see the latest version, since it’s probably the most sophisticated one. It’s using TorchMoji, a PyTorch implementation of DeepMoji, a BiLSTM-based text-to-emoji translation model. And she’s also added contextual features for better accuracy. This sounds fantastic, but before I transition this model to production, I need to do some due diligence. So I’m going to compare the new versions to the old one, and in this view, we can see the information about each version that comes from both the Model Registry and the source runs, like the parameters that we had logged. The yellow rows highlight the differences between the versions, and it looks like version three has a different model Schema than versions one and two, which makes sense, since my friend said she added new features to improve the accuracy. So if I deploy this version now, my application will break, because it’s not sending in these new features. But we have version two here, which has better accuracy than version one, uses TorchMoji as well, and has a model Schema matching version one. So let’s put this one into production.

Now back in the serving tab, we see that the production URL points to version two now, which is great because it means I don’t have to change my application code to use the new version. I am feeling a little nervous about that model Schema difference I saw, so I’m going to try out the endpoint using this test box.
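
The behavior above, where the production URL follows whichever version currently holds the Production stage, amounts to a small alias table. The sketch below is an illustration under that assumption and not how the serving system is implemented; the class and method names are hypothetical.

```python
from typing import Dict, Set


class ModelRoutes:
    """Maps both version numbers and stage names to a concrete model version."""

    def __init__(self) -> None:
        self._stage_to_version: Dict[str, int] = {}
        self._versions: Set[int] = set()

    def add_version(self, version: int) -> None:
        self._versions.add(version)

    def transition(self, version: int, stage: str) -> None:
        """Point a stage at a version; the previous holder of the stage is replaced."""
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self._stage_to_version[stage] = version

    def resolve(self, version_or_stage: str) -> int:
        """Resolve the last URL segment, e.g. '2' or 'Production', to a version."""
        if version_or_stage.isdigit():
            version = int(version_or_stage)
            if version not in self._versions:
                raise KeyError(f"unknown version {version}")
            return version
        return self._stage_to_version[version_or_stage]


routes = ModelRoutes()
for v in (1, 2, 3):
    routes.add_version(v)

routes.transition(1, "Production")
print(routes.resolve("Production"))  # the application calls the stage, not a version
routes.transition(2, "Production")   # promote version two; the client URL is unchanged
print(routes.resolve("Production"))
routes.transition(1, "Production")   # rolling back is the same operation
print(routes.resolve("Production"), routes.resolve("3"))
```

Pinned version URLs keep working alongside the stage URL, which is what allows a specific version to be tested before it is promoted.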

Version two looks good and version three errors as expected. Now it’s time to see the new version in action.

All right, so I’m very happy, because the model version update went seamlessly and the new model does seem better.

And now my puppy pages are more heartwarming. So in under 10 minutes, we made the internet better, twice, and even if you’re not a subscriber of Puppy News, you can imagine applying the same techniques to other important domains like support tickets, user surveys or customer feedback. To recap, today I showed you easy deployment of models, model versioning and collaboration, and finally seamless application updates using MLflow’s end-to-end model management tools. Thank you for watching, and I hope this demo brought a smile to your face. Back to you Matteo, and here’s a puppy.

– Whoa, thanks Sue Ann, that was a great demo. So the last thing I wanna end with is how to get started with MLflow.

Getting Started with MLflow

We made it very easy to get started with the platform: you can just install it through pip and it’ll just run on your laptop. You don’t need to set up a server or create an account anywhere to use it; just do our tutorial and you’re up and running. If you want to try it on Databricks, it’s available in the free Databricks Community Edition, and we also have an in-depth tutorial today at the summit at 11:00 am for you to try it out. Finally, my talk was just scratching the surface of what’s possible with MLflow.

About Michael Destein


Michael Destein is a data junkie. Going back to the 1990’s he has been focused on data integration, data management, and data analytics. With roles in solution architecture, product management, product marketing, industry solutions, sales, and now leading ISV partnerships for Talend he knows how to apply data and data management to complex business problems and extract superior business results.

About Cameron Davie


Cameron works as a technical advisor and consultant with Talend’s most strategic technology partners, understanding and sharing the details of the technical integrations between Talend and our partners’ array of products and platforms.