A typical machine learning pipeline begins as a series of preprocessing steps followed by experimentation, optimization and model-tuning, and, finally, deployment. Jupyter notebooks have become a hugely popular tool for data scientists and other machine learning practitioners to explore and experiment as part of this workflow, due to the flexibility and interactivity they provide. However, with notebooks it is often a challenge to move from the experimentation phase to creating a robust, modular and production-grade end-to-end AI pipeline.
Elyra is a set of open-source, AI-centric extensions to JupyterLab. Elyra provides a visual editor for building notebook-based pipelines that simplifies the conversion of multiple notebooks into batch jobs or workflows. These workflows can be executed both locally (during the experimentation phase) and on Kubernetes via Kubeflow Pipelines for production deployment. In this way, Elyra combines the flexibility and ease of use of notebooks and JupyterLab with the production-grade qualities of Kubeflow (and, in future, potentially other Kubernetes-based orchestration platforms).
In this talk I introduce Elyra and its capabilities, then give a deep dive into Elyra’s pipeline editor and the underlying pipeline execution mechanics, showing a demo of using Elyra to construct an end-to-end analytics and machine learning pipeline. I will also explore how to integrate and scale out model-tuning as well as deployment via Kubeflow Serving.
Speaker: Nick Pentreath
Hi everyone, and welcome to this Data + AI Summit Europe talk on notebook-based AI pipelines with Elyra and Kubeflow. I’m Nick Pentreath, a principal engineer at IBM. You can find me @MLnick on Twitter, GitHub and LinkedIn. I work within a team at IBM called the Center for Open-Source Data and AI Technologies, or CODAIT. We are focused on machine learning. I have a long background in the Apache Spark project, where I’m a committer and PMC member, and I’m the author of “Machine Learning with Spark.” I’ve spoken at various conferences, meetups and, more recently, many online events around the world on topics related to machine learning and data science.
Before we begin, a little bit about CODAIT, the Center for Open-Source Data and AI Technologies: we’re a team of over 30 open-source developers within IBM, and we aim to improve the enterprise AI lifecycle in open source. We contribute to and advocate for projects that are part of IBM’s AI offerings and are foundational to those offerings. This includes Apache Spark as a core component, of course, as well as the Python data science stack, open-source projects for open data and model sharing and exchange, deep learning frameworks, AI ethics, model serving, and orchestration on Kubernetes. Today we’re going to talk a little bit about that Python data science stack, in particular Jupyter, and a project that we developed within the team called Elyra.
We’ll start with an overview of the machine learning workflow, then talk about Jupyter, Jupyter Notebooks and Elyra, give a live demo of some of Elyra’s capabilities, and then wrap up with a conclusion.
To begin with, the machine learning workflow typically starts with data. We take that data and we analyze it and explore it. Typically, in order to use raw data for machine learning, we have to do some kind of pre-processing. We can’t just throw that data straight into the machine learning model.
It doesn’t arrive in an easily packaged format of numerical vectors. We need to extract features, do pre-processing and get the data into a format that can be used. We then go through model training and selection. Once we have a trained model, we deploy it. That’s typically where a lot of the discussion ends, but even once a model is deployed, we still need to do quite a lot of work to maintain that model and monitor the predictions that it makes. A model that’s out there running in the wild will also be impacting its own training data, and new data will be arriving. So we complete this workflow and it becomes something of a loop.
Now, this workflow spans teams. The data side is typically the domain of your data engineers, who are responsible for data storage, providing schemas, managing access, data governance and so on. The loop in the middle is the province of your data scientists and researchers. They’re typically pulling data into their process and doing the analytics, pre-processing, feature selection and extraction, and model training. The final team is the machine learning engineers and production engineers, who are responsible for the production infrastructure and are actually deploying models at scale.
There’s a lot of potential conflict between these teams, and also a wide variety of tools being used. There’s a wide variety of both standard and non-standard data formats for analysis and data visualization, as well as machine learning and data science toolkits and frameworks. Each team will typically have multiple frameworks and toolkits in use, and every data scientist or researcher has their own favorite tool, or an older version that has to be supported.
And when it comes to deployment, you again have a variety of formats, mechanisms and ways of deploying things, and even the languages and infrastructure that are used for large-scale production deployment are typically quite different from what you have in your data science and research workloads. So there’s a lot of potential conflict here. Many of these teams need to work across silos and across frameworks, and the production machine learning system has to support all of them.
Now, a core part of the data science and research workflow is iteration and experimentation, and this happens at each phase of the process. When data scientists are analyzing data, the process doesn’t just happen once. You don’t just go through this workflow of loading, data cleansing, exploration, interpretation and analysis, and then you’re done. Typically, you start with a problem that may be well defined, if you’re lucky, or fairly ill-defined. That problem is a set of questions that the business needs answered, or a particular use case that needs to be solved with data. The data scientist will load the data. It almost never comes in a nice clean format and there are many discrete data sources, so a lot of data cleansing has to happen. Then you start asking questions of the data: doing analytics and exploration, computing statistics and aggregations, and creating visualizations, dashboards and reports that are then interpreted to give an outcome. But typically one of two things happens, and often both: either the answers that come back create more questions themselves, or potential issues are found in the data itself or in some part of the process. And then you have this iterative process of refining the workflow.
So you go back to the beginning, maybe fixing some of the data, maybe bringing in more or different data sources, adjusting the way the data is cleaned, or adjusting the analytics itself. So there’s a lot of iterative workflow that happens here.
It’s similar in the machine learning space, where we follow this process of taking the raw data, extracting features and pre-processing those features into a format that’s usable for machine learning training. Then there’s typically a model selection process: you don’t just train one model. Typically you’re training many different types of models and many different pipelines to try and find the one that fits best. Then there’s the evaluation process, where you’re evaluating the model on a set of test data based on some metric. This is also an iterative process that’s heavy on experimentation. Data scientists need to try things out and test different combinations of parameters, pre-processing steps, feature extractors and models. So this is also a loop where the process and the workflow are constantly refined.
Experimentation and iteration are critical, and that’s why notebooks, and in particular Jupyter Notebooks, have become the de facto standard for this type of iterative, interactive work. Notebooks bring a wealth of power to the data scientist and machine learning researcher in this workflow, but there are some issues with them. What you typically see happen is that the notebook where the initial exploration is happening grows and grows and grows, and it becomes a monolithic structure that does everything in one place. That makes it increasingly tough to extract pieces of code and functions, create little mini libraries and modularize the code. So it becomes a bit of a behemoth, and that makes it a lot more difficult to productionize.
So rather than having nicely modular pieces of code or functionality that are connected, you have this one big notebook, which can’t just be chucked over the wall to production. It’s also very difficult to scale up and deploy notebooks in a scalable manner.
To address some of these issues, Elyra was created by our team. Elyra is a set of AI-centric extensions to JupyterLab. JupyterLab is an open-source notebook environment that is highly extensible, and Elyra is a set of these extensions. It’s actually named for Elara, which is one of the moons of Jupiter, and you can see Elyra here orbiting the Jupyter ecosystem. We’ll go through a couple of the core components and features of Elyra before we walk through it in a demo.
One of the most important pieces is the visual pipeline editor. As we can see here, this is a way to visually build AI and data science pipelines consisting of both notebooks and Python scripts. The pipeline is a DAG, a directed acyclic graph. It can have multiple input and output nodes, and multiple branching structures. This one shows the typical process of loading data, performing some data processing and cleansing, and then splitting into multiple downstream tasks, some of which may be analytics and data science related, and some of which may be machine learning related. The power of Elyra is that this one pipeline specification can be run both locally and remotely, allowing you to test things out locally really easily and quickly, but also, when the time comes, to scale up by running on supported platforms, which at the moment means Kubeflow Pipelines. There may be more coming down the line that are planned, but Kubeflow Pipelines is the main target for remote workloads.
This means that once the iterative experimentation phase is completed locally, the code inside the notebooks can be modularized into a set of notebook modules, which are the nodes in the graph, as well as potentially Python scripts. These can all be packaged together as a batch job that runs on Kubeflow. Each node is executed in its own isolated container environment and has access to the full cluster resources.
Related to this is the ability to execute single notebooks as batch jobs. This is effectively a single-node pipeline, and it is also possible for Python scripts. Python scripts are also first-class citizens: they can be edited within Elyra, in the editor, and executed against either local or cloud-based resources.
There are a few other features, such as the automatic generation of a table of contents from the markdown within notebooks, based on the heading structure, which allows you to navigate easily between the different headings within your notebook. There is a code snippets plugin that allows you to create reusable code snippets across various languages and insert them into your notebooks. And finally, there is a tight integration with Git for tracking project changes. This allows you to inspect the diffs and compare your changes, as well as to easily import remote projects from Git into your Elyra workspace, which means that you can easily share code and projects with teammates.
Elyra is an open-source project, and there are a few ways to get started. You can go to Binder and start it with no installation, from your web browser. You can use the Docker containers, or you can install Elyra on your local machine. You have the links here to check it out.
Okay, so we’ve gone through some of the highlights and features, but I think what speaks the loudest is a live demo of some of the functionality. All right, so here we see Elyra running in JupyterLab.
You can see that the JupyterLab launcher has added these components for Elyra, such as a Python file and a pipeline editor. On the left here, we see the file browser. If you haven’t looked at the pipeline editor before, this is what it looks like. We’ve got a pipeline here that involves loading data from two data sources, processing it, merging it together, and then doing some downstream analytics tasks. In this particular case, we’re going to be using two datasets that are hosted on the Data Asset Exchange, which is one of the other projects coming from CODAIT. The Data Asset Exchange, or DAX, is a place where you can find free and open-source datasets. We’re going to be using two of them, namely the airline flight delay data and the weather dataset for JFK airport. Our aim is both to analyze flight delays and their potential causes, and to try to build a model to see whether we can predict whether a flight will be delayed.
The pipeline editor, as you can see, can handle both Python script files and notebooks. We have the ability to drag these components around and to create comments for each node. We can then connect nodes to each other, and as we can see here, we can have one-to-one relationships as well as many-to-one and one-to-many relationships. So we can create any kind of DAG structure that we wish.
Each of these nodes has a set of properties. You can see here, for example, that the runtime image can be specified. Here we’re going to use Pandas, but you can choose from a few pre-built images that cover the main frameworks, as well as creating your own. We can also see that this particular script uses an input environment variable, which specifies the location from which to download the data, and that it creates some output files.
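The node properties and DAG structure described above can be sketched in plain Python. This is only an illustration and not Elyra's pipeline file format or API: the `PipelineNode` class, the file names, the image name and the `DATASET_URL` variable are all hypothetical. It shows how per-node properties plus dependencies are enough to derive a valid execution order.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class PipelineNode:
    """Hypothetical stand-in for one node (notebook or script) in a pipeline."""
    name: str                                       # notebook or script file
    runtime_image: str                              # container image for remote runs
    env: dict = field(default_factory=dict)         # input environment variables
    outputs: list = field(default_factory=list)     # files made available downstream
    depends_on: list = field(default_factory=list)  # names of upstream nodes


def execution_order(nodes):
    """Return node names ordered so each node comes after all of its upstream nodes."""
    graph = {node.name: set(node.depends_on) for node in nodes}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(graph).static_order())


IMAGE = "example/pandas-image"  # hypothetical image name
nodes = [
    PipelineNode("load_flights.py", IMAGE,
                 env={"DATASET_URL": "<flight data location>"},
                 outputs=["data/flights_raw.csv"]),
    PipelineNode("load_weather.py", IMAGE,
                 env={"DATASET_URL": "<weather data location>"},
                 outputs=["data/weather_raw.csv"]),
    PipelineNode("process_flights.ipynb", IMAGE,
                 outputs=["data/flights.csv"], depends_on=["load_flights.py"]),
    PipelineNode("process_weather.ipynb", IMAGE,
                 outputs=["data/weather.csv"], depends_on=["load_weather.py"]),
    # Many-to-one: the merge node waits for both processing nodes.
    PipelineNode("merge_data.ipynb", IMAGE, outputs=["data/merged.csv"],
                 depends_on=["process_flights.ipynb", "process_weather.ipynb"]),
    # One-to-many: two downstream tasks consume the merged data.
    PipelineNode("analyze_delays.ipynb", IMAGE, depends_on=["merge_data.ipynb"]),
    PipelineNode("predict_delays.ipynb", IMAGE, depends_on=["merge_data.ipynb"]),
]

order = execution_order(nodes)
print(order)
```

Nodes that do not depend on each other may appear in either relative order, which is also what allows independent branches to run in parallel on a cluster.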
We then move on to our data processing, where again we’re specifying a runtime image and some output files that will be coming out of this node. The output files are not strictly required in local mode, but this output file location is used to make the output of each node available to downstream tasks when you’re running on Kubeflow. After we do some pre-processing and de-duplicating, we’re going to merge these two datasets together, and we’re then going to use that merged dataset, which combines the flight delay data and the weather data, to do some analytics and some predictions.
Now, in order to execute, we can just run this workflow and select whether we want to run it on Kubeflow or locally. For now, we’re going to execute locally. As you can see here, this is our JupyterLab console on the command line, and we can see that it’s busy executing. It’s going to be downloading its data and executing each of these notebooks. In local mode, the data is saved to our local file system: you can see that, from our data loading phase, these files have appeared here.
This notebook is busy executing, and in the meantime we can kick off a run on Kubeflow. You can see here that Elyra takes care of packaging all the nodes, putting all the dependencies together, shipping them off to our Kubeflow cluster and submitting the run. This is just a local Kubeflow cluster, but you can see that it is busy executing at the moment, and if we want to, we can look at how it’s doing and see the logs.
I can see that our notebooks have finished executing. Because we were running in local mode, those notebooks are updated in place. If we go ahead and open the flight data notebook, we can see that the cells have been filled in.
This is a typical data processing workflow, where you start by reading in the raw data. It contains a set of records related to flight delays for airports across the U.S. This is a small sample of the data. We clean up the data to give us a certain date range and only the records related to flights originating from JFK airport. We assign airline names to the airline IDs to make things more human-readable, and then we do a bit of column renaming. There’s not a lot of pre-processing in this particular dataset because it’s already pretty clean. We can see here that we’ve got things like the flight date, the airline, the origin, the destination, the departure time, the distance and, of course, the departure delay: whether the flight is actually delayed or not. That’s what we’re going to be analyzing and predicting.
Similarly, we can open up our weather dataset processing notebook and see that we’re stepping through very similar steps here: reading the raw data and cleaning it up. There’s a little more cleaning that has to happen here. In particular, this dataset has a categorical feature, which is a set of delimited strings that represents the weather types present at that particular weather reading, as well as wind speed, visibility, precipitation and so on. So there’s a bit of cleaning up and categorical feature extraction that has to happen.
And we can see that our pipeline has actually finished running in full. If we have a look here, we can see that each of the notebooks has executed, and we’ll be able to work through all of them locally. We also see that the outputs of each node have been written into our data subdirectory here.
Then we go through to merging the data, which simply reads each of the datasets and combines the flight delay data with the weather record for the hour previous to that flight.
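The two steps just described, turning the delimited weather-type string into features and joining each flight to the previous hour's weather reading, can be sketched as follows. This is a minimal illustration in plain Python rather than the demo notebooks' actual code: the column names, the `|` delimiter, the weather codes and the sample rows are all assumptions, and "previous hour" is interpreted here as the hour before the departure time rounded down to the hour.

```python
from datetime import datetime, timedelta


def weather_flags(weather_type, known_types, delimiter="|"):
    """Turn a delimited string such as 'DZ|BR' into 0/1 indicator features."""
    present = {code for code in weather_type.split(delimiter) if code}
    return {f"has_{code}": int(code in present) for code in known_types}


def join_previous_hour(flights, weather, known_types):
    """Attach to each flight the weather reading for the hour before departure."""
    by_hour = {reading["hour"]: reading for reading in weather}
    merged = []
    for flight in flights:
        # Round the departure time down to the hour, then step back one hour.
        dep_hour = flight["departure"].replace(minute=0, second=0, microsecond=0)
        reading = by_hour.get(dep_hour - timedelta(hours=1))
        if reading is None:
            continue  # inner join: drop flights without a matching reading
        row = dict(flight)
        row["wind_speed"] = reading["wind_speed"]
        row.update(weather_flags(reading["weather_type"], known_types))
        merged.append(row)
    return merged


# Made-up sample rows for illustration only.
weather = [
    {"hour": datetime(2019, 1, 1, 13), "weather_type": "DZ|BR", "wind_speed": 12.0},
    {"hour": datetime(2019, 1, 1, 14), "weather_type": "", "wind_speed": 5.0},
]
flights = [
    {"airline": "AA", "departure": datetime(2019, 1, 1, 14, 35), "delayed": 1},
    {"airline": "DL", "departure": datetime(2019, 1, 1, 15, 10), "delayed": 0},
    {"airline": "B6", "departure": datetime(2019, 1, 1, 9, 0), "delayed": 0},
]
known_types = ["DZ", "SN", "TS"]  # hypothetical codes: drizzle, snow, thunderstorm

merged = join_previous_hour(flights, weather, known_types)
for row in merged:
    print(row)
```

With a dataframe library such as Pandas, the same idea is usually expressed as a merge on a derived hour column; the dictionary lookup above just keeps the sketch dependency-free.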
What we want to do here is try to use some of that weather information to see if it can help predict airline delays. This illustrates a very typical piece of the workflow, which is joining two disparate datasets together. Once you’ve done that, most often you’re going to be doing things like training models and analyzing data.
This notebook works through analyzing the causes of flight delays, to see whether flight delays are related in some way to variables such as the day of the week, the departure time of the flight, the airline, the destination, the distance of the flight, and the weather. For example, here we can see the proportion of flights delayed when there’s drizzle and no drizzle present, versus when there’s snow or thunderstorms present. This helps because it gives us an idea of whether these factors play a role in flight delays.
The final step in our pipeline is to try to build a predictive model. This uses the same inputs: again we read the data and do our typical data science steps, such as creating training and test datasets, encoding categorical and numerical variables and combining them together, and then performing a model selection process, using cross-validation across a few different models with a selected metric to decide which model to use. All of this can be accompanied by charts and reports that illustrate cross-validation and evaluation performance. At the end, we take the best model, fit it on our full training set and create a classification report on our test set, with things like ROC curves, precision-recall curves and confusion matrices. And this one is feature importance for our tree-based models. These are some of the typical components of your data science workflow.
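The model selection step described above, cross-validating a few candidate models on a chosen metric and then refitting the winner on the full training set, can be sketched in a few lines. This is a toy illustration under stated assumptions and not the demo notebook's code: the two candidate "models", the synthetic wind-speed data and the choice of accuracy as the metric are all made up for the example, and libraries such as scikit-learn provide this functionality off the shelf.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle range(n) and split it into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def majority_model(train):
    """Baseline: always predict the most common label in the training rows."""
    labels = [y for _, y in train]
    guess = int(sum(labels) * 2 >= len(labels))
    return lambda x: guess


def threshold_model(train):
    """Pick the feature cutoff with the highest training accuracy."""
    best_cut, best_acc = None, -1.0
    for cut in sorted({x for x, _ in train}):
        acc = mean(int((x >= cut) == bool(y)) for x, y in train)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return lambda x: int(x >= best_cut)


def cross_val_accuracy(fit, rows, k=5):
    """Mean held-out accuracy of the model built by `fit` over k folds."""
    scores = []
    for held_out in k_fold_indices(len(rows), k):
        held = set(held_out)
        train = [row for i, row in enumerate(rows) if i not in held]
        model = fit(train)
        scores.append(mean(int(model(x) == y) for x, y in (rows[i] for i in held_out)))
    return mean(scores)


# Synthetic training rows: (wind_speed, delayed), delayed whenever wind_speed >= 20.
train_rows = [(wind, int(wind >= 20)) for wind in range(40)]
test_rows = [(5, 0), (25, 1), (33, 1)]

candidates = {"majority": majority_model, "threshold": threshold_model}
scores = {name: cross_val_accuracy(fit, train_rows) for name, fit in candidates.items()}
best_name = max(scores, key=scores.get)

# Refit the winning candidate on the full training set, then score it on the test set.
final_model = candidates[best_name](train_rows)
test_accuracy = mean(int(final_model(x) == y) for x, y in test_rows)
print(scores, best_name, test_accuracy)
```

Keeping the test rows out of the cross-validation loop mirrors the workflow in the talk: model selection uses only the training data, and the test set is reserved for the final report.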
Now, this particular pipeline is obviously still running and may take a little while, but here we have a pipeline that executed previously. We can see that it follows the exact same DAG that we have here. You can go to some of these nodes and, again, take a look at the logs. What happens is that each of these nodes uploads its results to cloud object storage, or effectively any S3-compatible storage. Here, for example, we see that each notebook as well as the logs is shown. We can look at our predict flight notebook and open the HTML version of it, which gives us results similar to what we saw in the local run. I can go down here and get exactly the same output that we had.
Okay, so that shows a typical pipeline. What we haven’t shown here is model deployment and some more advanced model tuning. But because we are running on Kubeflow, we can take advantage of some of the components that are part of Kubeflow Pipelines. For example, we can link into model tuning with Katib, or we can deploy to Kubeflow Serving once we’ve trained our model.
Okay, so that concludes the demo. I hope that the demo and the presentation have illustrated some of the power and functionality of Elyra. Elyra is still a young project, very active and developing rapidly, and we really welcome community involvement. So please go and give it a try. You can run it locally, you can run it via Docker, or you can run it on Binder. If you check out the link ibm.biz/elyra-demo, the notebooks and pipeline definitions for the flight delay demo that we’ve seen today are available on GitHub, as well as a set of notebooks running on Elyra that illustrate a pipeline we have developed for analyzing COVID-19 data. We encourage you to join the community and get involved; come and help us and be part of this journey. So thank you very much. I also encourage you to check out codait.org.
Find us on Twitter, on GitHub and at developer.ibm.com. Check out the Data Asset Exchange that I mentioned and some other interesting datasets. And finally, I know that feedback is super important, so please leave feedback on this session through the online mechanisms for the Data + AI Summit Europe. Thank you very much.
Nick Pentreath is a principal engineer in IBM's Center for Open-source Data & AI Technology (CODAIT), where he works on machine learning. Previously, he cofounded Graphflow, a machine learning startup focused on recommendations. He has also worked at Goldman Sachs, Cognitive Match, and Mxit. He is a committer and PMC member of the Apache Spark project and author of "Machine Learning with Spark". Nick is passionate about combining commercial focus with machine learning and cutting-edge technology to build intelligent systems that learn from data to add business value.
Nick has presented at over 30 conferences, webinars, meetups and other events around the world including many previous Spark Summits.