Apache Liminal (Incubating)—Orchestrate the Machine Learning Pipeline


Apache Liminal is an end-to-end platform for data engineers & scientists, allowing them to build, train and deploy machine learning models in a robust and agile way. The platform provides the abstractions and declarative capabilities for data extraction & feature engineering followed by model training and serving; using standard tools and libraries (e.g. Airflow, K8S, Spark, scikit-learn, etc.).

Apache Liminal’s goal is to operationalise the machine learning process, allowing data scientists to quickly transition from a successful experiment to an automated pipeline of model training, validation, deployment and inference in production, freeing them from engineering and non-functional tasks, and allowing them to focus on machine learning code and artifacts.

Speakers: Lior Schachter and Aviem Zur


– Hi everybody, thank you for joining us for the first public talk on Apache Liminal, an incubating project. I am Lior, and Aviem will join me in this talk. Please feel free to ask your questions in the chat area during and after the talk. So first, a few words about us. I'm Lior Schachter, the CTO at Natural Intelligence. I started my professional career 20 years ago, building products right after completing my engineering studies. I deep-dived into the area of domain-specific languages, building DSL engines and writing many articles on the subject. In 2009, on joining an ad-tech company, I was first introduced to the Hadoop ecosystem in its early days, as we needed to process billions of daily events. We built this very cool ETL engine, installed it on our machines, and the magic happened. In 2015 I started to work on ML, incorporating machine learning features into my products, so over the last five years I've seen how the ML/AI paradigm evolved and matured to become a de facto standard. Half a year ago we decided to share Liminal as an Apache Incubator project. Aviem, your turn.

– Thank you, Lior. I am Aviem, and I'll be joining this talk as well. I'm the Data Tech Lead at Natural Intelligence, and my specialties are data platforms as well as open source software. In my free time I like to play the card game Magic: The Gathering. Back to you, Lior.

– Great. So we will start by describing the motivation which drove us to develop and share the Liminal project. There has been a tremendous evolution in recent years in our ability to build and run machine learning models; hundreds of solutions and companies have formed, working on maturing existing technology and building new initiatives. I believe that the vision of data scientists moving on their own from research to production is very close to reality. Nevertheless, the ecosystem is still taking shape, and a lot of focus and many resources in academia and industry are going into the next evolution of machine learning, which will increase its agility and robustness and make it accessible even to smaller organizations. The current focus is on AutoML, monitoring and MLOps. So this is the state of the ecosystem; let's jump to Natural Intelligence and see how the stories relate. So, Natural Intelligence: we are a global leader in multi-vertical online comparison marketplaces. Our technology matches consumers and brands, helping consumers make faster, more confident decisions while helping brands grow. We are a bootstrapped company founded in 2009, and we work in a variety of verticals like mortgages, insurance, meal delivery, home security and many, many more. We started our journey to productize core business functionality using machine learning and AI about a year and a half ago. The goal was to take a few areas at the core of our operation and automate them using machine learning. One focus area is website personalization and ranking, which is of course our bread and butter, and the second one is AdWords bidding: we are among the 50 largest AdWords accounts worldwide, and we decided to build our own machine-learning-based bidder. So how did we approach this challenge? I believe we did the same as most companies.
We had a very good data platform which had proven to be effective, robust and scalable, so it made perfect sense to add to it the machine learning ingredients, mainly the Python libraries like scikit-learn, NumPy and many more of course, and a Flask server for online inference. Again, I believe many organizations go down this route: you have a proven stack that you rely on and know, so evolving it to address more use cases is the way to go. The problem we encountered, even a few weeks after we started, is that the data scientists were struggling to make things work. It happened both in the local environment and in staging and production; practically, they couldn't really focus on algorithms. We call this the orchestration barrier, and it stems from the diversity in infra, GCP, AWS, Azure and many more, and the amazingly high number of languages, platforms and libraries that are needed to build such workflows. The orchestration barrier is of course something that impacts our ability to deliver machine learning features and solutions with quality, and I believe its effect is that the time to market for such features is often too long and, most of the time, unpredictable. So this is where we paused and of course looked around for an existing solution. We reviewed both open source and commercial platforms and solutions but could not find anything that fits our architecture and vision. I won't deep-dive into the details of the comparison as time does not permit it; I intend to write a blog post on the subject in the near future. Anyhow, I hope that through the characteristics of Liminal and the theme of it, it will become apparent why there isn't such a solution available. So let's see what Liminal is about. Our one-liner is very simple: let data scientists focus on data science, from research to production. How do you do that?
So first and foremost, we have our domain-specific language to express orchestration and provisioning concerns. It abstracts away the complexity of the underlying technology, providing high-level abstractions for configuration and tuning. In many ways, Liminal enables data scientists to define their machine learning systems using configuration, from feature engineering to production monitoring, and the framework takes care of the infra behind the scenes. The DSL itself is of course extensible, in Python, through our plugin architecture. It is minimalistic by nature; as I said, it is a DSL which generates the target artifacts for the technologies that do the actual work, so it can be adapted to any stack that exists in the organization. All you need to do is plug it into Liminal and get the orchestration capabilities. The solution is of course as scalable as the technologies you have chosen beneath it, so if you chose wisely, Liminal won't add to the complexity or limit the scalability. And lastly, it is open source. Both myself and Aviem are big advocates of the open source community; we've been working with it for many, many years, and it made perfect sense to us that if we developed something that is not the core intellectual property of our company (at Natural Intelligence what we do is comparison websites, not machine learning platforms), it would be perfectly logical to contribute it to the community. I must say that as a CTO I see great value in more companies joining the effort and making the platform evolve much faster. So I really call on both individuals and companies to join us in this very cool effort, to make machine learning even more accessible and make data scientists even more independent in their daily work. So now we will deep-dive into the details of the Liminal way. Aviem, it's all yours.

– Thank you, Lior. What I would like to do now is take you on a deep dive into how Liminal works. So first, let's discuss the problem that we're trying to solve. As you can see in this immense graph, the problem of getting code from data scientists deployed as production-grade applications, with online inference via a serving layer, and integrating with all of the underlying platforms and infrastructures, is a massive feat to put on a data scientist. We want to take this problem away from them and give them the easiest path from science code to production applications, with minimal effort from any engineers or ops in the organization, as far as we can manage, making this pluggable into their existing architecture. So that's a big problem to solve; it's very complex. How do we intend to deliver on that? So at the center of any software development there is the software development life cycle, which repeats. Usually we start with some application developer, in our case the data scientist, who works in their IDE, in our case coding some algorithms. From there, they usually push their code to some source control, and usually from there we get some sort of CI/CD pipeline which will compile our code and deploy it to a runtime, and the runtimes can vary as well. And from the runtime where our applications are actually running, we send out logs and metrics, so we can create monitoring and alerting on those to deliver feedback to the application developer. That closes the cycle in which they develop, deploy and get feedback so they can improve the application. So again, this is a lot of things that we don't want a data scientist to have to know how to do themselves; we want to abstract these away. So how do we want the data scientists to interact with the platform in order to achieve the software development life cycle that we desire?
So all that we ask of a developer of a Liminal app is to add one file to the repository, named liminal.yml, which defines the application in Liminal's configuration language, which we will now see. This will abstract away all the problems that we referenced before, and the only thing they have to get over is configuring their application in a very simple manner, which we will now deep-dive into. So what does this YAML file look like? You have a name for the application and an owner. Then we have a section of what we call services: services are applications that are constantly running, receiving requests from the outside for some value and serving it back to whoever requested it. So in our case we want a real-time inference service which serves our trained models. Let's imagine this application is training a model which can predict whether a given flower, based on its characteristics, is of the species Iris virginica. We can do this with the open Iris dataset, train a cool logistic regression model on it, and have all the orchestration of training, deploying and serving abstracted away from us, so we can focus solely on developing our code. So after we define a service, we want to define endpoints so users can access the service. Usually we do that via the HTTP protocol, a web call, where we navigate to some address within the server; so all the user needs to do is define the path they want and which function in which module of their code they want to invoke. And that's it, that's all they need to know about developing services in order to achieve that. The next section that we have in the YAML file describes pipelines. Pipelines, unlike services, are applications that aren't constantly running but are usually invoked on some schedule. So in our example, we want this model to be trained on a schedule, let's say on a daily basis, if we have a lot of new data coming in that we want to train on, and so forth.
So I want to define a pipeline of the actions that will take place on a schedule and run my code, again with the most minimal knowledge needed. In our YAML file we define a pipeline as a set of tasks which we want to run sequentially on a schedule, depending on one another, so if one task fails you don't proceed to the next task. They can achieve this by defining a pipeline as you see here on the screen; they defined two tasks here. The first task runs the training code. This will invoke the module and function they asked to invoke; in this case they simply pass a Python command, so they get the most flexibility they can have, and this will train the model. The next task defined here is called validate. This will validate the model we just trained, and if it passes validation, we'll deploy it to production, where the serving service that we have constantly running will automatically pick up that deployed model and be able to serve the newest model. That's the whole thing; that's the minimal effort we want users to put in to achieve production-grade applications with monitoring and alerting and everything else you expect from a modern application. So let's look over the rest of the code. This will be non-Liminal code, but user code, in our example. Here I created a rudimentary model store using Amazon S3. We don't really need to get into the code here, but what I want you to take away is that we can upload models and save them, and we can download the latest model, so get the latest model that exists; it's cached here and refreshed every so often. Obviously it's not an optimal implementation, but for the purposes of our demonstration this is the model store that we have. Next we have our requirements file; as in every Python project, they define the requirements, and they add liminal so they have access to the Liminal command line interface.
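A liminal.yml for the Iris example described above might look roughly like this. This is a hedged reconstruction from the talk, not a copy of Liminal's documented schema, so the exact keys and the module/file names are assumptions:

```yaml
name: IrisClassification
owner: data_science_team

services:
  - service: iris-serving            # constantly-running inference service
    endpoints:
      - endpoint: /predict           # HTTP path users will call
        module: serving              # user code module (assumed name)
        function: predict            # function invoked per request

pipelines:
  - pipeline: iris_training
    schedule: "0 0 * * *"            # run daily
    tasks:
      - task: train
        type: python
        cmd: python -u training.py train      # trains a candidate model
      - task: validate                        # runs only if train succeeds
        type: python
        cmd: python -u training.py validate   # promotes candidate to production
```

Note how the two tasks mirror the sequential, fail-fast pipeline described in the talk: validate never runs if train fails.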
Next we have all the code that we require them to write for serving. Again, we want minimal to no intrusion on the user code, and as you can see, this whole thing is user code and does not import anything that a data scientist wouldn't already be working with. What they do here is call the model store to retrieve the latest model and, based on an input, return the output of the model. The most simple method ever: they receive an input in JSON and return a result as a string. And that's all that we require from them. If you recall, in the YAML file this is the function that we were referencing in our endpoint in the service, so when people call the predict endpoint on our service, it will run this function. As simple as that; they don't need to know anything, just define a function that receives a JSON and returns a string response. In the next file we have the training code. Here it's pretty much boilerplate: a logistic regression model using the Iris dataset from scikit-learn. In the training function we train a model and save it to a candidate model store, so as not to affect production. In the next function we see here, validate, which is the one that we invoke in the second task in our pipeline, we validate that the latest candidate model is valid. In this example there isn't much actual validation; we just check that the model can be read and used without crashing, but you can expect real applications to have business validations of a model before deployment. Once all validations have succeeded, we proceed to save the model to the production model store, and the next time the serving application reaches the time-to-live on its cached model, it will check again for the latest model and receive this new model that was trained. That's how they talk to each other.
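The user code described above might look roughly like the following. This is a minimal sketch, not Liminal's actual example: the in-memory dictionary stands in for the S3-backed model store from the talk, and the function names and JSON field names are assumptions for illustration:

```python
import json
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Stand-in for the S3-backed model store: candidate models are kept
# separate from production so an unvalidated model never goes live.
_store = {"candidate": None, "production": None}

def train():
    # Boilerplate logistic regression on the Iris dataset; the target is
    # 1 if the sample is Iris virginica (class 2), else 0.
    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, (y == 2).astype(int))
    _store["candidate"] = model          # save as candidate only

def validate():
    # Minimal check, as in the talk: the candidate can be read and used
    # without crashing. Real pipelines would add business validations.
    model = _store["candidate"]
    model.predict([[5.9, 3.0, 5.1, 1.8]])
    _store["production"] = model         # promote to production

def predict(input_json: str) -> str:
    # Endpoint function referenced in the YAML: JSON in, string out.
    f = json.loads(input_json)
    model = _store["production"]
    prob = model.predict_proba([[f["sepal_length"], f["sepal_width"],
                                 f["petal_length"], f["petal_width"]]])[0][1]
    return str(prob)
```

The serving layer would call `predict` per request, while the pipeline tasks call `train` and `validate` on schedule; production only ever sees a model that passed validation.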
Again, you've seen the code, and in the user code there's nothing involving Liminal, nothing other than boilerplate Python code that runs scikit-learn. So let's see a demo of running this application and what that looks like. The user can interact with Liminal, if not doing so through some CI pipeline as we mentioned before, on their local machine through the Liminal command line interface. The first command we want to run is liminal build; liminal build will create the Docker image or Docker images defined in the YAML file from the user's source code. So again, they don't need to know anything about Docker; they just reference the location of the source within the repository, and we create a Docker image for them out of it. Then we have liminal deploy, which allows us to deploy our YAMLs to a Liminal runtime; the currently available one uses Apache Airflow, another project in the Apache organization that we believe in and work with. Once we deploy our YAML there, our pipelines are ready to run, and we can run liminal start to start the Liminal server, in this case an Apache Airflow server. So let's see what the result of these commands looks like. Here's a pipeline running on Airflow, with steps that match the tasks defined in the YAML that we saw. So again, with pretty much a click of a button and the definition of a YAML, the user now has a deployed, scheduled pipeline with monitoring, running on Airflow in this case. But again, Liminal is built to be extensible and pluggable into different runtimes. So that's what the pipeline run looks like. What will the server look like? Let's have a look.
So here's our request to the server that we started with the image resulting from liminal build. As you can see, we sent a request to the predict endpoint as defined in the YAML and passed it a JSON body, in this case input about a flower describing its petal width, and the response from the service is the probability of it being an Iris virginica, which is what our model predicts. In this case 0.8-something, so an 80-something percent chance that it is an Iris virginica. Let's take a look at the logs of the server at this point. As you can see here, a Flask server was started and a request was made, which prompted the prints from our code to print to the screen, to the log: the input that was received, the cached model that it retrieved from the model store, and the result that it returned, which matches what we saw in the request. And that's it; that's all that we require for data scientists to have production-grade applications that are running, monitored and alerted on in production, fitting their organization's existing stack with as minimal effort as can be from their organization. That is our goal. So what is next for the project? One of the things we intend to tackle first is CI integrations; as we mentioned, we want Liminal to be able to run in different organizations with different stacks, and we want to supply CI code which allows people to introduce Liminal into their existing stack with as little intrusion as we can. The next thing on the list here is a user interface. While what we described up until now is very simple, we want it to be simpler. We want data scientists to register these sorts of applications via a user interface, so they won't even need to know YAML; the YAML language requirement is another barrier we don't want to have.
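The demo flow described above can be sketched as shell commands. The commands mirror the talk; the host, port and JSON field name in the request are assumptions for illustration, not taken from the actual demo:

```shell
# Build the Docker image(s) from the user code referenced in liminal.yml
liminal build

# Deploy the YAML to the Liminal runtime (Apache Airflow in this talk)
liminal deploy

# Start the Liminal server (an Airflow server in this case)
liminal start

# Query the serving endpoint defined in the YAML (path and payload assumed)
curl -X POST http://localhost:8080/predict \
     -H "Content-Type: application/json" \
     -d '{"petal_width": 1.8}'
# The service responds with the predicted probability, e.g. 0.8-something
```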
So we want a friendly wizard that guides them through drop-downs where they define the application and the code base that they want, possibly representing what data exists in the company; the sky is the limit. We also want to integrate with existing ML frameworks that are being used in the industry today, so Kubeflow, MLflow, different feature stores and so forth, to make them easy for users to engage with. Experiment tracking, to provide users with experiment tracking information; this can of course be related to the user interface path that we want to go down. Cloud support: we want to natively support running Liminal on all clouds, be it GCP, AWS or Azure. And we want to introduce a model store as part of the platform, so we want to, again, integrate with existing solutions, for example MLflow's model store, and allow users access to that in a simple manner, again via an application definition with no code required. At the center of it all is the open source community. While we're dreaming big here, there are a lot of things that we want to achieve, and we cannot achieve them alone. We can achieve some of them, but what we want is to achieve, together, something great that we can all use. We are great believers in Apache and the Apache Way and the many-eyes principle, and believe that if we can get contributors from different companies and different countries, then we can have different ideas and different contributions that can help make this project great. So what we would like is to invite you to join the effort. You can join us on our website at liminal.apache.org and join our mailing lists. You can see the project on GitHub, see the issues tracked in Apache's JIRA, and try to tackle them yourself. We are looking for contributions; again, we want to lower barriers on contribution as well and be able to accept contributions from you in the smoothest way possible.
So together we believe we can make great things happen but we need your help so come join the effort. We are open on the mailing lists. Talk to us.

About Lior Schachter

Natural Intelligence

Specializing in Big Data and Machine Learning, Lior Schachter has been building large-scale systems for over a decade. Lior is the CTO of Natural Intelligence, a global leader in multi-vertical online comparison marketplaces, as well as the co-creator and PPMC member of Apache Liminal (incubating), an orchestration platform for ML/AI pipelines. He holds a BSc in Electrical Engineering & Computer Science from Tel Aviv University and has authored numerous articles on Adaptive Systems and Domain-Specific Languages.

About Aviem Zur

Natural Intelligence

Data tech lead @ Natural Intelligence, PPMC Member, Apache Liminal, PMC Member, Apache Beam. Specializing in data frameworks and platforms as well as open source software. Passionate about quality engineering, open source and Magic: The Gathering.