What do you do after your Spark workflow development is finished? Nowadays, data engineers dread putting their workflows in production. Apache Airflow provides the necessary scheduling primitives but writing the glue scripts, handling Airflow operators and workflow dependencies sucks! Low-Code Development & Deployment can make scheduling Spark workflows much simpler – we’ll show you how.
Additionally, we show column-level lineage for governance such as tracing your PII data, or for production issues such as understanding the downstream impact of changes to a workflow.
Raj Bain: Hello everybody and welcome to our talk today. We’ll be talking about Low-Code Apache Airflow. We talked about Low-Code Apache Spark yesterday. Let us first start with our introductions. I’m Raj, I’m the founder of Prophecy and before this, I worked in compilers at Microsoft. I was one of the authors of CUDA at NVIDIA and then I worked in databases, first as an engineer and then I was the product manager of Apache Hive Hortonworks. So a lot of data engineering and now, Prophecy is focused on making data engineering very easy for everybody.
Also, with me is the co-founder Maciej and he’s an expert in Spark data engineering, expert in Spark machine learning. And he’ll be doing the demo today. So as we look at Low-Code Apache Airflow, it’s all about making air flow, super easy to use. Right? So this is something that’s a new feature, we are presenting it for the first time.
So yesterday we talked about how data engineering is just too hard and too slow. How in development standardization is hard, enabling a lot more users and making them productive is super hard. Today we’ll be focusing on metadata. And for most of the people that we work with, metadata search and column level lineage is not something that they have on spot. So what are the used cases? Right, so typically if you’re in data engineering, let’s say you have a governance need and you have social security data in one data set. And perhaps you have 10,000 workflows that are moving into many other data sets. So now you want to track like, this column, which other data sets they did it end up in and column level lineage helps you do that.
Or perhaps you are changing a particular column. You’re saying I’m removing a column from a data set, but you want to know, is there other teams, other projects that are downstream, is one of them going to get affected. So you want to do impact analysis or perhaps you’re the data analyst and you’re looking at a value and you’re saying, hey, is this… How was this value really computed? You know what the definition is, but you want to know it was computed correctly before you make a decision on it. So for all of them, these use cases, column level lineage is super critical. It’s something that most of the people we work with want and don’t have. So which, we will give you, we will show you how we do it and give you a demo of this. The second thing we’ll dive into is Scheduling.
So now Scheduling Apache Airflow is good. It gives you a lot of basic parameters for scheduling, but scheduling on top of that is still very hard. And as an example, we worked with some fortune 50 companies. Some of them are like, okay, I am moving 20,000 workflows to Databricks. How do you want me to schedule them? It’s like, should I write 20,000 schedules manually? So do you run into these problems of how do you write that many schedules? How do you deliver them, make sure each one has the latest version of the workflow? If something fails, how do you recover? How do you monitor them? And perhaps there is cost optimization, right? Because if you spin up one cluster of workflow, soon you’ll be just spending all your time, spinning up and spinning down clusters. So we will give you a demo of local development of a budget Airflow and tell you about some new things that are coming right after.
So as we mentioned before, if you have data engineer… If you are a data engineer, you have use cases of impact analysis, understanding, or you’re a data analyst you want to understand the value. If you are metadata architect, you want to have PI information and so on. And these goals require a few features. So the first is automated lineage computation. What that means is that if you have your code, can the lineage be automatically computed from there and can it be stored? So it’s a place where, if you have a metadata system that is separate and says hey, you have to use these API calls to add lineage, you’re never going to be up to speed. So automated computation is super important. The second thing is, it needs to be multi-level. So you might say, I want to look across data sets and workflows across projects.
And then once you are [inaudible] the workflow you might want to dive in and the workflow can be very big. So you want to pinpoint exactly where the change was made. And finally, you want to be able to search, right? You want to be able to search by columns and say, where all is this column used. You might even have a user defined function. And you’re like, where all is it used in all my workflows and data sets, right? So you can search by authors, by columns, by even user functions and expressions. So we will give this as a second demo. Now we’ll move on to Low-Code Scheduling. So Low-Code Scheduling is basically where you can use visual drag and drop to build airflow workflows very, very quickly. So first focus is, it has to give you rapid development. That means you can do quickly… You can do drag and drop and enter only the essential values. Everything else is automated.
Number two, what about the deployment? Right? So let’s say you’ve deployed into a test environment, into a production environment and a workflow changes. Now you have to rebuild the jobs, redeploy the jobs, right? So there has to be automated deployment, but it also has to be automated, where all new redeployments can happen without any manual intervention. Finally, once you have done… Now you’re able to develop, now you’re able to deploy. Finally, you have to monitor. And when you have to monitor, you want to be able to browse all your deployments, all your runs and see what’s working, what’s not what is delayed. And perhaps later, [inaudible] of the parts are the ones that are taking most resources.
All right. And now just before the scheduler demo, I want to give a quick summary of how Prophecy works with your infrastructure. You have your [inaudible], you have your shell, you have your Spark, your airflow, so you have your infrastructure which you’re using currently. Prophecy adds a data engineering system on top. Of course, you’re getting the local Spark development, but the code is, and the test, the lineage, everything is going on [inaudible]. Now we talked about this yesterday. We showed you a demo. Today we are focusing on visually airflow editor, right? So you have a local airflow editor where you can similarly use drag and drop to create airflow schedules. The code that you’re creating ends up on [inaudible] in completely open source format. So it’s a 100 percent open source code, it’s in Python. It’s exactly as you would hand write it.
The other thing we provide is you can just connect to an airflow cluster and interactively run and see the airflow schedule running live. You can see the logs as it moves from step to step, something you cannot do with airflow without Prophecy. So we’ll show you a demo of that. And finally, we’ll show you a demo of metadata search and lineage. How you can search across all the columns, the data sets and also we’ll show you lineage. How you can track values across your entire data engineering system. Now, the one thing that we also said… Showed you yesterday is GEM builder, which means that for Spark workflows, you can use the standard library, which is what we provide as the built-in gems, but then you can build your own. This is something that is available for Spark today, but will be available over the… In the next two, three months at, toward the end of summer on airflow as well.
And of course we, for Spark, we support both Skyline and Python. For airflow, it’ll just be Python. All right. And now if you don’t want to worry about any of these details and you just want to use a simple Low-Code system, you can use that and not worry about any of these details and that makes us more productive. With that, let’s move on to the demo. In the demo we’ll talk about schedule development, Low-Code Airflow development, where you’ll use drag and drop to build airflow workflows and then see how you can run them interactively and deploy them to airflow for production. Also, we will do demo of search and column level lineage at the end.
Maciej Szpakows…: Hi everyone. Thanks for coming. Yesterday we have shown you how to build a complete end to end data engineering pipeline directly in Prophecy in under five minutes. Today what we are going to do is take that pipeline and schedule it using Prophecy’s Low-Code Airflow Scheduler. So the first thing that you’re going to see when logging into Prophecy is of course your projects and the metadata screen, the center of your data engineering. Right here we have the workflow that we have created yesterday. And so let’s just go ahead and create a new schedule.
Speaker 3: Okay.
Maciej Szpakows…: Let’s create our schedule. Wonderful. So now what you’re seeing here is the empty white canvas on which we can start drawing our operators and the gems that we’ll do with the actual scheduling. So we can get started here. So what we would like to do is to essentially, around our workflow whenever new data shows up on the S3 bucket and then do some cleaning steps after that. If I go to the S3 bucket that I have prepared previously, essentially have our source data directed right here, and the target data there actually. Both of them are empty right now. So let’s get started by dragging our sensor down. This will be essentially continuously checking our S3 bucket for whether the new data has appeared there. Let’s just choose the S3 location based in some connection details and make sure that our sensor keeps continuing to see checking that S3 bucket as frequently as possible.
Okay. Now after that [inaudible] sensor has finished, let’s make sure to run the other workflow so I can just drag and drop our workflow, connected it together and the only thing that’s really… We have to specify within the workflow is just choose the workflow name. Now this worker is going to run an EMR or the data bricks of course supportive as well. If you don’t need to provide any configurations to the workflow, you can do it directly here and you can use all the predefined variables that you could use normally [inaudible].
Perfect, after running the Spark workflow, what I would like to do is do some cleanup steps to make sure our S3 bucket is empty. So let’s drag and drop our script component, which will just execute a simple Python script that will clean up the S3 bucket. This is the snippet that I’m going to use. It’s just going to open up that directory filter for all the files and delete all of it.
[inaudible], Now after all of that has been done, what I would like to receive is an email notification telling me that the workflow has succeeded. So let’s just drag and drop our email gem that’s going to be our success notification. But also in case of failure, if the workflow has failed, I’ll receive a notification of that happening as well. So let’s just drag and drop another gem, specify my email and that’s going to be our failure email. Wonderful.
So now that we are done with our data [inaudible], our schedule that we have just done in like three minutes, we can actually go ahead and run it to make sure it’s going to run correctly. Now Prophecy offers two different modes of running a schedule. One of them being the deploy, which is actually going to build a job around all the [inaudible] steps, push it to the correct release branch, and then make sure that your airflow schedule keeps running continuously at your specified configuration.
The other option, however that we have just went with now is the interactive run, which will allow you to quickly check is your [inaudible] is doing the right thing by just running it one time on airflow. So I can see directly all the steps running right here. So the very first step, which is my build of the workflow is happening right now, where the jar is being built. And I can already see that has finished successfully. Now my DOUG file is being created.
If I do want to specify any configurations in terms of when I want to start my workflow, how often should it be scheduled? Of course, I can do that directly from within here as well. Okay. Now, however, remember that Prophecy is not just a visual drag and drop tool. Code is behind everything that we do and also be between the schedule, of course.
So I can just speak on the code dub. And now I can see here, my actual Doug definition with all the functions for each operator, being neatly defined here for my run Spark and failure and all of those and the actual connections. I can see the graph components here as well. So I can see my success notification failure, the sensor, and all of that stuff has been generated for us here without any work from me. Okay. Now I can see that my sensor has already started and I can just hover over it to see the logs and they can see it’s already poking for my S3 direction. So let’s actually go ahead, open up that source data of the directory and real time let’s drag some data sets there.
Well that… Now going back to our schedule, what we’re seeing here is that the sensor will keep poking until it finds the data. Let’s just wait for that to finish. And just like that we can see that the sensor has finished. So now it’s picked and I would work for [inaudible] started the computation. So now the actual [inaudible] hasn’t been submitted to our cluster and we can see the status of that and all the logs related to it. And it’s about to actually get…
The job is about to actually get stuck. Now, in the meantime, while this is happening, it’s worked to remember that, of course, not only you can see that code here, but also on your direct [inaudible] that we have configured yesterday. So all of your schedules are visibly me directly here. You can go to the code, preview that code here, and of course edit it as well.
Apart from that for production jobs, what Prophecy provides is the [inaudible] monitoring screen. So directly from my data and metadata, I can just go to my deployment management screen. This allows me to see all of the schedules that has been created in the past and are actively running. So you can see that I have actually two schedules here that I have created before that are running some workflows daily and all of them are so far doing so good. And they’ve been running three times already. Of course, if I want to, I can disable them or I can dive into the airflow interface there. I can also see the [inaudible] history for all of those schedules and actually dive into the logs for any one of the tasks. So for instance, this schedule is running some workflow. I can very quickly inspect the logs from that.
Prophecy however does not introduce any module, right? So behind all of this stuff, there is pure airflow sitting in the background and I can inspect all the same dougs, logs and information directly there. Even the one that is running interactively right now, I can see directly here. So this is the preview for our doug that we had just needed. Right? You can see that our [inaudible] data already is finished. It’s running the workflow and is waiting for it to finish. And if I want to, I can, of course the code, which we have neatly inline here as well. So you can see the whole code here. And then of course, for any one of the doug tasks, you can just open it up and directly see the logs for it as well.
So that allows you essentially to do your data engineering as you were doing normally in airflow, if you are using it, you can just start using now Prophecy and become much more productive and actually building those dots. Now our airflow job here is about to get finished so we can see that already the workflow keeps running [inaudible]. And of course, if we want, we can inspect the logs better for that, right? So who wants to look at the EMR logs directly from airflow, we can do that directly as well. And there we go.
So our job has finished and successfully completed. Our component has been already [inaudible] marked. Now our cleaning up step has started processing. So if we open our S3 market again, we can see that our target data already contains our results that [inaudible] has the reports that we’ve been building yesterday. And our [inaudible] data has disappeared, essentially means that it has been cleaned up.
Okay. And the last step has finished as well. So now I should have received also a notification. So let’s just go back to my Gmail, let’s refresh it. And there we go, so we have our successfully competed step. Awesome. So this was essentially complete building pipeline in airflow that anyone can just pick up and start using and put your actual workflows in production.
Now, the question is what happens later, right? What happens once we have many different workflows that start operating between each other, producing data, reading, and writing and all that stuff, right? And we’d like to understand how the data flows. What Prophecy provides you is the meaning edge view. So we can just go in [inaudible]. I have here prepared a special Spark summit demo project, where I have two workflows here that are preparing some data and generating our report. And they can also see all the datasets that are related to those work flows within this project.
What I can do now is click on this icon right here to take me to the particular dataset, [inaudible] screen. So I can see that this customer’s data set is used by one customer amount’s workflow that is also using the orders’ data set and is creating two other data sets, [inaudible] customers and report. I can of course switch freely between those data sets to see how they are being created and how they are creating other downstream datasets.
Now, what I can do is also first choose any column out of those datasets in [inaudible 00:19:21] [inaudible 00:19:22] and see how are these columns being modified within this flow. So for instance, I can see that this customer amounts is modifying this last name column that they have selected. I can do a deep dive into that workflow and see that there is an aggregate component that in this case is just doing some [inaudible 00:19:39] on this last [inaudible 00:19:40] which is a person highlighted for you directly in the folder as well.
Now, thanks to the quick browse I can just quickly go back to the previous thing I was on and let’s actually try to look around for some examples. Right? So let’s say that I have this customer’s [inaudible] column, which is my private [inaudible] value information that I would not want to expose to my final report. I can just quickly select that column and I can see it being used within this dataset and flow to the customers as the same data set, but it does not flow to the report sets.
That’s great. It means that it’s not used by the data set. I don’t have to worry about that information. Another thing that I would for instance, want to do is inspect how my amounts are being computed. So I can just select my amount column in the report and I can immediately see that on the upstream, there is a sum that is computing that amount value. I can actually go in, open the workflow, open the aggregate component and see that the code highlighted for me right here, that is summing up the amounts.
So thanks to the professor [inaudible], not only you can now be much more productive when actually building your pipelines and understanding how they work, but now you can also figure out where all of your data and columns are flowing, whether all the [inaudible] values are secure. So with that, over to you Raj.
Raj Bain: Thank you so much for the demo Maciej. All right. So we saw how easy it is to do entire Low-Code data engineering with Prophecy, right? You saw local development of Spark workflows. Now you’ve seen local development of airflow workflows, and also for metadata. Also, you can see that all of them work well together, right? So with that, we want to talk about one more thing, which is a roadmap item, attentively we’re calling it Scheduler Gen. And this is something where we’ve had a lot of… Like we said, we work with fortune 50 companies.
And the problem that we run into is this right? They have 10,000, 20,000 workflows closed. They come to us and say, hey, do you want me to write all these workflows manually? And you already know what data sets each of the workflow is reading, but it’s one of [inaudible] is writing and so on. Why do I need to tell this information again? And what about cost management, et cetera? So what we are working on is that, of course, you know, from the workflow… Since we have lineage, we can extract the plan. But the plan that you see in the middle has data sets, which is being read by a workflow and that writes a data set the next workflow, et cetera, and so on. So you can see all your workflows in one place. Now, once you do it, right? You cannot spin up one new cluster [inaudible] workflow.
You know what? You’re going to spend a lot of money just spinning up and spinning down clusters. So now what we can do is we can be very smart and segment them and say that… And here you can see the purple ones, the green ones that we have created three different segments. And in one segment, all the workflows are going to run on a single cluster. When creating these segments, we are going to consider the size of the cluster beyond the amount of time taken and making sure that we meet any [inaudible] [inaudible] that we need to meet.
Right? So now once you do that, once you’ve done these segments, from these segments, we can generate the upload schedules. Which says you cannot run segment one run segment two. Of course, if something and segment one or segment two fails, you have to be able to restart from the middle. So there’s complexity there. Of course, you also need to have all the different workflows [inaudible] and do a single binary, to be able to run on a single job cluster.
So this is something that we’re actively working on, if you want to… We’re looking for design partners, we have one, we’re looking for some more. So if you want to try this and make sure that this works for you, come talk to us, we’ll work with you and make sure that we can… Over the next three to five months build this with you and make sure it works for you.
So with that as a summary, again, Prophecy provides local data engineering, which makes it super easy. Each one of the pieces works very well with each other. It makes the users very productive on Spark, on Delta and also on Airflow. And the code that you generate is standardized, and standardized on the standards you want, not just what we provide and you can be a [inaudible 00:24:01] get [inaudible 00:24:02] and all the modern software engineering practices. And finally as I said, the complete solution… It means that you are not stitching together your own solution from the parts, but get a full product. With that, we are really happy to take questions now.
Raj Bains is the Founder, CEO of Prophecy.io - focused on Data Engineering on Spark. Previously, Raj was the product manager for Apache Hive at Hortonworks and knows ETL well. He has extensive experti...
Maciej Szpakowski is the Co-Founder of Prophecy - Low code Data Engineering on Apache Spark & Airflow. He’s focused on building Prophecy’s unique Code = Visual IDE on Spark. Previously, he founded...