Your Tools Suck!! Re-Imagining Apache Spark Development

Enterprises used ETL tools for decades for higher productivity and standardization. Data Engineers see these tools don’t work anymore and have moved to code. However, we’re back again to ad hoc scripts and frameworks, reminding us of the world before ETL tools. We show how a new generation of tools can be built for Spark development based on code – for productivity, code standardization, metadata and lineage, and for agility via CI/CD.


Video Transcript

– Hi everyone. I’m Raj Bains, founder of Prophecy. And I’m super excited to be here with you today at this virtual summit. We’re super excited because this way, you know, with the virtual summit, we get to talk to so many more people. So super excited about that. Now, before we begin, I wanted to talk a little bit about my background, right? I have been in the industry for about 20 years. I started at Microsoft, right, working on the Visual Studio team, mostly compilers, known as a great tool. Then I worked at NVIDIA. I was amongst the first engineers who built the CUDA stack and super proud of that. Worked on a NewSQL database as an engineer. And my last job was managing Apache Hive at Hortonworks. Now, when I was doing that, you know, I worked with a lot of customers who were working on data engineering and trying to be productive and trying to be agile with that. And it was really hard. Right, what I realized is, no matter what they were doing, a commercial tool or an in-house framework they had built, their tools really sucked. And we looked at that and said, “Hey, it’s 2020. You’ve got to be able to do better than this, right.” So we are working on reinventing Spark development and we wanted to share, you know, how we can do that.

Data Engineering is driven by Code

Now, the first thing about data engineering and how it’s different from ETL is that data engineering is all about code. Right, so code is at the center of everything, right, you write your workflows, you have your schedules, you know, and maybe CI/CD, but then there are a bunch of other things around in tooling that you need that aren’t quite there. Right? So now the question becomes, how does somebody become successful with data engineering?

Why ETL tools suck!

So one choice is, of course, you know, you can use the ETL tools, right. So when people originally started, they started with scripts, right? Then ETL tools came along, organized it, made life much better, and now everybody’s abandoning them again. And the question is why? Right, so in our view, the problem here is this proprietary, intermediate format that they introduce. So when I’m writing a workflow, I am writing it in the format of one of the ETL tools. Right? And nobody’s really excited about it, right? Let’s say I’m a new college grad, I come, I’m excited, I want to come learn Spark. You are here at the Spark Summit, like 35,000 attendees, right? You want to know more about Spark, become an expert in that, not in somebody’s tool. So, you know, who wants to learn that? The second thing is it’s really bad for lock-in, right? So let’s say I’m a large enterprise. You know, large enterprises have thousands to tens of thousands of workflows. Now, if you wrote all these workflows in the format of one proprietary tool, you are stuck. Right, of course, you know, you will move from generation to generation a little bit, but they’re not going to perform very well. And finally, you know, they’re gonna say, “Hey, we can actually generate Spark code.” Like okay, so maybe you use, you know, just random examples. Maybe there’s Informatica, maybe you use Talend, and they’re going to say, “Hey, we generate Spark code too.”

And what we’d say is, have you looked at their code? It sucks. Right? It doesn’t perform. It’s not readable. And, you know, if you don’t like the code, tough. You know, you can’t change it once you have it. So we looked at all of that and we said, “All right, we can do so so much better.” This is 2020. You know, Tesla has made electric cars, SpaceX has just sent people up to the moon just last week, not up to the moon, to the International Space Station. Sorry. You know, and we’ve got to do better than these tools.

Prophecy Goals

So, what we at Prophecy are doing is we’re saying, “Okay, let’s set some standards. In 2020, what does a good tool look like?” So first, since data engineering is all about code, you’ve got to be able to author high-quality code. What does high-quality code mean? The code is standardized. The code is performant, right? And it is maintainable because it’s going to live for decades in your (mumbles) And then it shouldn’t matter whether you do visual drag and drop, you know, a lot of people prefer that, or if you want to code, right? Either way, you should be producing high-quality code. The second thing which we think requires a lot of improvement is how much time does it take for you to develop a workflow? This is going from, I’m going to start writing a workflow to I have a workflow that is very developed, well tested, right? And that loop can be really long with data, with big data systems. And we think it can be made much, much better.

And finally, you come to deployment, right? This was the promise of data engineering, right? Software engineering has moved to agility. It has moved to continuous integration. It has moved to continuous deployment and, you know, data engineering was supposed to do the same for ETL. And the question is, can you deploy new workflows like five times a day? Google can do that. A few other Bay Area companies can do that. Even most of the tech firms don’t do that, right? And like I said, we just sent some people up to the International Space Station, why can’t we do CI/CD? You know, we’ve got to be able to build this.

How to build Data Engineering tools

So with that, let’s talk about how we can build data engineering tools. So I’m coming from a background of compilers, you know, having worked on them for a decade and, you know, compilers can do a lot. First, you can get rid of that terrible intermediate format. So, actually with coronavirus now that we are stuck at home, you know, we go out for drives. It’s very safe, right? We go out for a drive on Highway One, got a nice beach on the side, gives us a break, and we listen to podcasts. Now I was listening to this podcast, Business Wars. And in that they were talking about the vacuum cleaner industry of all things and how there used to be these vacuum cleaners and they all had bags. And then Dyson came along and said, “Hey, these bags are the problem. Like you’ve got these bags, you’ve got to, you know, once the bag is full, you’ve got to empty it, reuse it. It’s a messy process. Then you’ve got to get new bags, replace them.” And they said, “You know what, if you just take the bags out, the vacuum cleaners are so much better.” So it’s funny, I was listening to that and I’m like, that sounds so much like the ETL products. Now if you just remove this terrible intermediate format, you can actually build something really good. So what we can do is, you can have your Spark code at the center, and then you can get the same development experience that you would get out of an ETL tool with visual drag and drop, with code. All of that, just with some compiler magic, and you don’t need to write to somebody’s format or get locked in. The other thing you can do is, you know, there’s lineage and metadata, which also has taken a big hit as we’ve moved to big data systems, right? Most people, even the Bay Area companies, don’t have column-level lineage, right. And you can compute all of that using parsing and compiling on the basis of your Spark code. So, first thing is let’s build everything on top of the Spark code.

Re-inventing the IDE

Now, what does a new IDE look like, which you have built using this? So first, if you are a person who prefers visual ETL, a lot of ETL developers, you know, have been doing that for a decade and, you know, are quite productive at it. And they have a lot of the knowledge of the data in the organization, right? Either it’s them or it is, you know, somebody in machine learning who wants to do data prep or just a data analyst, right. They can very quickly develop an ETL workflow using visual drag and drop. So you put a few source nodes in the beginning. Then you put a few transformations in the middle, a few target nodes on the right. And there you have a workflow. Also, you can just press a button, connect to a cluster and just say, play and run through it. And I’ll show it in a second. Now, that still looks a lot like the visual ETL tool, but where the magic happens is now you can toggle instantaneously between that and your code, right. And I’ll show that, you’re going to love it. So now once you go to the code, you can actually see that while you were doing the visual drag and drop, you were actually authoring really high-quality Spark code. And this Spark code is being written to Git, whether you are doing visual development or you’re doing code development. Either way, you are writing a high-quality, 100% open source Spark code project. And also this allows collaboration between different people in an organization. Now, the other thing which was really good about the ETL tools was there was some standardization of components. What that means is that often, as an ETL or data engineering person, I have to understand workflows written by somebody else in the team.

There’s some churn in the teams. Also, I typically need to understand it right when there is a big problem, right? Something goes wrong in production, I have to jump in, I have to understand the workflows. And if everybody’s code looks very different from each other, how are you going to understand it? Right, so building your workflows out of standard components that you know are high performance is amazing. You can understand code pretty quickly. Now, then you don’t have to be locked to, you know, the standard components given by an ETL provider. What we also figured out is you can just make this extensible. You know, everybody can define their own components. You can say for my team, for my company, these are the standard 20 components. And I want 80% of my workflows to use them. And you know, or 90%, right? And then 10% can be something else, but then you have a very standardized codebase. So with that, let’s move on to metadata.

Metadata & Column Level Lineage

Now, the other thing is, very similarly, you know, your metadata can also be just derived from your Spark code sitting on Git. You can have parsers and compilers that read that code and build column-level lineage.

Now it’s super useful in a couple of scenarios. Right, so one what you can do is, let’s say I have a workflow where a value is wrong in the production system. Now I can pick a particular column and say, “What was the last workflow that wrote it?”

Many times, a workflow writes data sets with 1,100 columns.

Now the last two workflows that wrote this dataset might not even have modified the value. Right, I want the one before that, that actually modified that value. So I can chase it down using lineage. The other thing I can do is, if I have, you know, PII information, right? So this is, let’s say some social security number or an account number in a bank. Now, once it’s in one dataset, I want to quickly be able to say from this dataset, which other datasets did it go into and all of that can be built and computed from code and we’ll show you how.
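The two lineage questions raised here (which workflow last actually modified a column, and which datasets a PII column flows into) can be illustrated with a minimal sketch. It is plain Python over a hand-written edge list, not the product's implementation: the `EDGES` data, the dataset and workflow names, and the one-source-per-column simplification are all assumptions made for illustration. In the talk, such edges are derived by parsing the Spark code.

```python
from collections import defaultdict, deque

# Hypothetical lineage edges: (workflow, source column, target column, modified?).
# Columns are written as "dataset.column". In practice these edges would be
# computed from the code; here they are hard-coded for illustration.
EDGES = [
    ("ingest_customers", "raw.ssn", "customers.ssn", False),
    ("ingest_customers", "raw.last", "customers.last_name", True),
    ("build_report", "customers.last_name", "report.full_name", True),
    ("build_report", "customers.ssn", "report.ssn", False),
    ("export", "report.ssn", "export.ssn", False),
    ("export", "report.full_name", "export.full_name", False),
]


def downstream_datasets(column, edges=EDGES):
    """Return every dataset that a column's value flows into (PII tracking)."""
    children = defaultdict(list)
    for _, src, dst, _ in edges:
        children[src].append(dst)
    seen, queue = set(), deque([column])
    while queue:  # breadth-first walk over downstream columns
        for nxt in children[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return sorted({col.split(".")[0] for col in seen})


def last_modifier(column, edges=EDGES):
    """Walk upstream, skipping pass-through writes, to the workflow that changed the value.

    Simplification: each target column is assumed to have a single source column.
    """
    parents = {dst: (wf, src, modified) for wf, src, dst, modified in edges}
    while column in parents:
        workflow, src, modified = parents[column]
        if modified:
            return workflow
        column = src
    return None  # no workflow in the graph modified this value
```

With this sample data, `last_modifier("export.full_name")` skips the `export` workflow, which only copied the value, and returns `"build_report"`, while `downstream_datasets("raw.ssn")` lists the datasets the social security number reached.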

Now, moving on to the big promise of data engineering, right?


Yes, you know, you can develop productively, but finally it’s about adding value to your business. Now, a good metric for a data engineering team to, for them to evaluate themselves is how quickly are they able to give data back to the business so that they can make analytic, you know, choices on top of that or decisions on top of that? So the first thing is continuous integration, right? So you’ve got to have this where you have workflows, you have unit tests. All of that is going to Spark code on Git. And every time you make a commit, you know, tests are run so you know your code quality is high. The next piece of this is continuous deployment. Continuous deployment means every, you have your workflows, you have your data quality tests, all of those are going as Spark Code on Git, right? And then, you know, every time you make an edit, you want to push something to production. You can do parallel runs, blue/green runs, say okay, this is how the performance compared, this is how the data compared. You know, these two columns were different and this is what the downstream impact is. If you have all of that in one place, it’s super easy to say, okay, push this to production. And then if something does go wrong, you’ve got to be able to roll back. But you know, this is again, not that hard and almost nobody has this today.
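The parallel-run comparison mentioned above ("these two columns were different") can be sketched as follows. This is a minimal illustration in plain Python, assuming both runs produce small keyed result sets held as lists of dicts; the function name `compare_runs` and the report layout are hypothetical, and a real comparison would run on the cluster over the actual datasets.

```python
def compare_runs(current, candidate, key):
    """Compare the outputs of two runs and report which rows and columns differ.

    `current` and `candidate` are lists of dict rows; `key` names the column
    that identifies a row in both outputs.
    """
    cur = {row[key]: row for row in current}
    new = {row[key]: row for row in candidate}
    changed_columns = set()
    for k in cur.keys() & new.keys():  # rows present in both runs
        for col in cur[k].keys() | new[k].keys():
            if cur[k].get(col) != new[k].get(col):
                changed_columns.add(col)
    return {
        "missing_rows": sorted(cur.keys() - new.keys()),
        "extra_rows": sorted(new.keys() - cur.keys()),
        "changed_columns": sorted(changed_columns),
    }


# Example: the candidate version adds a label to first_name and drops row 2.
production = [
    {"id": 1, "first_name": "Ada", "total": 42.5},
    {"id": 2, "first_name": "Alan", "total": 7.0},
]
candidate = [
    {"id": 1, "first_name": "first name: Ada", "total": 42.5},
]
report = compare_runs(production, candidate, key="id")
```

A report like this, run on every commit alongside the unit tests, is one way to decide whether a change is safe to push or should be rolled back.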

And if you’re stuck on Legacy ETL tools…

Now, finally, you know, if you can use compilers to build these IDEs, if you have these compilers that can build metadata, the question is, what else can you do with tooling? Right? And one of the things we are seeing is these companies have thousands to tens of thousands of workflows. And as they want to move to Spark, they are manually rewriting them. Right. And we just look at it and say why? Right, you can have source-to-source compilers. So let’s say you have, you know, more complex would be something that is an Ab Initio piece of code, right? And so, you can even take the custom programming language, parse it, compile it, and write high-performance Spark for it. Or maybe you have some Informatica mappings, you know, which push down some code into Teradata, with SQL code that’s pushed down. You know, you can just, using a compiler, convert it to high-performance Spark code pretty quick. Do you want to know how to do it? You know, come talk to us.

So with that, I’m super excited to move on to the demo and actually show you the product. We’re super proud of what we put here. So what I’m going to show is code standardization. Right, it’s like how code for Spark can be standardized in a way that is very understandable and performant. Then we’re going to show you the IDE, how you can use the visual and code editor and go between the two, how you can do interactive execution and debugging, and finally column-level lineage. So with that, I’m going to share my screen so I can walk you through the demo. All right. So now we are looking at Prophecy. This is running on Azure and running on top of a Databricks cluster or multiple Databricks clusters. So right now we are looking at the metadata screen. This is the workflow screen and on the top right is the plus button where you can create new workflows, data sets, et cetera. So now here, I have some of my recent workflows in my Hello, World! project. So I’m going to edit one of them. Once I press edit, it opens up in my visual editor. So far it looks pretty much like a standard ETL tool, right? I can also execute it, so I can start a new cluster and we can have different sizes of clusters on Databricks or on EMR. Right, or I can connect to an existing cluster. So to save time, let’s try to do that. And this is going to connect to this existing cluster that is hopefully still up. All right, it looks like we are connected. Right, and then I’m going to hit play. And while this happens, let’s look at the workflow, right? So here we have the workflow that has two source nodes, which are reading two data sets. Then there is a join, combining them, a reformat that is doing some simple reformatting of the data, a little bit of aggregation, and then we write it to the target data set, right, and at the bottom is job status. That’s going to show the status as the job is running.
And now what you see is these blue icons up here, right? So what this is, is the data that is flowing through this workflow. So right after the join, I can press this and say, what did my data look like? Right, so here I am able to click on my, now let’s close this. Okay. So this is my data so far, and I can add new nodes from the top, from my toolbar. So this is very much like a regular ETL tool. Now where the magic comes is that you can just press code. And now you are in the code for the same workflow. On the left hand side, you have the main graph and we’ve kept this short. And this is short very intentionally because in the beginning what you want to do is understand overall where your data’s coming from, where the data’s going, and what are the main transformations happening. You want to understand how the data is flowing through this. So here again, you can see, you have two source nodes and each one is producing a data frame. So all our components are data frame in and data frame out. So you read source one, you read source two. Both of them produce data frames. You did a join of them, then you did some reformatting here and aggregation, and then finally you wrote it out. So you can see a one-to-one correspondence between this and the visual view.
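The shape of the main graph described here (two sources, a join, a reformat, an aggregation, each component taking frames in and returning a frame) can be sketched in a few lines. This is a plain-Python stand-in and not the generated Spark code shown in the demo: a "frame" is a list of dict rows rather than a Spark DataFrame, the sample data and all function names are hypothetical, and the final write to a target is replaced by returning the result.

```python
# Plain-Python stand-in: a "frame" is a list of dict rows, not a Spark DataFrame.

def source_customers():
    """First source node (hypothetical sample data)."""
    return [
        {"id": 1, "first_name": "Ada", "last_name": "Lovelace"},
        {"id": 2, "first_name": "Alan", "last_name": "Turing"},
    ]


def source_orders():
    """Second source node (hypothetical sample data)."""
    return [
        {"customer_id": 1, "amount": 30.0},
        {"customer_id": 1, "amount": 12.5},
        {"customer_id": 2, "amount": 7.0},
    ]


def join(customers, orders):
    """Inner join of orders to customers on the customer id."""
    by_id = {c["id"]: c for c in customers}
    return [{**by_id[o["customer_id"]], **o} for o in orders if o["customer_id"] in by_id]


def reformat(frame):
    """Select and derive columns, similar in spirit to a SQL select."""
    return [
        {"id": r["id"], "full_name": f'{r["first_name"]} {r["last_name"]}', "amount": r["amount"]}
        for r in frame
    ]


def aggregate(frame):
    """Sum the amount per customer."""
    totals = {}
    for r in frame:
        group = (r["id"], r["full_name"])
        totals[group] = totals.get(group, 0.0) + r["amount"]
    return [{"id": i, "full_name": n, "total": t} for (i, n), t in sorted(totals.items())]


def main_graph():
    """Kept short on purpose: it only shows how data flows between components."""
    customers = source_customers()
    orders = source_orders()
    joined = join(customers, orders)
    reformatted = reformat(joined)
    return aggregate(reformatted)
```

Keeping the graph function to one line per component is what makes the one-to-one mapping to a visual workflow possible: each call corresponds to one node, and each variable to one edge.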

But now what you can do is go to this prepare component, right click, say go to definition. And what you’ll see here is that each one of them is a function. We wrap it up in an object. So this is Prepare, this is Scala code, right? So you have a prepare component here. It has an apply function that takes in a data frame and a Spark session and returns a data frame. Reformat is just a type tag, so you know what component this is, but it’s essentially a Spark SQL select. Right, now, here is the other interesting thing. Now what I can do is I can start to edit this. So I can do a concat and let’s see. So this is a first name, so let’s put a label in front of it, so let’s say first name colon, and let’s get it all right. So now it looks like we have that input and we say okay. So we changed the first name to put a label saying it’s first name. Let’s save that. Now, once I’ve saved that I can go back to the visual view. And once I go back to the visual view, I can now click the reformat node. And what you will see is that this first name that we did a concat of shows up here as well. Right? I can see my incoming schema and I can do reformatting here. Also, you might not know Scala, right? So I can go back and I can pick SQL. And now this is just simple SQL expressions. Right, so any data analyst can write these. Also we are adding Python very soon. So in a few weeks, we’ll have Python as well. Now let’s try to go the other way, right? So now, similarly, what I can do is, I’m going to write a concat in SQL, and let’s do a concat of, let’s say, last name. Right? Very similarly I’ll do that. And all right, so we’ve got that, and then we want to concat it with the last name. Let’s see. Okay, this looks good. So let’s apply this again. So we applied this, let’s save it. Workflow saved successfully. I can go back to the code, go back to the prepare component, look at the definition, and my last name edit is there. So it doesn’t really matter whether you are editing code or the visual workflow. You are essentially doing the same thing. And now I might choose to run it again, right? Let me hit play here again. So now I have made some changes. I want to see if they work okay. Right, so this is now submitting another job to the underlying Spark cluster, which is a Databricks cluster. So let’s wait for this for a second. It doesn’t take much. And again, as this job runs, we are going to see the data that is flowing through.
The older data has been grayed out since, you know, that’s from the older version, and now we can see the new data appear. So now, if I go again, look at the data after reformat, I can see that in the first name, the first name label is prepended. And in the last name, the last name label is prepended. So in that way we can see the modifications. Now, the other thing that we wanted to show is, this is all being committed to Git. So I can go to this Hello, World! project. Right. I can see my workflows, my data sets, right. And I can also see what the new commits are, so let’s see what the commits are, right. So I can go say, what were the latest commits, and let’s open a couple of these and see what we can see. So we can see that the changes I made, whether they were done as visual or they were done as code, you know, I added first name, I added last name, and all of them got added as Git commits. So, in the end, you can just take the idea of it and build the code. With that, let’s move on to the final thing that I wanted to show. So let’s go back to our metadata. Let’s go to our data sets and here, let’s pick this data set. So for any data set, I can just go click here for lineage and I can open lineage. Let me hit a shift + reload here, just to make sure I have the latest version. So, okay. So I got this and now I can click on any data set and see the workflows that read from it and wrote to it. I can also pick a particular column, let’s pick last name, and now I can see upstream changes and downstream changes. What it also shows us is where the data set came from, right, and where it got edited and where it flowed to. So I can go double click on a particular workflow and see, within this workflow, where the column last name was modified. I can see that this reformat is where it has been modified. Now let me double click on that. And now I can see the source code for that and the exact line which used that last name to produce a full name.
So in such a way, we can track down all the columns across data sets and across workflows. So this is column-level lineage. So as a summary, you know, in our IDE you can see the visual workflows, you can see the code for them, you can edit either, and everything is your code on Git.

Raj Bains
About Raj Bains

Raj Bains is the Founder and CEO of Prophecy, focused on Data Engineering on Spark. Previously, Raj was the product manager for Apache Hive at Hortonworks and knows ETL well. He has extensive expertise in programming languages, tools and compilers, as a member of the early CUDA team at NVIDIA and as part of the Microsoft Visual Studio team. He has also developed a language for computable insurance contracts. He knows database internals and has worked on NewSQL databases.