Kelly O’Malley is a Solutions Engineer at Databricks, where she helps startups architect and implement big data pipelines. Prior to joining Databricks, she worked as a Software Engineer in the defense industry writing network code. She completed her BS in Computer Science at UCLA. Outside of the tech world, Kelly enjoys cooking, DIY projects, and spending time at the beach.
This workshop is the final part in our Introduction to Data Analysis for Aspiring Data Scientists Workshop Series.
This workshop covers the fundamentals of Apache Spark, the most popular big data processing engine. In this workshop, you will learn how to ingest data with Spark, analyze the Spark UI, and gain a better understanding of distributed computing. We will be using data released by the NY Times (https://github.com/nytimes/covid-19-data). No prior knowledge of Spark is required, but Python experience is highly recommended.
Although no prep work is required, we do recommend basic Python knowledge. Watch Part One, Introduction to Python, to learn about Python.
This is the fourth part in our four-part workshop series, Introduction to Data Analysis for Aspiring Data Scientists. Today’s workshop is Introduction to Apache Spark.
And I just wanted to give a call out to the previous three sessions, three workshops. All the video recordings are available on our YouTube channel and that’s the link to check them out. I can also drop them in chat when I’m done presenting. The first part was Introduction to Python, second was Data Analysis with pandas, and third was Machine Learning. So I encourage everyone, if you haven’t already done so, to check out those videos. It’s a great four-part series and we’re excited to kick off the fourth part today.
So I know some of you are joining us on Zoom and some of you are joining us through YouTube Live, so welcome everyone. This is just a call-out to join our Data + AI Online Meetup group, that’s where you’re gonna get notifications on all of our upcoming online meetups, our tech talks and workshops like the one today. So that’s the link there to join. And then also, for YouTube, please subscribe and turn on notifications. If you like watching through YouTube, that’s a great resource to get notifications of upcoming online meetups and workshops. So far we’ve been broadcasting all of our content live through YouTube. So if you turn on notifications, that’s a great way to get notified of upcoming sessions.
So a few links, that’s the GitHub repo, that’s gonna have all of the content in the notebooks that you’ll need for all four parts of our workshop series, as well as, actually, all of our online meetups and tech talks that we run through this group. And this is also a link to sign up for Community Edition if you haven’t already done so. One other thing I’d like to call out is that we have two features that we’re utilizing, the chat, and then we also have the Q&A. So chat’s really good for if you’re having any audio issues, I drop links in there, so monitor that just for resources and general chatter, if you will. And then please drop all your questions in the Q&A. We have a few TAs here today that are gonna be helping out by answering questions. So drop those in Q&A, that’s the best way to get your questions answered as we can.
So without further ado, I’d like to have our TAs and our instructor give a quick introduction. So Denny and Brooke, why don’t you kick us off?
– Oh, hi there, my name is Denny Lee. I’m a developer advocate here at Databricks. Thanks for joining this session. – Hey, everyone, my name’s Brooke Wenig. I’m the machine learning practice lead at Databricks. Our team focuses on machine learning solutions for our customers, be it teaching best practices for machine learning or implementing solutions.
– Awesome, hi, everyone, my name is Kelly, I’m a solutions engineer at Databricks. I mostly help startups, who are usually new to Databricks, start implementing their big data pipelines.
Is it okay if I take over and get going? – Yeah, definitely. – Awesome. – Stop my screen share and cool, go ahead. – So I’m going to share my screen. I’m going to start off with some slides and then we’re going to get into a Databricks notebook. So if you guys have not already signed up for Databricks Community Edition, now would be a great time to do so. The link was in the chat, but it is databricks.com/try-databricks. We’re going to be getting into that in about five or ten minutes. So it’d be great if you guys already have that ready and set up.
All right, just getting started with some intro slides, a little bit of motivation behind, first of all, what actually is Apache Spark, and then a little bit of Spark Architecture so we can really focus on Spark Fundamentals, Spark Architecture, Spark behind the scenes.
I wanna start off with a wide view of the Apache Spark ecosystem. Beyond just what’s at Spark Core, our core Spark APIs, there’s a lot of modules on top of Spark that make it a really robust ecosystem for any and all big data use cases. So we have things like ETL processes, machine learning, graph processing, as well as the ability to input and output data to all sorts of really popular data sources, the idea here being, when you have big data needs, Spark will be able to fit in.
So a little bit of history behind Spark. I think a really good definition of what Spark is, in a nutshell, comes straight off the Apache Spark website: it’s a unified analytics engine for big data processing, with built-in modules for streaming, SQL, machine learning and graph processing. And that’s what we saw on the previous slide, the idea is to be this unified analytics engine. So not only are you doing your ETL, but you also have the ability to do machine learning and graph processing. It was created in 2009 at UC Berkeley by a group of PhD students. And, actually, those same people went on to found Databricks, which is where we came from.
Currently, it has APIs for Scala, Java, Python, R, and SQL. We’ll be using a little bit of Python and SQL later on in our notebooks. And these numbers of who’s contributed to and built Spark, over 1,400 developers from more than 200 companies, I would consider a little bit conservative, since Spark is an open source project. You can actually go and look at the code on GitHub yourself.
All right, let’s get into a little bit of how Spark works. And as I go through this, again, if you have questions, please put them in the Q&A, that way the TAs will be able to see them. Again, any chatter can go in the chat. I’ll also try to keep this a little bit interactive, so I’ll have questions for you, and I’d like you to put answers to them in the chat. So this does say How to process lots of M&Ms. Yes, that’s not a typo, we’re going to be using M&Ms as an analogy here. So let’s start off with that. You might remember a game when you were younger when maybe a teacher presented you with a jar of jellybeans or M&Ms or small candy and there was a contest: whoever could guess closest to the number of M&Ms in the jar would win the jar.
We’re gonna do something similar here, but we’re gonna change the rules a little bit.
Instead of having to guess the closest, you have to get the exact number of M&Ms the fastest. And since we’re making things more constrained, you’re going to have access to these M&Ms. So let’s say we have a huge tub of M&Ms, we want to count them as fast as possible so we can win. Can you all please throw some ideas in the chat of how we might do that, how we might go about that.
And then, I can’t see the chat, so can one of the TAs please throw out some answers as they come?
– [Brooke] Yeah, so some folks recommended lining them up one by one or counting the rows and the columns, then multiplying the two together. – Okay, that’s really interesting, and that would definitely give us a solution. My concern there would be you’re counting one by one, so you kind of have a sequential count going on, it could take a while.
– [Brooke] Yeah, so some folks have already alluded to the Spark side of things, about distributing the counting tasks amongst us. – Awesome, so, yes, that is what I was looking for. We got there really quickly. So like what Brooke said, someone said, “Distribute the counting task among us.” So let’s think about it like that. Instead of just counting one by one the M&Ms yourself, let’s bring in some of our friends and let’s have all of our friends group together, everyone can count individual piles of M&Ms.
And that’s actually what I wanna use to talk about what a Spark cluster is. So this is the overall anatomy of a physical Spark cluster because, essentially, Spark itself is a way to do distributed computing. So we’re distributing our compute power across multiple computers, or nodes. So if we map this to our M&M analogy, we have a driver, which is the boss, and that would be you. So you’re acting here as the manager. You’re in control of the M&M counting, you’re keeping track, you’re doing the scheduling. And then you have some friends who are doing the actual work for you. And that’s what’s called a worker in Spark. We have one driver to N workers. This can be variable depending on your workload. Within a worker, the actual work takes place on what’s called an executor. And this is just a Java virtual machine. In open-source Spark, there can be more than one executor per worker. Within Databricks, this is one-to-one. So for the case of this talk, we’re just going to use worker and executor interchangeably. But, essentially, what I want you to get out of this slide is that a Spark cluster consists of one driver, who’s doing the management, and then many workers who are actually doing the compute.
Awesome, so, yes, we are going back to M&Ms, and the reason for that is we have a little bit of a problem.
So we have multiple friends, we’re telling them what to do, they’ve counted piles of M&Ms.
There’s an issue here. Who can tell me, again in chat, what the potential issue is if each friend has their own pile of M&Ms. There’s an extra step that we have to do.
– [Brooke] Somebody said allocating the work. – Okay, yeah, that’s absolutely true, and that is something that needs to happen. But let’s say for the sake of this that we already have all the work allocated. – [Brooke] All right, and then other folks are chiming in, “Adding the counts among your friends,” “Collecting the results and combining the counts.” – Awesome, that’s absolutely correct, each of our friends has their own individual count, but now we need to combine them all, because I need to give my result back so I can win the M&Ms.
And that, again, goes right back to what a Spark job is.
So now that we’ve talked a little bit about what a Spark physical cluster looks like where we have a driver and many workers doing the work, we can break down what the actual logical Spark job is.
So let’s start off with our job, which is count the M&Ms. We wanna count them, and we already know this is pretty much two groupings of work, where the first one is each friend has to count their M&Ms, and then the second one is we need to get that final result back so all of the individual counts need to be aggregated and returned to me. And that’s exactly what Spark Stages are. So per a Spark job, there can be one or more Stages. And those are further units of work. The way that we define where a Stage starts and ends is when any sort of data needs to be exchanged. So let’s go back to our M&Ms. Once every friend has counted their group of M&Ms, we’re going to have one friend do the final aggregation, because, again, the workers do the work, I’m just telling them what to do. So when we have our final friend do that final count, he or she needs to get all the individual counts from each of the other friends, which means we’re exchanging data. And that’s called a shuffle. Anytime there’s a shuffle, we’re going to have a Stage boundary. In this case, we only have two Stages because we have individual counts and then aggregation.
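The two Stages above can be sketched in a few lines of plain Python. This is a toy model of the idea, not real Spark code: thread workers stand in for friends, and strided list slices stand in for each friend’s pile of M&Ms.

```python
from concurrent.futures import ThreadPoolExecutor

def count_partition(partition):
    """Stage 1 task: each worker counts its own pile (partition)."""
    return len(partition)

def distributed_count(data, num_partitions=4):
    # Split the data into partitions, one pile per "friend" (worker).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # Stage 1: every partition is counted in parallel (one task per partition).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_partition, partitions))
    # Stage boundary: the partial counts are exchanged (the "shuffle"),
    # then Stage 2 aggregates them into the final answer.
    return sum(partial_counts)

print(distributed_count(list(range(1000))))  # 1000
```

The shuffle here is trivial because only one small number per partition crosses the boundary, which is exactly why a distributed count is cheap.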
In Spark, we also further break down Stages into what’s called tasks. And these are pretty much the smallest unit of work that Spark can do: an individual compute core operating on a subset of the data.
We don’t have to get too much
into exactly where tasks are broken up right now. I just wanted to make sure we have some anatomy of what a Spark job looks like. And now we’re actually going to move into a notebook. So I’m going to show you what all of this looks like in practice ’cause I think that’s the best way to learn.
All right, if you have Community Edition up, we’re going to move over there now. I’ll give everyone like a minute or so as we get moving to make sure that they’re fully set up, and then I’m going to go over everything, how you can import the notebook, how we can start up a cluster. I don’t wanna leave anyone behind here.
So just like a quick tour of what we’re looking at is, this is the Databricks Community Edition homepage. There’s a couple of shortcuts on this page that we can look at. What we want to do is create a cluster. We talked a little bit about what a Spark cluster looks like. And the important thing to know here is that to run Spark code, we need a cluster to run it on. So our notebook contains Spark code, let’s create a cluster so we can actually run that Spark code. So on this clusters page here, I’m just going to go ahead and click on Create Cluster. And then this gives us some configuration options for our cluster. We don’t have to do anything specific, I’m just going to give it a name and then I’m going to go ahead and click Create so this cluster can spin up. It’s gonna take about a minute or so because we have to do some underlying setup, but it’ll pop up on my page if I do a quick reload. So while our cluster is starting up, I just want to go back in and do a quick review of our cluster in general. What we’re doing is setting up a local-mode Spark cluster because we’re using Community Edition. So essentially we are running Spark, using this driver-worker architecture, but we’re doing it on a single local machine. What this means is that Community Edition is really great for prototyping, it’s great for students who are learning Spark. If you want to take Spark into production for a production job, you’re going to want to use some version of enterprise Spark, like for example Databricks.
So as this cluster spins up, and it’s actually already started for me, this little green dot means we’re good to go, we’re going to import our notebook so we can run it. I think the TAs are going to put this GitHub link in chat, but this is a link to the Intro to Spark notebook. This notebook will be available for you after the talk as well. If you’d like to follow along, then please go ahead and just go to this GitHub page now. All you have to do is copy the URL here, and then back in Databricks, we’re going to click on this Workspace tab. And what this is, this is kind of a file structure where all of our notebooks are stored. So you can actually drop in the notebook right at this high level. I’m gonna follow some best practices, I’m going to put it in my own folder even though I’m the only user on this workspace. So I’m just going to go into Users, my Kelly folder. I already have a notebook in here, I backed this notebook up just in case, and I’m going to import my notebook. So again, if I click this little drop-down next to my email address, I can just click Import. And then there’s two options, Import from File, Import from URL. I’m going to paste in that GitHub link, it’s an iPython notebook, and click Import. And here it is.
I’ll give everyone a second or two to make sure we’re all at this point so we can go ahead and follow along once we go through the notebook. Something I also wanna add is this is a pretty long notebook, I don’t expect to get through the entirety of it. As we get further down into the notebook, there’s some more detailed analysis, some more detailed information on some more advanced Spark APIs. I documented it well, the idea being that you can go back after the fact and have links to documentation to learn more and do your own analysis.
All right, so as we’re getting into this notebook, like I said, we created a Spark cluster, we have some Spark code, so we have to tell our Spark code what Spark cluster to run on. So what I’m going to do is I’m just going to click this little drop-down where it says Detached and I’m going to select the cluster that I just created. That’s all there is to it. And now what I have is a notebook that has Spark code running on a Spark cluster. Some notebook fundamentals I wanna go over before we go through this: this is an interactive notebook, so each cell can have code in it. I’ll go over all of the special commands, like what the percent sign means, as we go, so don’t worry about those. But what’s important to know is that when you have a cell of code, there’s two ways you can run it. You can either hit Shift-Enter and that will run the code, or there’s also this little Run button, you can click that as well, and that’ll run the cell. Something else I wanna point out is if you wanna add a new cell in the notebook, if you just hover over the bottom of an existing cell, there’s this little plus button. You can click that and that’ll give you a new cell to write some code in. All right, so high-level overview of what we’re going to go through in this notebook: we already did our Intro to Spark slides, we had an introduction to what a physical cluster looks like, the anatomy of a Spark job, then we’re going to talk a little bit about data representation in Spark, ’cause it is different from other tools like pandas and I think it’s really important to know. And then we’re gonna talk about, again, our distributed count. We’re going to look at exactly what that looks like in Spark, a little bit about how Spark does operations, and then some Spark SQL. If we have time at the end, there’s some more Python code with some more interesting analysis we can go through.
So the first thing I want to mention is I already ran this first cell, and this is actually not Spark code. This is just interaction with the file system. And that’s what this %fs means, I’m saying, “Hey, file system, do an ls.” And the reason I want to do this is to show the dataset that we’re going to be operating on. It is this COVID-19 data. The Databricks datasets file folder is available to everyone using Community Edition. There’s a lot of really interesting COVID-19 data available to you. So after this Intro to Spark, if you want to do some more interesting analysis on any of this data, you’ll find it in Databricks datasets, as well as other datasets for you to mess around with. The specific CSV that we’ll be using is this us-counties CSV. And what that contains is COVID-19 data from “The New York Times,” broken down at a county-by-county level. If you were at our previous Intro to Data Analysis sessions, this is actually the same dataset that we used last week with the scikit-learn analysis. You might already be familiar with it.
All right, let’s get into some data representation with Spark. So this beginning picture is a little bit cluttered, but I just wanted to point out a couple things, things that we already have and things that we still need. You can see along the left here Environments, you need an Environment to run Spark. Databricks takes care of that for you. We can remove the need for this Environment over here. Then we have Data Sources. We need to have some data to do operations on. Usually, Spark is for big data. We’ve also taken care of that for you. There’s some data in Databricks datasets. It’s really just sitting in an S3 bucket, so that’s available to you as well. So now all we need to focus on is this Workloads section, and that’s where our actual compute and the actual Spark engine comes in. So we have Spark Core at the core, and that is the core Spark engine, the core APIs. What we wanna do is get at a higher level and look into how we can really manipulate this data using Python and SQL. So there’s two APIs that I’m going to talk about right now, and that is the RDD API and the DataFrames API. And you can see they’re kind of stacked, as in the DataFrames API does exist on top of the RDD API. So let’s talk about RDDs first. They are something really important, they’re also something I don’t want to spend too much time on, because they are somewhat of an older, legacy API. So when Spark was created, RDDs were the original form of data representation. RDD itself stands for resilient distributed dataset, and even the terminology is really important, because it describes how Spark data is represented at a very high level. Let’s talk about each term individually. Dataset is pretty self-explanatory, it’s a collection of data. We have a distributed dataset in that, again, a Spark cluster is multiple nodes or multiple computers that work together to process data in parallel. So what that means is we have our dataset distributed amongst those nodes.
So each node can operate on a chunk or partition of data. And then, I think most importantly, is this resilient term. So what resilient means in the context of an RDD is our data is fault tolerant. This is made possible because RDDs, once created, are immutable, you can’t change an already created RDD, you have to create a new one if you wanna make a change. RDDs also keep track of their lineage, so they keep track of any operations that were done to produce the RDD data that you’re working with. And what this means is, if we lose a node or one of our workers, we can reconstruct that data and our overall Spark job isn’t going to fail. This is really important because, again, with Spark, we’re operating on big data. It’s a very heavy compute workload a lot of the time, and we don’t want to have our entire job fail if just one of our pieces of hardware fails. Something I do wanna point out about that is I’m saying, yes, you can recover from a failure, and that’s a failure of a worker. There’s also the Spark driver. If the Spark driver fails, we’re in a little bit more trouble, ’cause then our Spark job will fail if we lose the driver.
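That lineage idea can be sketched in plain Python. This is a toy model, not Spark’s actual implementation: each RDD stores not just a pointer to the source data but the recipe of transformations that produced it, so a lost partition can always be recomputed from scratch.

```python
class ToyRDD:
    """Toy model of an RDD: immutable data plus the lineage that produced it."""

    def __init__(self, source_data, lineage=()):
        self.source_data = source_data  # the original input, e.g. a file on S3
        self.lineage = lineage          # ordered tuple of transformations

    def map(self, fn):
        # Transformations never mutate; they return a NEW RDD with extra lineage.
        return ToyRDD(self.source_data, self.lineage + (fn,))

    def compute_partition(self, index, num_partitions):
        # If a worker dies, its partition can be rebuilt by replaying the lineage
        # against the source data — this is what makes the RDD "resilient".
        partition = self.source_data[index::num_partitions]
        for fn in self.lineage:
            partition = [fn(x) for x in partition]
        return partition

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10)
print(rdd.compute_partition(0, 2))  # [10, 30]
```

Note that calling `map` did no work at all; it only extended the lineage, which also previews the lazy-evaluation behavior we’ll see later in the notebook.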
All right, now that we talked a little bit about RDDs, I wanna talk about DataFrames. So I believe DataFrames were introduced around 2015, and they are the data representation that I would recommend using when you’re using Python with Spark. And they’re the data representation that we’re going to use in this notebook. So there’s a bunch of reasons why DataFrames should be used over RDDs, although I do wanna note that under the hood of DataFrames, RDDs are still there. They’re used to do the underlying computation; Spark just abstracts and optimizes that for you. So DataFrames have higher-level APIs, which means they’re significantly more user-friendly, much easier to learn, but they’re also more powerful. And I think that’s important: we not only have this user-friendly API, it’s also heavily optimized and much more performant. And we’re actually, later on in the notebook, going to get into why that’s possible. The last thing I wanna leave you with is some motivation on why DataFrames are important. This graph here is a timing chart, in seconds, of how long an aggregation over pairs takes. And we can see at the bottom, in those red bars, the time it takes in Python with the RDD API. It’s slow; Scala’s a lot faster. When we look at DataFrames, all four of these languages take the same time and they’re all faster than RDDs. SQL and R aren’t listed under RDDs because they weren’t actually supported with RDDs. So again, another reason to use DataFrames: they’re faster, they’re easier to use, and they provide more language options.
Awesome, so let’s get started. Let’s actually create our first DataFrame. So I’m just gonna run this cell, and what we’re doing here is we’re using a DataFrame reader. Just going through this code, what I’m doing is, this is just my variable, the COVID DataFrame, and I’m telling Spark to read my data. This was the same path we looked at before. Specifically, I’m reading in a CSV, so this is the DataFrame CSV reader. There’s different DataFrame readers for lots of different sources of data, so if you’re reading something in Parquet or JSON, there’s a DataFrame reader for that, you would do .parquet or .json instead of .csv. Each DataFrame reader has its own options, and we’re gonna take a look at that in a second, because what I want to show you with this is we just did a blind read of our CSV file, and then I did Show, which shows the data we just read in. You can see there are a couple of problems. There are actually two problems here I wanna fix. The first one is, pretty clearly, the header of our data is not being treated as the header, it’s the first row of data. We have some weird column names going on, we wanna fix that. The other problem is, if I click this little drop-down here, we can see that all of my columns are string-type. I just did a blind read of my data, so it auto-assigned string. I wanna be able to do maybe some integer addition on the integer columns or maybe some plotting with this date column, so I want a more defined schema, or types, for each of my columns.
So in order to do this, what we’re going to do is let’s take a look at the actual Spark documentation. I wanna show you what that looks like so we can figure out what options we have to pass into the CSV reader. I’m just gonna go ahead and open this link up here.
And so this is the home page for the latest Spark docs. I wanted to start off here to show you exactly how to get into the API Docs. So if I click API Docs and then Python, because we are writing Python code right now, then I have a whole bunch of package options. Rather than try to hunt down the CSV reader, I’m just going to search for CSV. And then you can see it popped up right away, the DataFrame reader and the DataFrame writer. I know I’m reading a DataFrame, so I’m just going to click on this.
So now what I can see is all the parameters that the DataFrame reader can take in. And I know I want my first row to be a header and I want a better schema than just all strings. And that’s actually these two options right here: use the first line as column names, perfect, and then infer a schema. So that means, “Hey, Spark, take your best guess at the schema of this CSV file.” So we’re going to set both of these to true and see what we get.
I just ran this with header equals true and inferSchema equals true, perfect. We can already see I have good column headers: date, county, state, FIPS, cases, deaths. And then if I look at the types in my columns, it looks like the schema was inferred perfectly. I have a date column, some string columns, and some int columns. Before we move on, since we’re going to be working a little bit more with this data, I wanna quickly go over the columns in this data just so we’re familiar with it.
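Conceptually, inferSchema samples the rows and picks the narrowest type that every value in a column parses as. Here’s a toy plain-Python sketch of that idea, with made-up sample rows; Spark’s real inference is more sophisticated and also handles dates, nulls, and so on.

```python
def infer_type(values):
    """Pick the narrowest type that every sample value parses as."""
    for parse, name in ((int, "int"), (float, "double")):
        try:
            for v in values:
                parse(v)
            return name
        except ValueError:
            continue  # at least one value didn't parse; try a wider type
    return "string"   # fallback: everything parses as a string

# First line of the CSV used as the header, remaining rows sampled for types.
header = ["date", "county", "state", "fips", "cases", "deaths"]
rows = [
    ["2020-04-26", "Los Angeles", "California", "06037", "19567", "913"],
    ["2020-04-26", "Snohomish", "Washington", "53061", "2387", "96"],
]
schema = {col: infer_type([r[i] for r in rows]) for i, col in enumerate(header)}
print(schema["cases"], schema["date"])  # int string
```

One caveat worth knowing: inference requires an extra pass over the data, so for large production datasets it’s common to declare the schema explicitly instead.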
This is from “The New York Times” on a county-by-county level. What we have, in each row, is a date, so that’s the date this row was valid for, and we have county and state. FIPS is a bit of a weird one, that’s something that the census does in order to give each county in the United States a unique numerical identifier. We’re not gonna use this right away, but it is used further down in the notebook because it’s actually a really good way to have a unique identifier for a county. So if we want to maybe join with some other data, we can use that FIPS code because it is unique to a county. We also have cases and deaths, and something to note about these cases and deaths columns is that, each day, “The New York Times” adds a new row for each county with the cumulative cases and deaths up to that day. So if we want the most recent data and the most recent data only, we need only the most recent date for each county.
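Because those columns are cumulative, getting a current snapshot means keeping only the most recent date per county. Here’s a plain-Python sketch of that logic with made-up rows; in Spark you’d typically express the same thing with a groupBy on the max date, or a window function ranked by date.

```python
def latest_per_county(rows):
    """Keep only the most recent row for each (county, state) pair."""
    latest = {}
    for row in rows:
        key = (row["county"], row["state"])
        # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
        if key not in latest or row["date"] > latest[key]["date"]:
            latest[key] = row
    return list(latest.values())

rows = [
    {"date": "2020-04-25", "county": "Los Angeles", "state": "California", "cases": 18545},
    {"date": "2020-04-26", "county": "Los Angeles", "state": "California", "cases": 19133},
    {"date": "2020-04-26", "county": "Snohomish", "state": "Washington", "cases": 2387},
]
snapshot = latest_per_county(rows)
print(len(snapshot))  # 2 — one row per county, the most recent one
```

The (county, state) composite key matters because county names repeat across states; the FIPS code mentioned above would be an even cleaner key.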
All right, so let’s get into an actual distributed count so we can look at what some Spark code looks like, and then I also want to show you the Spark UI and what that looks like. That’s a pretty important tool for optimizing and debugging when you’re writing code. So instead of just counting M&Ms, let’s go ahead and count the number of rows in our DataFrame. So this here, it’s our DataFrame, we can see it. We’re only showing the top 20 rows, though. There’s a lot of counties in the US, and we know we have a new entry for every day, so I’m guessing there’s gonna be a lot of rows. Let’s see how much data we actually have. So before I run this: this should be very similar to our M&M count, even though we’re counting real data. Once we run our Spark job, how many Stages would we expect? Can y’all please put in the chat what we’re looking for?
All right, and unfortunately I can’t see the chat. Brooke, are there numbers being put in there? – [Brooke] Four, two, three, a few threes. – All right, so, yeah, let’s think about this a little bit. Actually, let’s look back to our M&Ms. If we remember, our M&M count happened in two Stages, and the reason for this, again, was each friend counted their own pile of M&Ms and then Stage two was an aggregate. And this count, even though we’re doing it on rows of a DataFrame, we can equate a DataFrame row to an M&M: we’re going to have two Stages, just like our M&M count. Because what we’re going to do is each worker is going to count their own pile of rows, and then in Stage two, one of them is going to do the final aggregate. So if I run this count here, it just runs, and you can see I ran a Spark job. You can see we have eighty-nine thousand rows, which is quite a lot. And then we can actually look at our Spark job. And this is when we’re gonna get into the Spark UI. So I can see I ran one Spark job, that’s my count, awesome. If I hit this drop-down, we can see, yes, we had two Stages. And, again, what this is is the Stage one count, each worker counts their own pile of rows, and then in Stage two, we aggregate each of those individual counts. This View button here is actually a link, so we’re gonna click on it. And this is what’s called the Spark UI. Within a Databricks notebook, we provide a hosted Spark UI. So you can just go ahead and look at any of your jobs. But what we can actually see for job five, which is our count, there’s a lot of information about it and there’s a lot of tabs in the Spark UI. The Spark UI can be very confusing because it does provide so much information, so I wanna focus now on just some really important things, like right now, looking at the two different Stages. So we can see Stage five happened, and then on top of that, Stage six happened.
So this first Stage here, we just know because of our knowledge of how a count works, that this is each individual worker counting up their pile of rows, not M&Ms, rows. And you can see there is only one task. Normally, we would, for a count, expect this to be a lot higher, but because our dataset’s actually very small, we only have one task.
Our second Stage is this final aggregate. This is just one worker counting up each individual count.
Something I do wanna point out here is this idea of a Shuffle Read and a Shuffle Write. Like I told you earlier, a Stage boundary happens when there’s a shuffle, or an exchange of data. This is actually a two-step process: specifically a write, and then a read. So after each of our first workers does their individual count, they do what’s called a Shuffle Write, and what that does is it makes their individual counts available to the worker who’s going to be doing the final aggregate, so that final worker starts off their Stage with a Shuffle Read. And that’s how they get all of the individual counts. The shuffle read and write amounts are exactly the same, just like we’d expect.
So that’s just a little bit of insight into the Spark UI. We could click into one of these links here and see more task information, things like timing. We’re going to get a little bit more into the Spark UI in about two minutes or so, so I can show you what a query plan looks like, but just know that there’s a lot of information available in here for you.
All right, let’s actually get into some interesting Spark code. Let’s actually do some analysis here. So I live in Los Angeles, Southern California. I happen to know that Los Angeles will have quite a few cases of COVID-19, so I want to look only at the data for my county. And, like I said earlier, the most recent information is on top, so I want to sort by date, so the most recent information for Los Angeles County will hopefully be on the top of the DataFrame that I create. So I’m just going to go ahead and run this code. What I do want to point out, with this Spark code right here, is that if you’re familiar with something like pandas, this does look a little bit different syntactically. It is still Python code, written against Spark’s Python API, which is called PySpark. The syntax for how you reference columns and how you run your functions is slightly different, so it might take some getting used to. But, essentially, what I’m doing here is sorting by the date column descending, so I have the most recent date on top, and then filtering for only the County of Los Angeles. And I ran that, it took 0.15 seconds, and nothing happened. There were no Spark jobs created, unlike before. So this is something very interesting. I tried to do a sort, I tried to do a filter, and I got no results.
And this is actually one of the core ideas of Spark. If you were writing this kind of code on a pandas DataFrame, for example, if I filtered a pandas DataFrame for Los Angeles, I would get a result printout immediately. With Spark, nothing happened. And this is because, in Spark, operations are broken down into two different categories: transformations and actions. So, pretty fundamental to how Spark works, and we’re going to get into why this is the case, is that transformations are lazy and actions are eager. What that means is, if I define a transformation on my DataFrame, absolutely nothing will happen. I have to run an action to trigger my transformations.
So I can tell, when I run this same code as before, because nothing happens, that both sort and filter are transformations. I’m not actually going to get any of this sorting or filtering to happen until I call an action.
All right, so let’s talk a little bit about why transformations and actions actually exist, and why Spark is what’s called lazily evaluated. There are a couple of benefits to being lazy. The first one is, with Spark, we sometimes operate on huge datasets. If you wanted to do an eager filter like pandas does, where the second I call a filter we go in and do the work, we might have a problem, because we’d have to read in all that data. That might be impossible depending on how big our dataset is, the size of our Spark cluster, and the amount of memory we have. Being lazy, we don’t actually have to read in any data until we know exactly what our chain of transformations is going to do. It’s also easier to actually parallelize the operations themselves. I think I keep saying this: we’re working on really big data, and that’s why this is all necessary. You have to do things a little bit differently to be able to process terabytes and even petabytes of data. When you are defining transformations, like in my case a sort then a filter, or a sort and then a filter and then another filter and then a sort and then a map, or n different transformations like it says here, Spark can pipeline all of that on a single chunk of data on one of my worker nodes. It can all run without me having to filter, get the result, sort, get the result. It can just happen, happen, happen on the same chunk of data. So we’ve sped things up considerably using this method. And then, I think most importantly, is the option this gives Spark as a framework. We’re going to get a little bit into how Spark does its optimizations. There’s something called the Catalyst optimizer, and that is the core optimizing engine for Spark, packaged into Spark itself. I don’t wanna get too much into what the Catalyst optimizer can do and how it works within the notebook.
There’s a link to a blog that has a lot of really great information on what happens behind the scenes. But I do wanna talk a little bit about how it will help us here. The first thing I wanna mention is that you have to use DataFrames, or another API that the Catalyst optimizer supports, in order to make use of this optimization. That’s another reason why you wanna use DataFrames rather than RDDs; this is what gives us the speed-up we were looking for on that graph earlier. What the Catalyst optimizer can do is, when we have a chain of transformations like what I did before, it was only two, a sort then a filter, and then I call an action, which is what we’re going to do in a second, the optimizer can look at our chain of transformations and maybe say, “Hold on, this doesn’t look right,” and make some tweaks. It can do everything from rearranging the order that things happen in, to choosing a more performant version of sort. And this just happens behind the scenes for you, so the actual code that executes might be a little bit different from what you wrote. So, at a high level, what the optimizer does is it takes a look at your code, parses it, says, “Okay,” in my case, “you did a sort and then a filter,” tries to optimize that if it can, and then comes up with a physical plan for how it’s actually gonna run that on the hardware. And then it just does that. So you can see here, it takes in the DataFrame, and whatever you’re doing to that DataFrame, as input. I do wanna briefly mention the existence of a Dataset. A Dataset is another high-level API that you have access to in Spark. The main difference between a DataFrame and a Dataset is that a Dataset is typed, as in you need a typed language to actually use it. This is a bit of a generalization, but it’s essentially just a typed DataFrame.
Because we’re using Python, which is not a statically typed language, we can’t use Datasets. So as far as we’re concerned for this talk, DataFrames are our option. And when you’re using Python, DataFrames are the best option to use.
So I’ve been saying a lot, I did a sort, I did a filter, and this will all trigger when I do an action. Let’s do an action.
So here I have my same code as before, I did a sort, I did a filter, and show is an action. You can see I actually ran a Spark job, and this is great, this is exactly the output I want. I have the most recent date on top, these are all only Los Angeles, it worked. And I’ve been seeing a lot of flashes of chat coming up. I can’t see the chat, so I don’t know what people are saying, but I do wonder if anyone called out my bad code, because I did have bad code here. This is not the most optimal way to do things, and I wanna show you how the optimizer can fix some mistakes. What I actually did is, I said, “Okay, I want only Los Angeles, I want the most recent date on top,” but the way I linearly wrote this code, I sorted first and then I filtered. If we think about what this means, what I told Spark to do is, “Okay, here’s all of your data. Sort all of it, and then throw out a bunch of it.” If we’re working on big data, that would be really inefficient, because I would rather throw out a lot of my data first and then do the sort, so I’m sorting less data. We can actually see the Catalyst optimizer realize this and fix it for me. So I’m gonna show you how to see this. If, again, I click on this drop-down and click on View next to my Spark job, there’s an associated SQL query, SQL query 12. If I click on the link, I can see exactly what’s being done, and this is the final physical plan that’s run.
There’s a lot of information under this Details tab. I don’t wanna go through it all, but what I do want to show you is the exact point where the optimizer said, “Hey, this isn’t good.” If we read from the bottom, what I did is say, “Okay, sort and then filter.”
The logical plan is, again, sort and then filter, right up until we optimize that logical plan. As soon as the Catalyst optimizer takes that optimization step, you can see that filter and sort have been flipped. So what actually happens is, even though in my code I said, “Okay, sort then filter,” the Spark code that runs filters first and then sorts. Even though I wrote suboptimal code, the optimizer took care of it for me.
Perfect, so let’s talk about Spark SQL now. And I think we’re actually running perfectly on schedule. I wanna show you some SQL code and how you can use SQL to query Spark DataFrames. Spark SQL is ANSI-compliant, so if you’re used to SQL, this’ll be very familiar to you. We’re not gonna write complex SQL at all, though, so don’t worry about that. The first cell that I’m going to run is this createOrReplaceTempView. What this does is, our COVID DataFrame was created using Python, which means it’s only available to us using Python. In order to create something that we can query using SQL, we use this createOrReplaceTempView function with the name of the SQL table that we want to be able to query. All it really does is create a temporary view on top of our Python-created DataFrame that we can then use SQL to interact with. The name I gave it is “covid.”
Awesome, so the first thing I wanna point out is this %sql command at the top. We’re running in a Python notebook, but that doesn’t mean we can’t mix and match languages. The way we do that is just %sql, which tells the notebook that the code below in this cell is SQL code. So now that I have that, I can run a SELECT * FROM covid, and what that does is give me everything in this covid table, * is everything. I can see here this actually looks very similar to what we saw when we initially printed out our dataset. This is the same data, it’s just formatted a little bit differently, because behind the scenes, Databricks notebooks will take a SQL query and use a different method of displaying our data that actually gives us a lot more functionality. We can explicitly do this with Python as well, I just wanted to show it to you to get started with. This is called a display, and the reason that I wanted a display here with our SELECT * is because I want to create a graph. Because we have date data, we know we have updated counts as the dates go on, so it would be really great to have some sort of view over time of what our cases look like. This is all of our data; next, we’re gonna filter down by county. And that’s actually why I have this little comment here, with keys, grouping, and values, because what we’re going to do is use the UI to create a graph. So on this little drop-down, that’s this little bar chart and then a little down arrow, we’re gonna click Line. This is not a useful line chart to us, so we’re going to set some options so we can create our graph behind the scenes. If I click on Plot Options, I can see there are some keys, some values, some series groupings. This is not what I want, so I’m just going to get rid of them. Now what I’m going to do is drag in the keys, grouping, and values that I do want to create my graph.
What this is going to do behind the scenes is actually kick off a Spark job with all of these groupings. You could write a SQL query to do the exact same thing and then plot it yourself, so this is just kind of a shortcut for us. So like it said in the comment, you want date along the bottom, because you want a time series, you wanna group by county, and then I want my values to be cases. And if I click Apply, it’s running a Spark job, running a couple of Spark jobs, actually, because it could be doing quite a lot of groupings. Once this is complete, what we’ll be able to see is a graph, a line chart, actually, by county. If I expand this a little bit, because we have so many counties, we’re only showing the names starting with A. But we can see over time, if we hover, Adams County, wherever that may be, seems to be trending up the most right now.
All right, so like I said earlier, I wanna filter down to only Los Angeles, where I live. So this is, again, the same code as before, select everything from the covid table, the difference being I want the county Los Angeles. And please feel free to replace that with whatever county you live in, same idea. This is great, it’s Los Angeles only. Let’s plot it, and this time, I’m gonna plot cases and deaths on the same graph. So I click Line, then Plot Options, and let’s do a series grouping. Again, we’re grouping by Los Angeles. And let’s do cases and deaths, to be a bit morbid.
And we can see over time, it’s trending up pretty sharply, which makes me not so happy. But it looks like maybe it’s starting to level off a little bit at the top, I can hope. Either way, for your county or wherever you live, this is a great start for some analysis, just some simple plotting.
The last SQL query that I wanna show you, before I take any outstanding questions and quickly review what’s in the rest of our notebook, is a little bit more of a complex query. This is doing that group-by and order-by ourselves, without relying on the groupings in the plot options to do it for us. What I’m going to do here is, because we have an update for each date in our dataset, I’m just going to grab the max for each county. So that’s what this does: I group by county, and for each county, I take the max number of cases and the max number of deaths. And then I want only the counties that have the most cases on top, so I’m gonna do an order by county, I’m sorry, order by max cases descending, so the most is on top. What I should now have is a DataFrame with the county, the maximum number of cases, and the maximum number of deaths for that county. And it’s sorted, so I can see that New York City is actually the hardest hit, the next one is another county in New York, and then Cook County, the county that contains Chicago. If I wanna view this visually, all I really have to do is click this little bar chart option at the bottom, and it automatically makes a graph for us, and we can really visualize how many cases New York City has.
All right, so I’m not going to go through the rest of this notebook, because it is well documented, and I’d like everyone to take a look at it further on their own. But essentially, it gives you some ideas for how to get started on analysis using Spark on your own. It pulls some census data from the internet, does a join using that FIPS code I pointed out earlier, and then does some interesting analysis on case rates by population. So that would be great to look at if you have time. I also tried to use a variety of Spark APIs, just to point out the variety of things that can be done with Spark and the different types of analysis you could think of doing.
Awesome, so we have about five minutes left. Are there any interesting questions, Brooke or Denny, that you think would be good for the entire group to hear? – [Brooke] Yeah, definitely. So one of them is about the high-level concept of when to use pandas versus Spark. – Yeah, so that’s actually a great question. And that’s something that I hear a lot from people who are more familiar with pandas, people who are more familiar with traditional Python. The main difference between a pandas DataFrame and a Spark DataFrame is that the Spark DataFrame is distributed. With pandas, we’re looking at what’s called single-node computing, just doing all of the work on your own laptop. That’s when you can use something like pandas: when the data all fits in memory. I was actually told the other day that if you’re doing compute with a pandas DataFrame, you wanna make sure that the size of your pandas DataFrame is significantly less than the size of your computer’s memory, so you can make sure you can actually do the computation. But if you’re working with big data, where it might not fit into the memory of one computer, or your pandas computations just aren’t working, that’s where Spark comes in. That’s what enables you to do computation on terabytes of data, or even just multiple gigabytes. If your computer’s running slow and there’s a lot of data, you use Spark. I do wanna quickly shout out another open-source project called Koalas, named because it’s another cute, fuzzy animal. What Koalas is, is a way to write pandas code, using the pandas syntax, because like I mentioned earlier, pandas syntax is a little bit different from Spark syntax. So Koalas is a bridge between pandas and Spark. Behind the scenes of Koalas, Spark code is running, but the syntax is the same as pandas. So if you’re very familiar with pandas and you wanna move into Spark, I would take a look at the Koalas project.
And I also know that, I think maybe in early June, there’s gonna be a workshop on Koalas.
– [Brooke] Great, and then another question that came through: with all the optimizations provided by the Spark DataFrame API, is there still room for programmers to further optimize their code? – Oh, definitely, there’s a lot that can be done. I didn’t get into a discussion on partitioning or how you wanna set up your Spark configs, but there’s a lot of thinking about how shuffles are done, when you wanna choose to use a shuffle, because, again, a shuffle is a data exchange and it can be expensive, and how you want to break up your data. That’s something that, as you move further into Spark, you’ll become more familiar with: where and when you want to optimize, and where you wanna step back and let the Catalyst optimizer do its work. – [Brooke] And one thing I’d like to add on to that, too, is that a lot of the optimizations have to do with how you lay out your data on disk. So it’s not necessarily changing your Spark code or changing your Spark configs, it’s looking at how you have partitioned your data physically on disk. All right, and then some other folks are asking how Spark is different, open-source Spark versus Spark on Databricks. – Yes, so that’s a really good question. The way Databricks likes to look at things is, any and all APIs needed for usability are open-source. What Databricks does is add some extra features to do things like improve performance and make it easier to use. So, like you can see here with this notebook, we provide a fully-hosted Spark environment, and that’s why I didn’t have to get into exactly how we’re going to wrangle a cluster together behind the scenes, or even talk about a resource manager, because that’s something that Databricks does all for you. So even though the Spark running on Databricks is open-source Spark with some cherries on top, it’s really the hosted environment, these notebooks, and all the other features that differentiate Databricks.
– [Brooke] Great, and then there’s a question about machine learning integration with Databricks. Let me know if you want me to take this or if you wanna take it. – I’m actually gonna let Brooke take that, because Brooke is the machine learning practice lead. – [Brooke] (laughs) Sure, I’ll take this one. So the question was about the road map for the machine learning modules in Spark. They’re referring to MLlib, and are there other types of integrations you can do with Spark, with tools like NLTK, TensorFlow, et cetera? So when we talk about MLlib, that’s the umbrella term we use to refer to Spark’s distributed machine learning library. There are technically two different machine learning packages: there’s MLlib, the old one, based off of RDDs, and Spark ML, the newer one, based off of DataFrames. Spark ML allows you to train models on massive amounts of data, but it does have a few drawbacks. Notably, it doesn’t support all of the same algorithms that scikit-learn does, and it doesn’t have any deep learning modules like TensorFlow. But you can still take advantage of Spark for machine learning by doing things like model inference at scale: you can load your model onto each of your workers and then do inference across your data. You can also use Spark for things like distributed hyperparameter search, where you build a different model on each of your workers, and then you can use a tool like Hyperopt to evolve your hyperparameter search for each subsequent model. So even if you are building single-machine models, or even if you are using pandas, you can still take advantage of Spark.
– [Kelly] Thanks, Brooke. – [Brooke] Yep, and then other folks are just asking, can we have another webinar on Spark optimizations? That’s a great suggestion, we’ll put it in the queue. There are also tons of great talks from Spark Summit on that topic. – Yeah, if you just search for Spark tuning, old Spark Summit talks are all available on YouTube, and there are some really great, really in-depth Spark tuning talks.
– [Brooke] And I think we’re at time for other questions we could answer live right now. Karen, do you wanna help us wrap this up?
– Sure, I was actually just looking for the link to register for our Spark Summit, which I just dropped in the chat. It’s virtual and free this year, so I highly encourage everyone to join if you can.
Great, so thank you so much, Kelly, and thank you, Brooke and Denny, for TAing. And thanks, everyone, for joining us. If you joined through Zoom, you’ll get an email within 24 hours with the recording and links to a few other resources that were mentioned. And, again, the video will be available on YouTube, so I’ll point to that in the email as well. So thanks, everyone, for joining.