Data scientists face numerous challenges throughout the data science workflow that hinder productivity. As organizations continue to become more data-driven, a collaborative environment is more critical than ever — one that provides easier access and visibility into the data, reports and dashboards built against the data, reproducibility, and insights uncovered within the data.. Join us to hear how Databricks’ open and collaborative platform simplifies data science by enabling you to run all types of analytics workloads, from data preparation to exploratory analysis and predictive analytics, at scale — all on one unified platform.
Austin Ford: Welcome everyone to this session, Learn to use Databricks for Data Science. I am Austin for a member of the product management team here at Databricks, and I’m going to quickly walk through a brief intro to the Databricks platform before turning things over to my colleague, Sean Owen, who’s going to work his magic and show you how awesome doing data science in Databricks is. So diving right in, data science is a tough job, we all know this. I used to be a data scientist myself. I know how hard it can be. Today, companies are becoming more and more data-driven. The ones getting the most out of their data will be the ones to succeed in an increasingly competitive landscape, and as a result, data science has become a core capability of many businesses. Unfortunately, it comes with a challenging and complex workflow that only becomes more challenging at scale.
So what does this workflow look like? If I’m a data scientist, I always start a project with a business question that needs to be answered with data. Before I can get started with the work of doing data science though, I need to make sure that I have properly set up my development environment. To do this I need to do three things. I needed to be able to find and access the right data to do my analysis, I need to get a correctly sized piece of compute for my task, and I need to get my toolbox ready with the packages and libraries I’ll need for my work. But that’s just the beginning, once the overhead of setup is complete the real work is going to begin. I start by exploring the data I’ve collected so that I can familiarize myself with it and form some hypotheses, and then using methods like statistical inference, visualization, modeling, I’ll uncover insights about the data. And then I’ll finish the analysis by synthesizing those insights and results into answers to the business question that started the project.
The most important step in the whole process comes next, sharing my results with my stakeholders, so that all my work, all that time I spent, has an impact on the business. To do this, I’m going to package my work into a report or a dashboard that my stakeholders can easily consume, and I’ll share it with them by email or slack or some other channel. From there, I get a feedback about the work and iterate on it with those stakeholders to drive the biggest impact I can.
So how does Databricks work with you through this workflow? Our answer is the Databricks Bakehouse platform, a unified place for all your data and data analysis. With it, we want to remove all this overhead of setup and configuration so that you can focus on the most important part of your work, the data science. The Databricks Bakehouse platform is unique in three ways. It’s simple. All your data comes together in one place so that all data workloads happen on one common platform. It’s open it’s based on open source and open standards to make it easy to work with existing tools and avoid proprietary format. Finally, it’s collaborative. You can work with other data scientists as well as data engineers and analysts to deliver insights that have an outsized effect on your business. Putting this all together, this is why Databricks is the data and AI company. No one else can do what we can do with such a simple single solution.
So in this session, like I said, we’re just focusing on the data science part of the Bakehouse. Please check out other sessions for deep dives into other parts of the platform. So how does Databricks work with you when doing data science? First off Databricks makes setup easy. All your data sources are in a single place. The Bakehouse brings them all together so you don’t have to go digging. You can easily choose the right compute resource for your task. And that can be anything from a single machine VM, to a VM with a bunch of GPU’s, or a big spark cluster, and you can switch between them as needed. Finally, Databricks comes with runtimes that package together all the standard open source data science tools, and it’s easy to customize your environment to add new ones.
Continuing on the theme of open source tools, Databricks brings together all the open source standards for data science development. The Databricks notebook supports multiple languages, from Python to SQL or Scala, and it enables you to work with your colleagues in real time with co-presence in our notebooks, commenting in our notebooks, and co-editing. The notebook also comes with built-in Plotly visualizations, which you can use to quickly derive insights from your data with just two clicks. Finally, the notebook is also integrated with Git and the most popular Git providers to support finder grain version control.
Lastly, when you’re finally finished with your analysis and all that hard work, Databricks helps you to share your results in the best form for your stakeholders. You can easily share your notebooks as reports, just by giving those stakeholders access. You can also quickly create a dashboard directly from the output of your work in the notebook. Lastly, you can make that data that you process any new data you may have created available to BI tools like our own SQL analytics, Tableau, Power BI and others through Databricks SQL endpoints.
Now to get practical and show you how it’s all done, I’m going to turn it over to my colleague. Sean, he’s going to walk you through a hands on deck, take it away Sean.
Sean Owen: Great. Thanks Austin.
Hello. Welcome everyone to the Spark Summit, and as Austin alluded to you, I’d love to show you how to do data science on Databricks. So as Austin alluded to you, we’re not going to do machine learning here. We’re not going to use ML flow here, we have other sessions for that. We’re not going to do data engineering either. This session is directed at people who are maybe working in Python or R to analyze data, to get insights from data, maybe fit statistical models to data as well, traditional data science. And I want to show you that you can do these things on Databricks in the same way you do it anywhere else, but we provide some enhancements to maybe make your life easier or help you scale up a little to larger datasets. So let’s do it.
First of all, for those that are new to Databricks, this is Databricks. This is the books you’ll be looking at, the environment you’ll be using, if you use Databricks. It’s a notebook-like environment, so where you can write code, execute code, see the results, and of course write markdown cells that document what you’re doing as well. And I’ll have to tell people that there’s really not much more to see in Databricks if you’re a data scientist. You don’t really have to deal with things like where’s my data, where are my compute resources. In this case, I’m just going to connect to a shared Spark cluster here. Even though some of the stuff I’m doing is not going to be using Spark, you as well may be in an environment where you just connect to an existing cluster. You don’t even have to think about the compute resources in the cloud.
They’re there. They turn on if they’re not on, they turn off if they’re not used. They scale up, they scale down, you just really don’t have to think about it. And likewise, for libraries. So you are probably using a lot of common libraries in Python or R. So in our runtime, in our ML runtime here, which I’m using here, most of the libraries you probably are using are already installed for you, whether it’s an R, Python, Scala, etc. You don’t have to worry about it. In this case, I actually am later going to use a few libraries that are not pre-installed in the runtime, one is called Folium. And I also want to update the version of Seaborne that I’m using that want to use some really new features of a package called Seaborne. But the message here is that’s really easy to do as well.
You can just PIP install those libraries as dependencies, and they only affect your notebook and your execution. So there’s no worries that you’re changing the environment for other users of some of the shared cluster. Even if you’re sharing a cluster, that’s fine, this just affects your notebook. So let’s have some fun with some data here. I’ve chosen the dataset here about New York city taxi rides in 2013. And there’s a couple of these datasets floating around. I picked one that I, that I like here, and it’s not a complicated data set, but I think we can actually have a fair bit of fun with it. There’s actually a fair bit to see, even in the simple dataset. So this session is oriented towards users that are maybe a little bit new to Databricks, maybe coming from an environment where they’re using Python or R are on a single machine and are wondering, how is Databricks different? Do I have to change how I work? What those Databricks provide for me?
And my first message to you, starting with the Python users, is that you really don’t have to change your workflow if you don’t want to. So one of the first things you might do, if you’re a Python user, is take this dataset, which is a bunch of CSV files, and read it with pandas. That’s fine, you can do this exact same thing in Databricks. That’s what I’m going to do here to start here. I’m going to read, there’s 12 files here, so we need to put them together. Now the only thing this highlights about pandas as bad is that pandas does not distribute, by itself, across a cluster of machines. So it’s a little bit hard to analyze really large data sets with pandas. So you’ll notice here that I’ve actually selected a 1% sample of the data, and that’s because the entire dataset would take about 80 gigabytes to fit into memory, then maybe we could do that. We could get a big enough machine to do that, but Hey, this wouldn’t really work if the dataset was a terabyte, right? But fine, let’s start with maybe what you know, what simple. Let’s use pandas to see how far we can get with the data.
So if you load a data set in pandas in Databricks, that’s fine. We can do that and we can render it nicely for you here, and we see already a little bit about the dataset. We have things like the hash, medallion, and license number of the taxi, when a particular ride started and ended, how many passengers there were. And if we scroll to the right, we see it also includes things like how long it took, the distance, and the pickup and drop off, latitude and longitude. Pretty interesting, there’s some interesting stuff here. So any good data science process might start with just taking a look at the data? How is it distributed? And we can use pandas to describe the data, and we already see some interesting stuff here.
If I may point out that, well, the minimum passenger count of any ride is zero. There were some rides that had zero passengers. Not really sure what to make of that, maybe it’s legitimate. But what I won’t believe is that the minimum time, the duration of a taxi ride, was negative ten seconds. I don’t think that can happen. Also, if you look at the maximum duration of a taxi ride in this dataset, well, it’s about two months. And I don’t know, I’ve had some long taxi rides, but they weren’t two months. This looks suspiciously like two to the thirty-second milliseconds, so maybe this is just a data error here. Likewise, if you look at the distance, there’s one trip in the data set of nine million miles. That’s about 360 times around the earth, so I think that’s probably wrong. And if you look at the latitude and longitude as well, latitude can only be plus or minus 90 left, longitudinal will be plus or minus 180. So some of these minimums are obviously wrong.
And even putting that aside, some of the points in this data sets, these pickup and drop off points, are nowhere near New York City. And I’ve known people that have taken taxis for hundreds of miles outside of New York but probably not thousands of miles. So there’s obviously some problems with this dataset. As a pandas user, you might begin by just filtering out the data points that are obviously wrong. We can do that in pandas, right? No problem here. And having done that, we can start to explore the data further. So for example, I think pandas users will know that it integrates nicely with matplotlib, provides some nice plotting functionality out of the box here.
Then maybe I want to take a look at this trip time in seconds. I think this is going to be a key column in the data that I want to look at more. Maybe this is the business question I’m interested in exploring, like what does trip time in seconds look like? How long are taxi trips on average? So we might start exploring that by plotting the distribution of this from my sample, from the sample I’ve loaded with pandas here. And that makes some sense. I think obviously it’s okay, having filtered out the negative values, none of them are negative, and this distribution looks log normal to me. So maybe later on, I’m going to fit a distribution to that. And if you’re noticing some weird peaks here, I’ll come back to that later. There’s something going on with this dataset, but maybe this the key feature of the data that I’m interested in as a business user, how long were taxi trips in general.
But sure. All the things you’re used to using and pandas work as well in Databricks, so no problem there. We might also want to figure out which of the taxis are most busy. So taxis are identified by a medallion. If you’ve been in New York City, you’ll know that taxis have this alphanumeric code on top of them like 3Y90, and those values are not present in the dataset, they’re hashed here. So we have the hash of the medallion, but we can still of course aggregate over that value and see which taxis in New York City are providing the most rides here. We can do that this way, but it’s a little bit unsatisfying because of course, this is just a sample, so these counts are not accurate. We’ve just sampled 1% of the data, and we counted this medallion was the most busy one, according to the sample. But if we’re going to go further than this, maybe we want to up our game and look at the whole data set. Then we can do that.
We can do that. We have a cluster of machines available here in Databricks with Spark. So why don’t we switch gears a little bit and see what we can do to analyze the whole data set without hitting the limitations of something like pandas. Well, that’s where a tool like Koalas comes in. So if you are a pandas user, you maybe have a bunch of pandas code you’ve already written, and you don’t want to rewrite it. Of course we can rewrite a lot of these operations in terms of PySpark, which has a similar but different API, but maybe you don’t want to do that. You don’t want to learn PySpark that’s where Koalas comes in. Koalas is a re-implementation of most of the Panda’s API on top of Spark.
So that exact same code we’ve been executing, we can execute through Koalas, through Spark, on the entire dataset, and that’s actually about 80 or so gigabytes, with really no change to our code. Now the one thing that does change, yes we have to read the data a little bit differently, here I’m reading it directly with Spark. That’s a little bit simpler and I’ve thrown in some casting just to make the data look a little bit nicer. This is not strictly necessary, but I did it just to make the example work better. But we’ve read the data with Spark here and having read the data with spark, but we can of course do the same sorts of things. We can call the Spark version of describe to summarize the data. And if we look at the summary, it’s a little bit different. It gives us similar ideas though.
For example, the apparent minimum trip time in seconds is not just minus 10 seconds, then minus 6,480 seconds or almost two hours. So I don’t know what’s going on there, but someone’s gone back in time. So if you look at the whole data set, you can even see more outliers here. But as I promised, maybe we want to do this work without rewriting a bunch of code and understanding PySpark APIs. So let’s use Koalas to actually do the exact same analysis that we did above. So here I’m transforming this Spark DataFrame into a Koalas DataFrame. It’s a wrapper on top of the Spark DataFrame that gives us methods and syntax that work like pandas. So this is the exact same line of code executed above on this pandas sample of the data, and it works as well. So now I’m executing over however, the entire dataset.
And indeed I see the same sorts of interesting artifacts. The worst trip time in seconds is much smaller. And boy, some of these longitudes and latitudes are really quite crazy here, but we can execute the same code to filter it here. This is again, I assure you the exact same code I showed you above. This is pandas syntax, but it’s implemented in terms of Koalas, so this happens across a cluster on the entire dataset. And we can also render a plot, a histogram of that trip time in seconds, which is one of the key features we’re interested in. Now in Databricks, if you do this, even though calls is open source and you can use it anywhere else, like a lot of open source things we can make Koalas run a little bit better in Databricks. And so in DataBricks, if you plot something with Koalas, we can detect that and render a nicer Plotly based histogram from this data.
So you’ll see that this is actually a little bit interactive here. I can mouse over the dataset and explore it a little bit more. And the nice thing is we’re actually looking at the histogram of the whole dataset. So maybe we can have a little more confidence, and that what we’re looking at is the true distribution. And I’ll tell you that something interesting emerges if you make this histogram a little more fine grained. Because we have the whole data set, we can do this and see something more interesting. What if I up this to a thousand to slice these spans a little more finely? Well, this happens here. So you see that there’s actually two distributions going on. Let me zoom in on this. Yeah. So this distribution of frequencies of trip times in seconds has these very clear and regular peaks.
And if you look closely, you’ll find that these peaks occur at multiples of 60 seconds or one minute. So if you spend a lot of time in New York like I did before 2013, you may think that yes, New York City tactics measure their time and their fares in terms of minutes, maybe newer ones can measure them down to the second and bill to the second I’m not sure. But this is quite apparent from the data, a lot of this data seems to be reported with this trip time feature quantized to the nearest 60 seconds or so. And that becomes obvious if you can plot this and hey, that’s an example of a data science. But given the tools, given easy tools to look at the data, we can see these things and at least work around them, or if we need to go upstream and fix them. But yes, let’s be aware that this column, yes, the data’s good and in the main, but a lot of it is rounded to the nearest 60 seconds, obviously.
So we can go back and do the same thing again, same pandas code executed through Koalas across the entire data set, so we can efficiently do counts by medallion here over the whole data set. And we see that some medallions, some taxi numbers, were really quite busy in 2013. In fact, I’m not sure, I believe it. Apparently one taxi medallion delivered almost 22000 trips in 2013. If you do the math, that’s about 60 a day, that’s five an hour for a 12 hour a day. That seems kind of unlikely. And I won’t show it here, but I went and broke this down further by license and medallion and both in the same time, and you still come up with a maximum that’s really quite high. So I’m not sure what’s going on there. Maybe if you were the New York City Taxi and Limousine Commission, you might look at some of these to see if people, maybe are several different distinct people, are sharing a medallion or a taxi license.
I’m not even sure that’s a problem or not, but this seems to be pushing the bounds of what is physically possible for taxis in New York City. And we can also, if we like, flip on the built-in visualization in Databricks, which I’ve shown you a little bit above to plot this data again, another histogram. We don’t have to use other tools, we can just access these Plotly-based visualizations in Databricks if we just want to take a quick look at the data. And so if I plot the distribution of rides by medallion and the counts, well I see this distribution, that it’s normal-ish but obviously it’s truncated at the right, because there are physical limits on how many rides a taxi can possibly provide in a year. But you al. so see this nice peak at the left here, and this is not zero. This is one, if you look at the data. So why might that be? I’m thinking that maybe there are rules that say if you don’t use your medallion at least once a year, you lose it.
So maybe there’s some incentive for a lot of taxi medallions to take a ride at least once a year and only once a year, and so maybe that’s what’s going on here. I don’t know, but if I were a data scientist and I were exploring this data on behalf of New York City, then maybe this would be, these medallions, would be of quite a bit of interest to me. And it’s nice that I can see that pretty easily just by creating the data and firing up a quick visualization of it.
So moving on, I think I’ve addressed the Python users here, but of course I think a lot of people in the audience are probably SQL users as well. You’re used to acquiring data with SQL and you may be wondering, what does this tool have for me? Well, the good news is of course you can query data in SQL in Databricks as well. I think as Austin alluded to, Databricks has an entire BI like environment called SQL analytics. I’m not going to show you here, but I’ll give you a quick peek at it here from our documentation. So rest assured that if you’re a SQL user and want a BI like environment, we have that in Databricks. But I want to show you here that you can do that in line in a notebook as well. So I’ve been creating data with Spark, maybe via Koalas here. But if I like I can, I can switch to SQL or a different language, even within the same notebook.
And I’m going to switch to SQL here to query this DataFrame, this Spark result that I’ve been forming here. And I can do exactly the same things if SQL a more comfortable language to express the queries you want. So here I am issuing the same query, show me the counts of rides by medallion, and I get the result here as a DataFrame. And as I say, later I can turn on the visualization of these and construct similar plots. But of course I think SQL users out there are not just used to SQL, they’re used to database systems and they want to work in terms of tables and databases. And you can do that too.
So the result I’m querying here is not a table. It’s it’s a Spark DataFrame. It’s the result of transformations I’ve expressed with Spark. But if I like I could persist that results to disk, to distributed storage in your storage account, and then register that dataset as a table in a database. So that’s what I’m doing here. I’m going to save this dataset as a Delta table. Delta is an open source storage engine from Databricks, and it builds on Parquet to add some additional features like merges, upserts, deletes, transactions, and time travel, to let you query the state of the data at a previous point in time. So it’s a good choice, I think, if you’re working in Databricks, but you really don’t have to, you could these as parquet files, is [inaudible] files, JSON, it really doesn’t matter. You can read or write anything Databricks.
But I’ve written it as as Delta. And I’m going to register this table as a table in the Metastore here that’s built into Databricks. And if I do, this becomes visible here in my Metastore. Here we go, and that helps discovery. So other people now can access this data from this database, from this table, even though it’s really just files on disk on distributed storage in an open format. So for, for SQL users, for database users, rest assured you can do these things. You can still have concepts of databases and tables and data that’s persistent and so on. And if I do that, of course, well, I can continue with SQL here, even in line at a notebook and issue different queries. Maybe I’m interested in the distribution of passenger counts across rides here. So if I break this down by vendor and passenger count, I see, well, number one, strangely enough, there’s quite a few rides you recorded in 2013 without any passengers.
Not really sure why that is. I’m not sure it’s invalid, but I don’t. I can explain that typically taxis have passengers, but I might want to break this down and visualize these counts by vendor and passenger count as well. And I can, of course, just simply flip on the built-in visualization here in Databricks. And I get something like this and immediately something pops out at you here. For these two different vendors, their distributions are different. So CMT is a mobile knowledge systems, and I think this is the company that runs the hardware for your standard New York taxis. And indeed, for those in blue, you see that a well as expected most of the time taxis had one passenger or sometimes two rarely three or four, and really not more than that, but for taxis operated by VTS, which is Verifone Transportation Systems, you see the spike at five and six.
I think having lived in New York, I believe that’s because Verifone operates a service where you can pre-book much larger taxis, like vans, to transport more than four people that could possibly fit comfortably in a taxi. Maybe that’s interesting, maybe it’s not, but these are the kinds of things that are easy to see immediately from this data set, if you just load it and query it with SQL, and maybe plot it with built-in tools here in Databricks. So let’s switch languages again. I imagine there’s some people in the audience who are R users. R is a rich statistical environment, and it’s one of the tools that people might most often reach for to do more advanced statistical analysis of data. And the good news is in Databricks R as available as well. So when you run a notebook, there is an R interpreter available as well. If you wish to use it, to also create the data and transform it.
So here, I’m going to switch into R, I can do that within a notebook here by switching with this magic command. And I’m going to load SparkR, the library that lets me access data by SparkR, maybe stored out there as Delta tables, in R and that’s what I’m doing here. And it’s really just the same thing. So if you’re an R user, all the stuff we’ve been doing in Python, you can do in R as well. There’s equivalent libraries to manipulate the data at scale with SparkR. So here’s our dataset, but again, as I say, I think we’re interested in this trip time feature. As I said above, I think it looks log normal, and I want to drill down on that question. Is it actually log normal? And if so, what is the log normal distribution of trip time in seconds.
Now I know how to do that in R, I might know how to do that in Python if I looked up the equivalent methods and stats models or something, but I definitely know how to do it in R so why don’t I just do it in R? So what I’m going to do here is, well, again, I need to take a sample of the data because we’re working in R and this datasets huge. So I can’t bring the whole data set down, but I don’t really need it to maybe look at the distribution of trip times. So I’m just going to take a random sample with Spark and pull that small sample down to as an R DataFrame. And from there, I can work with this data set as I would any other in R.
So if the data set is log normal, then the logarithm of the data points should be normally distributed. Then I can assess that with a Q-Q plot, standard stuff in R. We do actually have built-in visualization for Q-Q plots, but I wanted to show you how to do it in R because I think a lot of users out there are probably accustomed to doing this in something like R, and I just want to show you that it works exactly the same way. So here, I’m going to make a Q-Q plot of the logarithm of trip time. And I’m also going to fit the theoretical fit line for this data, so I can maybe assess whether this looks normal or not. And here’s what I get. It’s going to be very familiar for people that have worked in R or RStudio. I will say, as an aside, we do have an integration with RStudio as well for the R users. You can run RStudio in Databricks and use that with the Databricks cluster, but here I’m just using R directly in the notebook, that’s fine too.
And what do we see? Well, this blue line tells us what the data should look like if it really is perfectly log normal, and it’s not quite. It’s notably light tailed on the left end here. And maybe that makes sense because, well, it’s probably unlikely you take a taxi ride that takes one minute or 30 seconds. So even though this distribution might predict that there’s some of those, those are disproportionately unlikely in practice because you can just walk, or also, there is a minimum fare for a taxi ride. It was $2 in 2013 before, when I was there, it’s $2.50 now. So that I think maybe explains why this is actually in the main log normal, but it doesn’t quite fit at the low end tier.
These jumps to the left, I think that these are again explained by the quantization I referred to earlier, a lot of taxi trips are rounded to the nearest 60 seconds. And so I think that explains these kind of odd discontinuities here. But the fit’s pretty okay. Maybe I’m interested in the fit because maybe I want to model the distribution of taxi ride times, and I can do that with a lot of libraries. I can definitely do it in R, I know how to do it in R, I don’t know how to do it in Python, so let’s just use R to do it. So I’m going to load mass here, this is pre-installed, and I’m going to fit a log normal distribution to it, and then just plot my dataset bins by frequency with the distribution fit over it. And that looks pretty good.
I don’t know. It seems like a reasonably good fit. There are these weird discontinuities in the histogram as we’ve discussed earlier, but this is a pretty reasonable fit. So this is the kind of analysis you needed to perform, and you know how to use R to do it. No problem. You can just use the familiar tools and R, and Spark is just there to help you load and manipulate the data. Okay, so there’s one piece of this data I have not talked about yet, which is the location data. So this data set has pretty rich data about where these trips started and ended. And what do we want to do with that? Well rather than analyze them statistically, maybe we just want to look at them. We want to see where trips started and ended on a map. Now, a tool like Databricks does not have built in tools for plotting location data as a heat map on a map, but that’s fine because the larger open source ecosystem out there does, and the library that can do that easily, one of them is called Folium.
And that’s one of the libraries I installed earlier into our runtime because I want to use it here. So Folium can do a lot of things, one of the cool things it can do is plot location data as a heat map over actual maps of the world, from Google Maps. And because I can easily install it, because it can accept data and say pandas format, and because I can easily grab a sample of data as a pandas DataFrame, these things are easy to put together here in Databricks. So here I am going to read some of this location data. I’m just going to pick up a sample because again I don’t need the whole data set and it’s too large anyway to pull down, but I don’t need that to get a sense of where the pickups and drop offs are.
And then I’m going to use Folium to render a plot that shows me where the pickups and drop offs are. And it’s pretty cool. I can zoom in, it’s interactive. I can take a look at where people are. These are pickups by the way, and so obviously it makes sense that people tend to get picked up on corners of intersections, more on avenues, and less on these small cross streets. So that’s pretty interesting, and maybe I want to share this more widely with the organization, but I really don’t want to share this entire notebook. We’ve done a lot of work here, there’s lots of complexity here, and maybe I want to just share this output with someone so they can explore it themselves. So you can do that too in Databricks. You can export the output of one or more cells as a dashboard, and I’ve done that here.
So I’m going to pop this up here, and I can share just this. I can basically share just the output of one cell. And actually I didn’t comment on this, but I instrumented this cell with a widget that takes as input the hour of the day that we’re interested in. So maybe I want to look at what the distribution of pickups like at 8:00 AM versus maybe later at night. So I can do that and I can, if I instrument it with a widget like this, as the widget changes values, the cell executes the plot re-renders. And so you could share this as a lightweight interactive with tool with other users, maybe they’re data scientists to go explore. So 8:00 AM, let’s see what’s going on. I used to live, a long time ago, down here on the lower east side, and there’s not a lot going on.
There’s not a lot of people taking taxis from the lower east side at 8:00 AM on any given morning. But if I flip to my goodness, maybe 1:00 AM, maybe the picture changes. Let’s take a look. So I’ve changed the one we’re going to rerun this whole query. We’re going to re-generate the plot, and now I can zoom in again and let’s see it. Let’s see what’s going on the lower east side at 1:00 AM in general. I suspect it’s a little bit busier. Zoom over here, I can already see there’s a little more activity, but yeah, indeed. I mean, I think anyone that may be familiar with this area would agree that it’s more likely that people are taking taxis from this area at 1:00 AM than 8:00 AM. So maybe that’s interesting. So with a little bit of data science, use of nice open source libraries like Folium, you can produce some pretty interesting interactive visualizations and applications here in Databricks, which could be useful to the organization.
The last thing I’m going to show you as a bonus rounds, let’s talk about those hash medallions. So sometimes data science is not just a matter of competing aggregations and fittings statistical models. Sometimes we have to go a bit off road, write some code, do something a little more unique, to pursue a particular goal here. And I want to show you something interesting, maybe unfortunate about this dataset that a friend of mine noticed over 10 years ago, eight years ago about this dataset, and his name is VJ. And he looked at this data set and notice that yes, these medallions and these hack license numbers were hashed. They were hashed in order to protect privacy. They didn’t want to publish the actual medallion number involved in each transaction. So they hashed them. And that’s a good idea. I mean, hashes are generally not reversible, but they’re at least deterministic.
So at least we could do things like see that there was a medallion ID that had almost 22,000 rides in a year in New York without revealing what the medallion was. However, there was a problem. So he noticed that if you looked at the hash values, you saw some kind of strangely familiar values here. And he quickly figured out that what they must have done is simply apply MD5 hash to the strings, to these medallion and hack license strings. Now that’s not a terrible idea, but it was pretty obvious. And he figured this out not because hashed the empty string, and I think some developers out there will recognize the hash of the empty string MD5 hash, but a lot of the values turned out to be zero. And the hash of string zero is this that you see here, and this was the giveaway about what they’d done. Now that’s still fine because the hash is not directly reversible.
However, there are only so many possible medallion IDs and hack license IDs. If you’ve been in New York, you know that the medallions look like 3Y90. There’s only so many of them, they follow a particular format. So it’s entirely possible to generate all of them, hash all of them, and then generate a reverse lookup table. Now that could be computationally sound, computationally difficult, but it’s not because there’s only about 20 million possibilities and this you could do on your laptop. But just for fun, I reported this to Spark and here in this bit of code, and I’ve had to write some more lower-level Python code to do this, but I can do this in Spark because Spark’s available, and in just about four seconds I generated all the possible un-hashing of all possible medallions and hack licenses in New York City like so. Given that, of course that’s a data set that I can join on the original hash data to tell me what the actual metallic was for each of these rows.
So I’ve done that here, generated about 20 million possibilities, joined it here and yeah. Voila! Even though we start with hashed hack licenses and medallions, we can easily uncover the original value. Now I’m not suggesting you should do this. I’m not suggesting this is the goal of data science, but if you need to do something like this, it’s nice to have the tooling available here. I can write lower-level Python code and distribute it, if I need to throw a lot of compute at a problem like this. It’s also a cautionary tale. If you are hashing data, to release it to the public and your dataset, please don’t just do this. Please salt your data and then hash it at the very least. Because we have a bit of time, I’ll tell you as coder here, unfortunately, people picked up on this and found photos of celebrities that were maybe getting out of a taxi at certain locations at a certain time, and used this dataset to go work backwards and figure out what the source of the trip was.
Because if you know, that person got out a well-known place at a certain time, and you can read the medallion out of the photo, then you can do something like this to figure out what trip it was and work out where the trip started. And there’s actually a different dataset here that includes things like tip. So you can work out which celebrities are good tippers or not. And that was unintended, of course. And I, again, I’m not suggesting this is something you should do, but for the resourceful data scientists, it’s the kind of thing you absolutely can do or should be careful about not allowing in your own analysis. With that, thank you very much. I hope this has been a useful overview of what Databricks looks like for the data scientists who’s wanting to summarize data, to visualize data, to maybe fit statistical models, maybe coming from the world of R and Python. And I hope we’ve shown you that you can do the exact same work you do already, but we just offer you some more possibilities to scale it up.
Thank you very much.
Sean is a principal solutions architect focusing on machine learning and data science at Databricks. He is an Apache Spark committer and PMC member, and co-author Advanced Analytics with Spark. Previo...
Austin is a product manager on the Data Science team at Databricks. Prior to joining Databricks, Austin built products for data scientists at Domino Data Lab and a learner-centric K-12 education platf...