Implementing tools, let alone an entire Unified Data Platform like Databricks, can be quite the undertaking. Implementing a tool whose ins and outs you have not yet learned can be even more frustrating. Have you ever wished that you could take some of that uncertainty away? Four years ago, Western Governors University (WGU) took on the task of rewriting all of our ETL pipelines in Scala/Python, as well as migrating our Enterprise Data Warehouse into Delta, all on the Databricks platform. Starting with 4 users and rapidly growing to over 120 users across 8 business units, our Databricks environment turned into an entire unified platform, used by individuals of all skill levels, with varying data requirements and internal security requirements.
Through this process, our team has had the opportunity to learn while making a lot of mistakes. Looking back at those mistakes, there are a lot of things we wish we had known before opening the platform to our enterprise.
We would like to share with you 10 things we wish we had known before WGU started operating in our Databricks environment. We cover topics surrounding user management from both an AWS and Databricks perspective, understanding and managing costs, creating custom pipelines for efficient code management, new Apache Spark snippets that saved us a fortune, and more. We also provide our recommendations on how to overcome these pitfalls, to help new, current, and prospective users make their environments easier, safer, and more reliable to work in.
Jake Kulas: Good morning, afternoon, and evening from wherever you’re connecting from. My name is Jake Kulas and I’m from Western Governors University. Today we’re going to be talking about 10 things learned releasing Databricks enterprise wide. For beginners, intermediate, and possibly even some advanced users, business owners, tech leaders, and even developers, taking just a few of these lessons and trying to implement them can help alleviate any growing pains you may be seeing trying to release Databricks enterprise wide. Today we’re going to cover a brief speaker background, what WGU is, our current implementation architecture, and finally our 10 things learned.
First off, as said before, my name is Jake Kulas. I’m a senior big data developer and data engineer at Western Governors University. I’ve been at WGU for about nine years now. I moved from Wisconsin to Utah about 12 years ago, mostly for skiing, but a little bit for school. I have my bachelor’s and master’s in information systems from the University of Utah, where I had a focus in data management, data storage and data architecture. I’ve been working with Apache Spark and Databricks for about four years now. Some big projects we’ve worked on recently include rearchitecting our enterprise data warehouse and our data lake architecture, administering over 220 users on our Databricks environment, and doing tons of trainings, best practice sessions and local speaker sessions, both with Databricks and by myself.
What is Western Governors University? Western Governors University was founded in 1997 by 19 US governors. Our whole motto is education without boundaries. So how can we ensure more of our residents have access to college education that fits their schedule? We’re a nonprofit, all-online, competency-based education system. That means that we measure skills and learning rather than time. We provide undergraduate and graduate degrees for nursing, business, IT, and a number of other colleges. We’re regionally and nationally accredited, and we currently have eight state affiliates. We have graduated over 228,000 students and currently have 135,000 plus active students, making us one of the largest universities in the country right now. We have 8,000 employees that span all the way across the country, but mostly in Salt Lake City, Utah.
Why did we want to implement Databricks? Well, our implementation reasoning comes down to two main purposes. First, we wanted to unify our data platform: our data engineers and their ETL pipelines; our analysts and researchers and their reporting and ad hoc needs; our data scientists and their machine learning; and our psychometricians and statisticians and their modeling. Second, we wanted to rearchitect our entire enterprise data warehouse and our newly created data lakehouse on the Delta Lake architecture, which originally shipped with Databricks.
Doing this implementation though, brought up some key challenges, and that’s where we have our implementation without education. We had new languages like Scala and Python that a lot of us didn’t have tons of experience with. We had a whole new platform like Databricks that we had no idea how it worked. We were brand new to cloud architecture, so working in AWS had tons of internal difficulties for us, not including our networking teams and other operational teams that had to help us set some stuff up.
Finally, we wanted to roll this out to our entire enterprise: eight business units of analytical users, over 140 direct users at that time, and over 300 jobs to convert. Doing so, we ended up deciding on this current implementation architecture. We have a lot of source systems coming from things like Qualtrics, Salesforce, on-prem and cloud databases, Adobe Analytics, Banner education systems, and tons of different flat files. We use ELT and ETL, and batch and stream jobs, to bring this data into our data lake. That hydrated data lake is then used for machine learning, analytics, reporting and all other types of ad hoc analysis.
Data from Databricks, as well as the data lake itself, then gets populated into Tableau, Power BI, IBM Cognos Analytics and other systems that serve our reporting needs.
Some big key mistakes and challenges that we really hit along the way fall into four major categories, and each one of these categories has our lessons learned. First off, understanding our use of Apache Spark and Delta. Shiny new toys can really be hard to control, so that’s where we learned about optimizing JDBC, Delta optimizations and multilingual empowerment. Then our code management and our new environment. How can we manage our code in this new notebook based architecture? That’s where we ended up investigating CICD and the ability to reduce, reuse and recycle our code.
Next is cost management. So really talking about job and cluster management and wow, those clusters can get really expensive, fast. And then finally, our user management. So talking about user groups, cluster segregation, leveraging Databricks secrets, and then training and best practices. How can we protect ourselves and others from our own users and how can we make this environment the best for our users to work in?
Lessons one through three deal with understanding Apache Spark and Delta. Optimizing JDBC, Delta optimizations, and multilingual empowerment are the three lessons we learned here.
Why are our reads so slow? This is the very first question we had to ask ourselves. We had just implemented Databricks and we were running into a lot of issues bringing data in from our original source system, Oracle, where our enterprise data warehouse existed at the time. We were missing an understanding of some of the key properties that the JDBC read process utilizes.
In our example here, you can see how we use these properties to do a sample query. In this query we’re reaching out to our Oracle database, where we’re selecting from a small table called student. We’re just getting a student ID and a student’s name and address. We set the fetch size to 10,000. That’s essentially how many rows we fetch per round trip. We can expand this to almost any value. However, expanding this up to maybe 50,000 or 60,000 can really start to cause some performance loss on the source system. So we ended up deciding that between 10,000 and 30,000 was a great fit for our Oracle system.
Next, we’ll talk about the number of partitions. This is how many concurrent, parallel connections are going out to that database. That all depends on how many cores we have: if we have five workers with four cores each, we can do 20 parallel reads to the database at one time.
Next is our partition column. In our case we’re using student ID, which is a primary key for this table and is sequential and fully unique. Next are our lower and upper bounds. These are the minimum and maximum values of that partition column, and that’s how our data is going to end up being split up. By utilizing all of these key properties, we were really able to bring efficiency to how quickly we are able to read data from our Oracle instance and other JDBC-type databases.
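To make the bounds concrete, here is a rough Python sketch of how a partitioned JDBC read splits the key range into one predicate per parallel connection. This is a simplified illustration, not Spark’s exact internal logic, and the student_id values are made up:

```python
def jdbc_partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Split [lower_bound, upper_bound) into num_partitions contiguous
    ranges, one WHERE-clause predicate per parallel JDBC connection.
    Simplified sketch of Spark's partitioned-read behavior."""
    stride = (upper_bound - lower_bound) // num_partitions
    predicates = []
    current = lower_bound
    for i in range(num_partitions):
        if i == 0:
            # First partition is open below, so no rows are missed.
            predicates.append(f"{column} < {current + stride}")
        elif i == num_partitions - 1:
            # Last partition is open above, for the same reason.
            predicates.append(f"{column} >= {current}")
        else:
            predicates.append(
                f"{column} >= {current} AND {column} < {current + stride}")
        current += stride
    return predicates

# 100,000 student IDs split across 4 parallel reads:
preds = jdbc_partition_predicates("student_id", 1, 100001, 4)
```

Each predicate becomes one round-trip stream, which is why the partition column should be evenly distributed between the bounds: a skewed column leaves some connections idle while one does all the work.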
Next we’ll talk about Delta optimizations. We have Spark and it’s really great and it’s fast, but if our tables are still slow, how can that help us? We really needed to understand how to fully optimize our Delta tables in order to take advantage of that. Optimizing is essentially coalescing your small files into larger files. However, it can be extremely expensive and taxing depending on what your table looks like and how frequent your tables are getting updated. We have two examples here in this code that show you how to do that. A full table optimize and an optimize based off a short period of time from the previous day, a job might have been run.
Also, we want to understand data skipping. This all depends on how your data is set up, and there are different methods of doing so. A key thing to note here is that data skipping itself is automatically active on all Delta tables and is applied whenever applicable. One of the big things we’re going to be focusing on is Z-ordering. In Z-ordering we’re essentially co-locating related information in the same set of files, so high-cardinality columns really should be used here.
In our example, we’re using program code. We have about 200 different programs in this table, so it’s going to be a great way to co-locate that data for people to use in their predicate statements. Z-ordering can also be applied to a full table, as well as within a recent interval.
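In a notebook these optimizations are just SQL statements, so a small helper that assembles them shows the shape of both variants. This is a sketch: the table student_events and the columns program_code and load_date are illustrative, not our actual schema:

```python
def build_optimize_sql(table, zorder_by=None, where=None):
    """Assemble a Delta OPTIMIZE statement. An optional WHERE clause
    restricts the compaction (e.g. to yesterday's partitions), and
    ZORDER BY co-locates data on high-cardinality columns."""
    sql = f"OPTIMIZE {table}"
    if where:
        # Note: OPTIMIZE's WHERE clause may only filter on partition columns.
        sql += f" WHERE {where}"
    if zorder_by:
        sql += f" ZORDER BY ({', '.join(zorder_by)})"
    return sql

# Full-table optimize, Z-ordered on a high-cardinality column:
full = build_optimize_sql("student_events", zorder_by=["program_code"])

# Optimize only the most recent data, as after a nightly job:
recent = build_optimize_sql(
    "student_events",
    where="load_date >= current_date() - INTERVAL 1 DAY",
    zorder_by=["program_code"],
)
```

In a Databricks notebook either string would be run with `spark.sql(...)`, or written directly in a `%sql` cell.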
Finally, we’ll talk about partitioning. Partitioning is really useful for splitting your data into groups. Dates are extremely helpful in this instance, but so are other low-cardinality groups; for us, maybe colleges. This all really depends on your data size, and having really large partitions can cause some issues. As you can see in the chart on the bottom left, we had a use case involving click data. We brought that data in and saved it as Parquet files, and originally we were seeing read times of 25 to 30 minutes on some of our queries. When we pushed that data into Delta unoptimized, we were able to bring that down to about 10 minutes or so. Finally, after we utilized some of these optimization techniques, we were able to read that data of hundreds of millions of rows in mere seconds.
Multilingual empowerment is our third lesson. Databricks allows us to use multiple languages within a notebook. It can really allow us to utilize what we know best in order to get our jobs done. For example, we’re reading some data in using Scala from a CSV file, doing some slight transformations and modifications, and then saving it to a temporary view. In our next cell, we’re using SQL to do some brief ad hoc analysis, maybe a visualization. And then we pass that temporary view on, read it in R, and run a model on it.
What this really allows us to do, is empower some less experienced users and analysts and engineers, or even people who are more comfortable with one language over another, to really use what they’re comfortable with. Mixing languages based on your task at hand can really be helpful in some instances. Whether that’s using R and SQL together, Scala, SQL and R together, or Python, Scala, and SQL together.
Lessons four through five relate to managing our code. There are two lessons here: implementing CICD and how we did that, and then reducing, reusing and recycling that code to make it more efficient for you and your enterprise. So CICD: there’s no single defined way to do it. It totally depends on your architecture. We wanted to use Databricks, some of our AWS resources, and all of the tools inside of those to come up with a CICD pipeline that worked best for us.
This started off by using folders and permissions inside of our workspace. So creating production-based folders and limiting those permissions based off of user groups. Those permissions included things like being only able to read notebooks, so no one can edit them. Small groups of people allowed to edit and run those notebooks in production. Some examples of folder permissions can be seen in the upper right.
The next feature we really wanted to utilize was our Git integration, which can be seen in the revision history section of a notebook. That’s typically in the upper right hand corner of your notebook. There you’re allowed to take your notebook and attach it to a repository inside of your source control system. In our instance, we’re using GitHub. By writing your code, you can then go and commit to your repository or revert your code and recommit. This empowers users to push code from their Databricks notebook environments, all the way to our source control in GitHub.
Finally, we wanted to utilize projects, which is now called repos API. The repos feature allows us to create workspace folders that allow us to attach those to our repository in our source control, whether that’s GitHub, Bitbucket or GitLab. That allows us to then take that repository and pull the specific branch from that repository and deploy those notebooks within that workspace. Combining all these features, we are able to create a proof of concept pipeline that we have been using to deploy some of our code.
A user starts off developing their code in Databricks notebooks. They open up their revision history. They commit their code and it saves it to our GitHub repository on the dev branch. They then create a pull request, and that pull request shoots off messages to a manager or other users. That manager or those users can then go and approve it, and once approved, our AWS CodePipeline picks up that repository modification. It then executes a Lambda function, which calls the projects (now Repos) API to update that repository. The repository then gets rescanned by the Repos API, which pulls in the new notebooks and also updates already-linked notebooks inside of that workspace. This allows us to have a semi-CICD process, all encapsulated within Databricks.
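As an illustration of that Lambda step, here is a minimal sketch of building the Repos API call that checks a repo out to a branch, which is what pulls the latest notebooks into the workspace. The host, token, and repo ID are placeholders; the endpoint shape follows the Databricks Repos API (PATCH /api/2.0/repos/{repo_id}):

```python
import json
import urllib.request

def build_repo_update_request(host, token, repo_id, branch):
    """Build the PATCH request that asks the Databricks Repos API to
    check the given repo out to `branch`. Host, token, and repo_id
    here are placeholders, not real credentials."""
    return urllib.request.Request(
        url=f"{host}/api/2.0/repos/{repo_id}",
        data=json.dumps({"branch": branch}).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )

req = build_repo_update_request(
    "https://example.cloud.databricks.com", "dapi-placeholder", 12345, "dev")
# Inside the Lambda handler we would then call urllib.request.urlopen(req).
```

The Lambda only needs the repo ID and a token with access to the workspace; everything else about which notebooks changed is resolved by the Repos rescan on the Databricks side.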
We have not yet investigated utilizing the clusters or the jobs API to create clusters and jobs in that process, but it’s something we look forward to doing in our version two.
Reduce, reuse, and recycle is a really great way to disperse your enterprise-wide functions across different notebooks. So essentially writing your code once and sharing it. You can do that by utilizing the %run command in your notebooks. In this example, we have a core notebook, which is calling two additional notebooks: a configuration notebook, which includes a lot of our property definitions — things like S3 path URLs or some static variable definitions — and our operations notebook, which is where we define our functions and our methods. That notebook can use the prior notebook’s configurations, and then, once again, in our core notebook we can utilize both of those notebooks to call a function, which then creates a data frame in our core notebook.
We’re typically using a lot of these types of run commands to return things like JDBC strings, secrets retrieval, which we’ll talk about later in this presentation and some common business logic transformations. It also lets us split our core logic out into different notebooks so we don’t have to go and touch our main master notebooks or edit our properties notebooks, when we just need to make a little tweak to our functions.
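Outside of a notebook, the same layering can be sketched in plain Python, with the configuration and operations "notebooks" shown as sections that %run would stitch into the caller's scope. All the names, hosts, and paths here are illustrative:

```python
# --- "configuration notebook": static properties shared everywhere ---
CONFIG = {
    "s3_bucket": "s3://example-bucket/data-lake/",   # illustrative path
    "oracle_host": "oracle.example.internal",        # illustrative host
    "oracle_port": 1521,
}

# --- "operations notebook": functions built on the configuration ---
def jdbc_url(config, service_name):
    """Assemble an Oracle JDBC URL from the shared configuration, so
    connection details live in one place instead of every notebook."""
    return (f"jdbc:oracle:thin:@//{config['oracle_host']}:"
            f"{config['oracle_port']}/{service_name}")

# --- "core notebook": uses both, as %run would make them available ---
url = jdbc_url(CONFIG, "EDW")
```

The payoff is the same as described above: a tweak to a function or a property happens in one notebook, and every core notebook that %runs it picks the change up.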
Number six would be managing our costs. This comes with two things, managing our jobs and our main clusters. We want to really understand our jobs and their requirements, and then understand our clusters and what they cost. And finally using dashboarding to do all of that.
First, let’s start with talking about cluster management. We always want to be asking that question, what’s the right cluster for my job? There are many types of clusters that we have available. For us in AWS, I think I have an option of picking between 30 and 40 cluster types. We want to think about what type of job we’re doing. If we’re using machine learning, do we need a lot more memory? Are we doing ETL processes or ELT processes? Do we need a lot more cores? Are we doing just some general ad hoc analysis so we only need those to scale up together, or do we need GPU capabilities for machine learning? A great way to look at this is to use the Ganglia UI to view that.
Next, we want to look at the job and how that relates to the cluster. What is our job doing? What’s the frequency that we want it to run? How often do we want that to complete? How big is the data? Finally, we want to test it using the Ganglia UI once again, to really take both of these things and look at how they’re reproduced on the cluster and how our code’s being utilized.
In this example of the Ganglia UI, we have a four-hour time series spread across this dashboard. In these four hours this cluster has had a total of 10 workers at one time; currently only two of those hosts are up. The dashboard displays four graphs: the first is the cluster load; the second, our memory load; the third, our CPU utilization; and the fourth, on the bottom, our network load. We’re going to ignore the network load for now and just talk about the first three.
Under the cluster load, we can really see how many processes are fully being utilized. You can see a couple blips in the red there. That’s where new workers were getting spun up and then spun down. You can see the same thing happening under the memory tab. By looking at this first graph, we can see that our number of processors really only hit 20 about three to four times within a four-hour period. So maybe we can think about bringing the number of CPUs down, or about reducing our workers. The same thing can be seen on our memory load. We’re really only using 20 to 30 gigs of RAM. Although it did bump up to about 120 gigs of RAM, we never fully utilized it. So maybe we can talk about changing the cluster type, or bringing down the amount of memory that we’re using, or reducing the number of workers. All of these things save costs, and that’s really what we’re looking for.
Next we want to dashboard all of this information. So we use job monitoring, and we created our own Tableau dashboard. You can use things like Prometheus or Grafana to do very similar things. What we like to track is our successes and our failures and our skips within the past 24 hours. This really allows us to get a brief overview of what we’re seeing throughout our Databricks jobs in the previous day. We also track our failures in the past two weeks over a time series, which is a great way to see if maybe we had some downtime or an outage somewhere else. We also look at our most recent failures and what is causing them. And then we have a list of stale tables. Why are these tables still stale? Why haven’t they been updated recently? An example of that can be seen on the right.
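The rollup behind a tile like "successes, failures, and skips in the past 24 hours" can be sketched like this. The run records are a simplified stand-in for what the Jobs API returns, not its actual response format:

```python
from datetime import datetime, timedelta

def summarize_runs(runs, now, window_hours=24):
    """Count job-run outcomes inside a recent window.
    `runs` is a list of dicts with 'state' and 'end_time' keys —
    a simplified stand-in for Jobs API run records."""
    cutoff = now - timedelta(hours=window_hours)
    counts = {"SUCCESS": 0, "FAILED": 0, "SKIPPED": 0}
    for run in runs:
        if run["end_time"] >= cutoff and run["state"] in counts:
            counts[run["state"]] += 1
    return counts

now = datetime(2021, 5, 27, 12, 0)
runs = [
    {"state": "SUCCESS", "end_time": now - timedelta(hours=2)},
    {"state": "FAILED",  "end_time": now - timedelta(hours=5)},
    {"state": "SUCCESS", "end_time": now - timedelta(hours=30)},  # too old
    {"state": "SKIPPED", "end_time": now - timedelta(hours=1)},
]
summary = summarize_runs(runs, now)
```

The same window logic, run over two weeks of history and grouped by day, gives the failure time series mentioned above.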
Usage monitoring is also really key. What we did is, across all of our business units, we created dashboards for all of our managers. We took those dashboards and essentially aligned all of our cluster costs and job costs, rolled those up, and showed all of the individual costs for them to see. We also gave them a list of the 10 highest-costing jobs for their business unit. This allows them to see their total cost across all of their clusters and jobs, the individual costs among those, and the things that they can go and improve on. By improving on this, ultimately we can be saving money.
Our last four lessons are all going to be around managing our users. These include things like user groups and permissions, cluster segregation, leveraging secrets, and training and best practices. Databricks groups and permissions are really important to us as a team. The reason is that when we go and set all of these different types of permissions for business units, it allows us to protect certain parts of our Databricks environment from groups that don’t need access. So we’ve split all of our business units into their own groups, each with their own unique access to certain workspaces and their own production environments, as well as to accessing and creating clusters.
We were also able to take those clusters that those groups are assigned to and give them instance profiles, which can then pass and assume IAM roles in our AWS environment. That then further refines our permission selections, allowing them to either access specific S3 buckets or other AWS resources. A great way that we recommend doing this is by using teams or even user types like data engineers, analysts, job users and admins.
Cluster segregation is also really important. Would you really trust 120 users to manage their own clusters? I personally wouldn’t; it’d be very difficult. But you can use things like cluster policies, which allow you to limit what people can create. So maybe you allow your managers to create clusters, but only with a certain number of workers or certain machine types. What we ended up deciding was best is to create team-shared clusters. Each of our teams has access to three cluster types: ad hoc, machine learning and ETL. Some groups have more than others. For example, our enterprise data warehousing teams have access to two ETL clusters. Those two clusters are shared by about five individuals. We then have about 12 or 14 individuals from our institutional research team, including analysts and data scientists. They share four total clusters: two machine learning clusters with different versions of Python, and two ad hoc clusters for their normal analysis and reporting.
We have also taken all of our jobs and pushed them off to automated clusters. Anything that’s running for a longer period, on the order of a few hours, we try to push off to automated clusters because of the cost savings that we get. That means that all of the jobs and clusters on those automated runs are owned by and assigned to the business unit, so they’re responsible for them. We can also restrict those clusters by using cluster policies.
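As a sketch of what such a restriction can look like, here is an illustrative policy expressed as a Python dict. The attribute names follow the cluster-policy definition format, while the Spark version, node types, and limits are made up for this example:

```python
import json

# Illustrative cluster policy: pin the runtime, restrict node types,
# and cap autoscaling so a team cluster can't grow unboundedly.
team_etl_policy = {
    "spark_version": {"type": "fixed", "value": "7.3.x-scala2.12"},
    "node_type_id": {
        "type": "allowlist",
        "values": ["i3.xlarge", "i3.2xlarge"],
    },
    "autoscale.max_workers": {"type": "range", "maxValue": 10},
}

# Policies are submitted as a JSON definition string.
policy_json = json.dumps(team_etl_policy, indent=2)
```

Attached to a group, a policy like this lets managers create their own clusters while keeping the worker counts and machine types inside a budget the platform team controls.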
Databricks secrets is also an important utility that we use. Being able to physically block access to passwords, API tokens and things like that is really important for us, so that stuff doesn’t get out. What we can do is take our groups and give them permissions to those secret scopes, so essentially each group has its own view into which secrets it can access. In our example here to the right, we run this configuration notebook, which verifies who I am as a user and then returns the connection strings that I have access to via secrets. By passing this data along, I can then run a function which returns a full connection string that I can use to access a database, read and write data to certain APIs, et cetera. By utilizing this, it really helps separate and protect some of our valuable resources, like our enterprise data warehouse and data lake.
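As a rough stand-in for that pattern, here is a sketch in plain Python. In a real notebook the lookups would be dbutils.secrets.get(scope=..., key=...) against a scope the user’s group has been granted; the scope, key, and host names here are all made up:

```python
def get_connection_string(secrets, scope, host, service):
    """Assemble a JDBC connection string from secret-stored credentials.
    `secrets` stands in for dbutils.secrets; in a notebook each lookup
    would be dbutils.secrets.get(scope=scope, key=...). If the caller's
    group lacks access to the scope, the lookup simply fails."""
    user = secrets[scope]["edw-user"]
    password = secrets[scope]["edw-password"]
    return f"jdbc:oracle:thin:{user}/{password}@//{host}/{service}"

# Stand-in secret store; a real scope would be ACL'd to the group.
fake_secrets = {"edw-scope": {"edw-user": "etl_svc",
                              "edw-password": "s3cr3t"}}
conn = get_connection_string(fake_secrets, "edw-scope",
                             "oracle.example.internal:1521", "EDW")
```

The point of the indirection is that no notebook ever contains a literal password: the credential lives only in the scope, and the group ACL decides who can resolve it.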
Finally, training and best practices. It’s really hard to train hundreds of users on new products. However, we really want to let our users learn, and we really want to train them on how we want them to use the product. Tools like Databricks Academy have become really useful for us for new hires and for user trainings. Requiring new users to go through the Databricks Academy process for a specific track, like data science or data engineering, before they come into our Databricks environment is great. They’re really able to get a feel for how Databricks works and how Spark works before they actually go in and get their hands dirty.
Doing things like monthly tech talks and going over best practices, new features, or cool things that you’re working on is also really important. That allows people to collaborate and to share their ideas. Not everyone gets to hear that Spark released a new feature or functionality, or that Databricks released something new like Repos or SQL Analytics. So having these meetings is really important.
Finally, doing things like weekly office hours. So really having your engineers come together, your analysts come together, and share their knowledge with each other. Have them work through their code together, have them work through problems. There are people in your enterprise who have solved problems that others haven’t, and working through that together can really help push your enterprise forward.
All in all, I think understanding and accomplishing just half of these challenges prior to releasing Databricks enterprise wide could have saved years of work and thousands of dollars, and given us a more secure operating environment for our entire enterprise.
Now, there are really four that I want to reiterate. First, cluster management and job management: these are two things that can really help bring costs down in your enterprise. So really being on top of understanding how a cluster works, which clusters to properly use, and how often your jobs should be running is extremely important. Training and best practices is another one of those key lessons: making sure that all of your engineers and analysts are learning what they can to most efficiently use your environment. And finally, Delta optimizations. These are things that have really helped speed up our processes and our analytical abilities.
That is all I have for today. We’ll now move on to a Q and A session. You can reach me via email if you have any questions, or if you just want to talk about Databricks or Spark at all. You can also add me on LinkedIn to discuss anything related to the presentation or anything cool that you’re working on. As always, your feedback is important, so don’t forget to go and rate and review the sessions. Thank you.