We started out processing big data using AWS S3, EMR clusters, and Athena to serve Analytics data extracts to Tableau BI.
However, as our data and team sizes increased, Avro schemas from source data evolved, and we attempted to serve analytics data through Web apps, we hit a number of limitations in the AWS EMR, Glue/Athena approach.
This is a story of how we scaled out our data processing and boosted team productivity to meet our current demand for insights from 20M+ Smart Homes and 500M+ devices across the globe, from numerous internal business teams and our 150+ CSP partners.
We will describe lessons learnt and best practices established as we enabled our teams with Databricks autoscaling Job clusters and Notebooks and migrated our Avro/Parquet data to use MetaStore, SQL Endpoints and SQLA Console, while charting the path to the Delta Lake…
Sameer Vaidya: We hope you, your family and your loved ones are staying safe and are able to help others in need today. We want to welcome you to our session; my offering is a universal message of peace.
You are all here to listen to our session to understand how we grew our data practices to serve insights and Analytics from over 20 million smart home consumers with over 500 million connected devices over the history of time, okay? In order to get to our session, we want to first give you a little bit of a backdrop of what Plume does as a business, what its business imperatives are, and what some of the problems are that we are trying to solve, and the relation of our various data teams in solving these problems. I’m going to be able to talk about some things that are public facing and available on the internet. But from that, I’ll try to derive some internal facing challenges that we are solving today. So Plume is in the business of providing consumer experiences for the connected smart home.
We run our products in tens of millions of homes and monitor and manage various types of services in these smart homes, okay? To give you a sense of some of the things our business teams are looking for, I’ll talk about some business teams who are our stakeholders and consumers of our data products. So to begin with, out here you can see on plume.com what we offered most recently during our 20 million smart home launch celebration. Marketing people wanted to come and offer some insights from the data that we have so that they can launch campaigns on the public web. So marketing is one of our consumers. They look for things like what’s happening across homes, across the world, across different deployments of our customers, and based on insights gained from different types of connected devices. So here on the left, you can see a heat map showing all of our deployments across the globe, which we have to update on a daily basis.
Similarly, we have requests coming from product management. And the product managers want to understand how various features are working, whether people are turning them on or off, how they are performing in the field, et cetera. And again, that happens over all of our homes across worldwide regions. There are people from our product teams, product engineers or algorithms engineers, who are trying to improve and optimize our product performance in the connected smart home. So they look for things like, how are our product optimizations performing? How is the connected device quality of experience? And these kinds of metrics and KPIs, they are looking to see across geographies, over periods of time and over different types of telemetry metrics that we are gathering. The most obvious ones that you’re seeing over here are things like traffic patterns over the course of a day, and internet speed tests and availability over time and across geographies, okay?
So another team that seriously uses a lot of our data products and insights is our NetOps team. This is the network operations team that is responsible for rolling out features and firmware updates across lots of different regions, across lots of homes. And when they roll out new firmware and they roll out new versions of software, they want to understand that everything is still performing stably, relative to and comparable with what they were seeing before that change was introduced. So these are the various types of insights we have to serve for these different types of stakeholders.
And the teams that serve these insights span across three types of teams. We have the data engineering team that ingests and processes data by building pipelines. We have the data science team that uses our data to tune and optimize their algorithms and introduce product innovations. And we have the BI and Analytics teams that take the data and the tables that we are producing in our data warehouse, and expose them through dashboards, ad-hoc reports and various types of insights that our stakeholders ask of us, okay?
So this is the context of our company and team structures and the kinds of problems that we are trying to solve. Now from this, we’ll move on to the agenda for the session. We are… So I am Sameer Vaidya. I work as an architect at Plume, and I focus on data engineering, BI and Analytics concerns. Later on in the session, my colleague Raghav Karnam is going to join us. And he is an expert in data science and machine learning engineering in the company. And he’s going to talk about some things that they’re doing with machine learning and share some insights with you from that. Our session today is targeted really at beginners; we’ve advertised it as a beginner level session.
And what we mean by beginner level is somebody who is relatively early in their journey in adoption of these various technologies for building big data Analytics. And specifically, if you’re interested in looking at the Databricks data platform, Lakehouse, SQL Analytics, those kinds of areas. What we are going to talk about today is how we built our current generation of data processing platform on these technologies, the lessons we learned along the way, and what the overall environment looks like. We are going to be talking about how we use job clusters to improve our productivity, how we use notebooks for developer productivity, and some areas around metadata management for data processing, for converting the big data that we are processing into SQL queryable data for various applications. And I will be focusing on that. And when Raghav joins us, he will be talking about the machine learning life cycle, okay?
We do hope that those who are new to this area gain from our presentation today. And for those who are more experienced and feel that you know a lot of this stuff, we also hope, since you have joined our session, that you do carry away a few insights and aha moments where you learn something from our session, and that it’s worth the time you’ve invested in us. Should you like our session and what we have to say and what our company’s doing, we would love for you to refer your friends to apply for careers at Plume in data areas. Our journey in big data started from early days, so roughly over five years ago. At that point when we started looking at data processing and Analytics, we gravitated toward and selected the Apache Spark framework for all of our data processing.
And over a period of time, we’ve continued to use that and are successful and happy using that. And again, also from early days, we were a big AWS shop. So we’ve embraced and deployed on AWS across multiple regions in the world, very happy and successful with that platform, with that hosting service. And therefore from early days when we started our data processing with Spark, we naturally gravitated towards using the existing AWS big data processing frameworks at that time. For big data processing clusters, we used AWS Elastic MapReduce (EMR), which offers hosted clusters that you can click through and quite easily generate for your processing. Once we finished processing the data, in order to expose it through our SQL Analytics data warehouse, we started using AWS Athena, which takes the underlying data and makes it accessible over SQL.
So think of Athena as our data warehouse. And in order to take the underlying big data and expose it through SQL over Athena, you have to use something called AWS Glue crawlers. I can almost never say Glue crawler correctly, but I think this time I got it right. The Glue crawlers take the underlying files, whether Avro, Parquet, JSON or CSV, and generate metadata out of them automagically to expose them as SQL databases and tables, and you can configure it to do this. So this was the initial infrastructure that we started out with. And in early days it was working reasonably well when the team sizes were small and the amount of data we were processing was reasonable. What started happening is over a period of time, as the data evolved, we started getting more and more data. The number of developers increased. We started getting more and more teams building these jobs. And we were having slight difficulty keeping everything consistent and standardized.
What started happening is that we started running into some bottlenecks, and I want to talk about those as some of the challenges we were facing and how we overcame them. So we’ll start with talking about the Spark data processing clusters. What I’m describing at this point in time is what we found roughly two years ago. So these challenges were being faced two years ago, and then I’ll explain how we overcame them. So at that point in time for the Spark data processing, what we found was that we were processing data both in development and production. And in production, our DevOps team likes to control the resources and manage the resources for us.
And what we found is that as people started using the EMR clusters in production operations, and we started writing more and more jobs, we found it complicated to figure out how to size the EMR clusters, how to configure them and how to scale them and use them in a just-in-time manner for just the jobs that we were processing. It turns out that we at that point had not quite figured out how to automate this. And I don’t know if this is possible today or not, but at least at that point, we had an infrastructure where we would create a cluster and just leave it running because it was too complicated for us to terminate it, shut it down, and then again reconfigure it and relaunch it. So because of this complexity, we got into this habit where the EMR clusters were just built once and left running. And once you left them running, then as we started adding more and more jobs, it became complicated for us to figure out how to size them, how to grow them and shrink them. And we weren’t quite able to figure that out.
For the clusters that were running in production, we found that for some of our batch processing which ran infrequently or just occasionally over the day, we were just running clusters all day long, and essentially considerably reducing the resource utilization of those clusters and unnecessarily paying for the resources that we were not using. So this was what was happening in production. In development, as developers wanted to create their own clusters, we went through a period of time where we let developers create their own clusters for a while. And that flexibility bounced back on us because in some cases, what the developers did was leave large clusters running over long weekends and things like that, and we realized that they were not being super responsible.
And the reason for that is there was not an easy way to build a cluster and automatically time it out and come back to work and start it up and so on. So because of this lack of automation, we found that the developers were not able to use them very responsibly. And so we went through a period of time where we tried to create clusters and manage the sharing of clusters. And that got quite complicated because of the logistics of using shared libraries on the clusters and so on. So these are some of the challenges we were having with clusters. At that point also in terms of the development processes, what we found was that most developers were using spark-submit jobs and either SSH-ing into the cluster and running things on Spark Shell or just running jobs and looking at log files to do the debugging and troubleshooting.
And overall, we found that the developer experience and productivity was rather low. There were probably one or two resourceful, brave engineers who figured out how to use notebooks, but it wasn’t a widely adopted practice. So for some reason, we were living in the dark cave ages as far as the developer experience goes. So these are some of the challenges we were having with Spark processing. For actually taking the underlying processed data and exposing it, what we found was that the Glue crawlers… Sorry, my bad. I want to go back to AWS Athena. I want to talk about challenges we were having with Athena. So with Athena, we were mostly happy. It was performing quite well for almost all the queries that we were trying to do. There were a couple of areas that were the gotchas for us.
One of the areas is, while most of the queries worked pretty well, we found that on certain occasions, when our data scientists were trying to look for deep insights by running complex queries over large data sets gathered over long periods of time, they wanted to run complex SQL. And they found that they would run it for several hours and then eventually it would just time out or run out of resources. And they were not able to actually finish those queries. And the thing we found as a blocker for us is that there didn’t seem to be any real solution to this. So out here you can see that this is the thing that you get if you run out of time or if you try to use resources that they’re not willing to give you.
And this is because Athena is a serverless environment, and serverless has pros and cons. The pro is that you don’t have to mess with the capacity planning and managing the resources around that; that’s what makes it appealing and attractive. The downside is, should there be a time where you have a complex, large query you want to run, and you’re willing to pay for it because it’s valuable enough for you that you don’t mind using the extra resources, spinning them up and paying for it, there is really no such option. You’re just out of luck over there. And so in that case, the concept of serverless works against you because you don’t have the control to spin up additional resources that you need for the processing you need to do. So that was one thing that we didn’t quite like about it.
Another area where we had an issue was that for some of the data that was available, we found a way to expose it to our support agents. And they found it quite useful to take this data and surface it in the support web applications. And they would use it for serving customer support calls and so on. And we thought this was great, we can expose this Athena data over these web applications. And it worked well for a while until the support volume started increasing. Then as the volume started increasing, what we found out was that Athena gives you a fixed number of queues for a fixed number of credits. And if one day too many support agents were working on the app, suddenly the queues would get filled up and no other processing could work. Everybody else is now blocked because all the queues are taken up by the support agents.
So we then realized that, “Oh, this is not good because it’s a shared queue across everybody who’s using this across the account and that’s not quite going to work.” So that idea of having control over pooling the resources and deciding which consumer of the applications gets how much of which resource, that facility was not available. So in this case, we started feeling constricted, constrained by the serverless concept. And the last area was related to the metadata management. With the Glue crawling, as I said, most of the time as developers built these underlying pipelines, they found it quite easy to configure these crawlers and point them at a location, and it would automatically build metadata. This evolved organically over a period of time. And it worked well for a while till it stopped working very well, okay?
And what we noticed was that our schemas started evolving over time. And as the schemas evolved, the versions of the schemas evolved. And also as we changed some processing technology, we found that certain files in the underlying file system were getting introduced differently than they were before. And what we noticed is that if the developers are not careful configuring the crawlers with just the right exclude clauses and the right parameters, the crawlers would go berserk and start generating a lot of cruft or incorrect metadata that was not useful. So this is an example of the kind of thing that can happen based on some underlying situation. The crawler suddenly starts generating a large number of data tables, because it thinks that the underlying partitions or files are data tables, because it’s automagically trying to detect what the tables are. And in this case, if you’re not very careful, or if there’s a certain situation that you don’t account for, quite easily you can make a mess out of it, okay?
And so we found that this idea of an automatic metadata management system based on crawling was not a very good situation. So these are some of the things we were struggling with. And so that’s when we decided to look for a better solution. And we came across Databricks, and Databricks first demonstrated to us this concept of clusters. And we absolutely loved it, because in the clusters, they showed us that you can create these standard clusters, or clusters shared between developers. And it’s quite easy to create one and provision it. And then for your jobs, you create something called job clusters. And the job clusters simply spin up when you submit a job. And when the job is done, they tear it down and that’s it. You pay exactly for the processing you do with your jobs. There is no idle time, no underutilized resources.
Also, the standard clusters that we started giving our developers have this concept where you build the cluster, and then when it idles out, it just times out and shuts down, and you stop paying for the underlying compute costs. And when the developer’s ready to work with it again, they can just click a button and restart that cluster, right? In some cases, you can just submit a query to it and it automatically starts up the cluster. So we found that to be quite appealing and we decided to embrace that. Similarly with the metadata management, we quite liked the concept of the Delta Lake.
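As a rough illustration of the idle timeout described above, here is a minimal Python sketch that builds a request body for a developer cluster. The `autotermination_minutes` and `autoscale` fields are part of the Databricks Clusters API; the cluster name, worker counts, 30-minute timeout and the placeholder runtime and instance values are assumptions made up for this example, not settings from the talk.

```python
import json


def dev_cluster_spec(owner: str, idle_minutes: int = 30) -> dict:
    """Build a request body for a developer cluster that shuts down when idle.

    All concrete values here are illustrative assumptions.
    """
    if idle_minutes <= 0:
        raise ValueError("idle_minutes must be positive so the cluster can time out")
    return {
        "cluster_name": f"dev-{owner}",             # hypothetical naming scheme
        "spark_version": "<runtime-version>",       # placeholder, fill in for your workspace
        "node_type_id": "<instance-type>",          # placeholder, fill in for your workspace
        "autoscale": {"min_workers": 1, "max_workers": 4},
        "autotermination_minutes": idle_minutes,    # stop paying once the cluster sits idle
    }


spec = dev_cluster_spec("alice")
print(json.dumps(spec, indent=2))
```

The point of the sketch is only that the timeout is a property of the cluster definition, so nobody has to remember to shut anything down before a long weekend.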
And the Delta Lake is a self-managing metadata system, which means there are no external additional crawlers that you need to look at the underlying file system and determine what the table structure looks like and what the schema looks like, okay? The Delta system writes the data and the metadata as it updates the data. So the writes and updates are heavy, but the reads are speedy, quick and efficient, and there is a way to optimize the underlying data. So we found that to be more appealing than having to build and maintain an automated metadata management system based on crawlers.
So these were some of the appealing things for us to get started with Databricks. And over the next few slides, we are going to talk you through our journey of how we transitioned to the Databricks platform, what were some of the challenges we solved to scale it up across our worldwide processing? And what are some of the benefits we gained out of that? So let’s start with topic number one, which is talking about operational productivity and developer productivity. And for that, we had to deploy Databricks workspaces all over the world in various regions of our processing. And then let the developers use them in every area of processing.
So I’m going to talk about how we went about that and what were some of the challenges that we had to overcome over there, and what are some of the best practices we gained out of that, okay? Now realize that Databricks is an installed product, which means it’s not like an AWS native service, where you can just click some AWS service and enable it and turn it on. You have to actually plan for a Databricks installation, which is called a workspace. And in our case, because we were operating in multiple regions and multiple AWS VPCs across the world, we had to plan for any number of workspaces. Right now, we are roughly under 10 of these, but we had to plan for these. And in each case, we had to plan for dual pairs, which is the idea of a dev workspace and a production workspace in each of the regions.
So in order to plan for that, we realized that some of the things we have to be careful with is number one, we have to standardize the namespaces. What do I mean by namespaces? Any given Databricks workspace has a URL that you use to access and refer to it. So we then worked out that a best practice for naming should be consistently used everywhere, because if you get it wrong, unfortunately there is no simple rename command. You have to literally tear down the whole thing and rebuild it. So after one such expensive lesson, we learned our lesson in standardizing the namespaces. And today we follow a pattern, which is something like company name, dash, domain name, dash, region name. And then after that, you get the .cloud.databricks.com. So we standardized that namespace for all the workspaces as we started installing them.
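The naming pattern just described can be captured in a small helper so that nobody types a workspace name by hand. This is a minimal Python sketch: the function name, the validation rule and the example values (`acme`, `analytics`, `us-west-2`) are hypothetical, and only the company-domain-region shape and the `.cloud.databricks.com` suffix come from the talk.

```python
import re

# Lowercase letters and digits, optionally hyphen-separated (for example "us-west-2").
_PART = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")


def workspace_host(company: str, domain: str, region: str) -> str:
    """Return the workspace hostname for the company-domain-region convention."""
    parts = {"company": company, "domain": domain, "region": region}
    for label, value in parts.items():
        if not _PART.match(value):
            raise ValueError(f"{label} {value!r} must be lowercase letters, digits and hyphens")
    return "-".join(parts.values()) + ".cloud.databricks.com"


host = workspace_host("acme", "analytics", "us-west-2")
print(host)
```

Rejecting bad input up front matters here because, as noted above, a wrongly named workspace cannot simply be renamed afterwards.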
While planning for the workspaces, we also realized that we should plan for bucket names. Again, it’s one of those things where you have to standardize on the bucket names, because once you pick the bucket during the installation, you cannot change that bucket name, and it’s best for the automation and DevOps and all those practices to have consistent bucket names everywhere. So we learned that we should standardize the namespaces while creating these workspaces. Then the users and groups. In AWS within our account across various VPCs, we could just create groups and users once, and the same ones are available through AWS IAM worldwide. But with Databricks that’s not true, because each workspace has its own concept of users and groups. Which means that we have to manage the users and groups in each of these workspaces. So imagine that times nine or 10, right?
But the good news is that they offer something alongside SAML single sign-on; it’s called SCIM integration. And with that, what you can do is use your SSO. So we use our single sign-on SAML identity provider, and we map our users and groups in the SAML SSO that our IT maintains. So anytime new users or new employees join the company, as part of the onboarding process, IT provisions them into our single sign-on provider. And then the underlying integration automatically pushes the users and groups into the Databricks workspaces. And that allows us to do user management across this large number of workspaces. So that’s how we solved that problem. And then within each of these, you have various resources like clusters, jobs, databases, et cetera. So for that, we realized quickly that we need to automate the role based access control for that and apply it consistently to every workspace.
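To make the idea of applying the same access control to every workspace concrete, here is a small Python sketch that only plans the requests and sends nothing. It assumes the Databricks Permissions API shape (`/api/2.0/permissions/<type>/<id>` with an `access_control_list` body); the hostnames, group names, resource IDs and chosen permission levels are invented for illustration and are not what Plume’s scripts actually do.

```python
from typing import Dict, List, Tuple

Request = Tuple[str, str, dict]  # (HTTP method, URL, JSON body)

# Hypothetical desired state: the same groups and levels in every workspace.
ACL_BY_TYPE: Dict[str, List[dict]] = {
    "clusters": [{"group_name": "data-engineers", "permission_level": "CAN_RESTART"}],
    "jobs": [
        {"group_name": "data-engineers", "permission_level": "CAN_MANAGE_RUN"},
        {"group_name": "analysts", "permission_level": "CAN_VIEW"},
    ],
}


def plan_acl_requests(
    resources_by_host: Dict[str, List[Tuple[str, str]]],
    acl_by_type: Dict[str, List[dict]],
) -> List[Request]:
    """Return one PATCH request per resource, with an identical ACL per resource type."""
    plan: List[Request] = []
    for host in sorted(resources_by_host):  # sorted for a stable, reviewable plan
        for resource_type, resource_id in resources_by_host[host]:
            url = f"https://{host}/api/2.0/permissions/{resource_type}/{resource_id}"
            plan.append(("PATCH", url, {"access_control_list": acl_by_type[resource_type]}))
    return plan


# Resource IDs differ per workspace, so the inventory is keyed by workspace host.
inventory = {
    "acme-data-us.cloud.databricks.com": [("clusters", "c-1"), ("jobs", "42")],
    "acme-data-eu.cloud.databricks.com": [("clusters", "c-9")],
}
plan = plan_acl_requests(inventory, ACL_BY_TYPE)
for method, url, _body in plan:
    print(method, url)
```

Keeping the desired state in one structure and generating the per-workspace calls from it is what makes the result consistent; actually sending the requests and handling authentication are left out of the sketch.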
So we had to build some scripts to take all of these underlying resource administration roles and automate applying them across all of these workspaces. So these are some of the things you have to be aware of and plan for. One thing we realized is also that over a period of time across these various regions and workspaces, Databricks themselves are upgrading their underlying services. And over different periods of time, these can go through maintenance and upgrade periods. And we found that… I want to give a shout out to status.databricks.com, which we found to be quite useful. When something doesn’t look quite right, and we’re trying to work out if it’s our issue or their issue, you can look at status.databricks.com and that tells you what’s happening in various workspaces and regions across the world.
So this was how we set up our various workspaces across the world. Now once we set up these workspaces, we wanted the developers to start using them. And what we noticed is that almost from day one, everybody quickly gravitated towards using notebooks as their integrated developer experience, as their IDE of choice. And I’m proud to announce that after a year, we took a survey and a number of developers claimed in the survey that they felt subjectively that their productivity was up 30 to 50% by using and adopting the notebooks. So for the notebooks, the cool thing is that the notebook has things called cells. And in the cells, you can write in your language of choice. And so we have our data engineers who use Scala. So what they do in Scala now is they develop their underlying job and create jar files out of that, and they upload them to the cluster. And then they are able to interact with it for development and testing, and just see how well their job is working. They’re able to interact with it through the notebook, through the Scala code.
Our data scientists are able to write Python-based code and then use the Python code to interact with their underlying algorithms and so on against their data. And our SQL analysts are able to use SQL, and they use the SQL cells to interact and play with the data and do their development process with the notebooks. So that all is working pretty well. And one of the interesting things we were able to do is, previously our analysts would rely on the data engineers to convert anything into a scheduled job, but with the notebooks now, what they’re able to do is write some code in SQL and click a button and just schedule it as a scheduled job. So it’s quite easy. You can just write your code and your logic for the job, and then schedule it, and Databricks will run it on whatever schedule you specify for that. So they were able to do quite a quick and easy job of getting something and rolling it out as a job without waiting for the data engineering team to do it for them.
Over a period of time, we learned a couple of things. One is that the notebooks in early days did not have underlying Git support, but recently they introduced GitHub repository support, and we are quite happy with that. The GitHub repository support allows us to map the notebooks to an underlying repository and manage the versioning of that in the underlying repository. So certainly you should look at GitHub repos. And then in terms of scheduling, we learned that all of the scheduled jobs are quite nice and convenient, but they were not hooked up into our Airflow-based job monitoring system.
So over a period of time we converted those over into Airflow to get the monitoring. So we stopped using the Databricks scheduler and are simply scheduling those jobs using Airflow to get the job dependencies and the monitoring. So we started using Airflow for that. Another thing you can do, if this is not adequate: there are a couple of other alternate experiences that we are using for developers. So one is that, using Simba drivers, you can configure a SQL IDE. So you can actually connect to Databricks clusters using a commercial SQL IDE that you might like better, or you can use your developer IDE like IntelliJ or PyCharm, or something like that, and submit your code into the clusters for Spark using a technology called Databricks Connect. And our developers are using a combination of all of these technologies to improve their productivity for development.
Let’s talk about clusters now. And for clusters, I mentioned early on about job clusters and high concurrency and standard clusters. So with job clusters, what we are doing is we are submitting jobs from Airflow, and each developer who is responsible for a job gets to decide how much resources that job needs. And they’re writing the Airflow DAGs to create the job sizing… Sorry, the cluster sizing, and submit that job into the cluster. And then Databricks will spin up a cluster as you specify, run your job and tear it down as I mentioned, okay? So in this, what we’ve done is we use the concept of cluster policies so that we can keep some amount of limit and sanity checks, so that somebody doesn’t go too crazy with it without us finding out about it, okay? So we are using the job cluster policies for that.
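The kind of sanity check a cluster policy provides can be sketched locally. In the Python below, `POLICY` borrows the shape of a Databricks cluster policy definition (attribute paths mapped to `range` and `allowlist` rules), but the limits, the instance types and the checking code itself are assumptions for illustration; the real enforcement happens inside Databricks when a cluster is created under a policy.

```python
from typing import Any, Dict, List

# Hypothetical limits, written in the shape of a cluster policy definition.
POLICY: Dict[str, Dict[str, Any]] = {
    "autoscale.max_workers": {"type": "range", "maxValue": 20},
    "node_type_id": {"type": "allowlist", "values": ["i3.xlarge", "i3.2xlarge"]},
}


def lookup(spec: dict, dotted_path: str) -> Any:
    """Follow a dotted path such as 'autoscale.max_workers'; None if it is absent."""
    current: Any = spec
    for key in dotted_path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def violations(spec: dict, policy: Dict[str, Dict[str, Any]]) -> List[str]:
    """List every way a job's cluster spec breaks the policy (empty list means OK)."""
    problems: List[str] = []
    for path, rule in policy.items():
        value = lookup(spec, path)
        if value is None:
            continue  # this simplified check ignores attributes the spec does not set
        if rule["type"] == "range":
            if "maxValue" in rule and value > rule["maxValue"]:
                problems.append(f"{path}={value} exceeds {rule['maxValue']}")
            if "minValue" in rule and value < rule["minValue"]:
                problems.append(f"{path}={value} is below {rule['minValue']}")
        elif rule["type"] == "allowlist" and value not in rule["values"]:
            problems.append(f"{path}={value!r} is not an allowed value")
    return problems


small_job = {"node_type_id": "i3.xlarge", "autoscale": {"min_workers": 2, "max_workers": 8}}
runaway_job = {"node_type_id": "p3.16xlarge", "autoscale": {"min_workers": 2, "max_workers": 200}}

print(violations(small_job, POLICY))
print(violations(runaway_job, POLICY))
```

Each developer still sizes their own job cluster; the policy only bounds how far that sizing can go.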
We played with standard and high concurrency clusters. And because our developers work at different points in time and there is not enough concurrency, we concluded that the standard clusters were better for us. And the high concurrency cluster is a lot more expensive to run, but it allows… It’s built in such a way that it balances all the incoming requests. So if you have a number of developers working simultaneously, you should use the high concurrency clusters, but if you have sporadic, sparse usage of the clusters, you simply give them the dedicated clusters. And we uncovered some best practices from that. One of them is that for the clusters, we initially tried to use Spot Instances everywhere. We got excited that these Instances can be quite cheap and you should use Spot Instances. And then what we learned over time is that that’s a bad idea, because what can happen is, sometimes the job will run for three, four hours, five hours, and then it will terminate because the Spark Driver node just goes away because of the underlying Spot Instance being taken away. And now your entire job fails.
So what we learned is that we should use reserved Instances for the Driver nodes and Spot Instances for everything else. And that has been working pretty well for us. Another thing we uncovered was that when developers submit these jobs, they have to submit the jobs under their own credentials. And once you do that, no other developer can go and troubleshoot your job. So if your job failed and one of your teammates wanted to help out with this job or troubleshoot this job, that was not possible. And then Databricks introduced this thing called service principals. And the service principal concept allows you to take essentially the equivalent of a service user that is headless, and you can create that in your group and use the service principal for launching your jobs. And once you do that, then the people in your group are able to actually see the logs and do the troubleshooting and so on. So we uncovered that we should use service principals for managing jobs.
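Going back to the Spot Instance lesson above, the driver-versus-workers split can be expressed in the `aws_attributes` block of a cluster spec. The field names `first_on_demand`, `availability`, `zone_id` and `spot_bid_price_percent` are from the Databricks Clusters API on AWS, where the non-Spot capacity is called on-demand; treating that as the equivalent of the "reserved" Instances mentioned in the talk is an assumption of this sketch, as are the zone and bid values.

```python
def mixed_capacity_attributes(zone_id: str, spot_bid_price_percent: int = 100) -> dict:
    """Keep the driver off Spot capacity while the workers use Spot.

    Values are illustrative assumptions, not settings from the talk.
    """
    return {
        # The first node placed is the driver, so one on-demand node protects it
        # from Spot reclamation; the remaining nodes (workers) use Spot.
        "first_on_demand": 1,
        "availability": "SPOT_WITH_FALLBACK",  # fall back to on-demand if Spot is unavailable
        "zone_id": zone_id,
        "spot_bid_price_percent": spot_bid_price_percent,
    }


new_cluster = {
    "spark_version": "<runtime-version>",  # placeholder
    "node_type_id": "<instance-type>",     # placeholder
    "num_workers": 8,
    "aws_attributes": mixed_capacity_attributes("us-west-2a"),
}
print(new_cluster["aws_attributes"])
```

Losing a Spot worker costs some recomputation, whereas losing the driver fails the whole run, which is why only the driver is pinned to stable capacity here.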
One thing we learned is this idea of subnets. I should have probably mentioned it on my earlier slide about planning the workspaces, but this was a lesson we learned after the fact. At one point, we were having some issues with our… some processing. And we uncovered that while installing the workspace, our DevOps team had installed it in the existing VPC with some other services, whereas Databricks best practices recommend that you have a dedicated subnet with adequate subnet space. And we learned that lesson the hard way. We had to tear it down and rebuild the workspace in the appropriate dedicated subnet.
So do remember to give it a dedicated subnet and give it adequate room to grow. Because again, this was something we learned: in the beginning, we thought, how many Instances can we use? Right? But as our processing grew, we suddenly started realizing that we can use quite a few Instances and that we should probably plan for a large enough subnet that we’re not worried about running out of space. So this is something you should plan for: build a large enough dedicated subnet for your underlying clusters and workspaces. For high availability, I mentioned that we use Airflow for our job scheduling. And what we learned over a period of time is that Databricks services and APIs go through maintenance periods. And what a maintenance period means is that for some period of time, your job might invoke the API to launch the cluster and run your job, but that invocation fails because the API is undergoing a maintenance upgrade.
Now they are kind enough to announce it in the status channel and through email alerts and those kinds of things. But nevertheless, you have to account for it in your job launching and execution. So what we learned to do over a period of time is to retry the jobs, because when these APIs go away, they do come back in 15, 20 minutes. So instead of just letting it fail and dealing with it manually, we’ve built logic into our Airflow jobs now to relaunch the job within a few minutes and keep trying for up to 30 minutes. And now that has been working pretty well for us. Another thing we learned is that in some situations, as the number of Instances started growing and our data processing increased, the underlying AWS availability zone in a certain region runs out of certain types of Instances.
And when that happens, there is nothing you can do. On one occasion, we complained to Databricks support, but then they said, “Look, we rely on the underlying AWS to give us Instances. And if they don’t have that instance type, there’s nothing we can do.” Which we thought was reasonable. And so then what we learned to do is to retry the jobs in such a way that you can ask them to run in a different availability zone. And that’s working much better for us, where now if the underlying availability zone is out of Instances, we simply try another availability zone and get our job executed in that availability zone. So you should plan for such retry logic to try different availability zones should the original one run out of Instance types. Another thing we experienced, and this is very much real, they talk about this in their documentation, is something called idempotency of jobs.
So remember the example I gave you where the API submission used to fail? So in this case, what happens is we found some situations where our retry logic would resubmit the job and the same job would run two or three times. And this actually created quite a few problems for us, because in some cases it would write multiple files underneath, and it would mess up our logic which relied on counts of the underlying records and things like that. So it actually gives you bad data outcomes. And we had to figure out how to deal with it and fix it. So then we learned about this thing called idempotency tokens. So now when we submit our job, we submit it using an idempotency token. And that has now made every job completely unique in time and space. And when we retry the job, we no longer suffer this problem of having multiple submissions of the job.
So I highly recommend using idempotency tokens for your job submissions. So these are some of the lessons we learned for operating the clusters. And this has allowed us to pretty much let the developers do most of this in a self-service manner, considerably reducing the involvement and overhead of DevOps for managing and maintaining these clusters.
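The three operational lessons above (retry through API maintenance windows, fall back to another availability zone, and submit with an idempotency token) can be sketched together roughly as follows. This is not Plume’s Airflow code. `submit` and `wait_for_result` are hypothetical callables standing in for a call to the Databricks Jobs `runs/submit` endpoint and for polling the run; the 30 minute budget and 3 minute delay are assumptions loosely based on the numbers in the talk. `idempotency_token` and `aws_attributes.zone_id` are real request fields.

```python
import copy
import time
import uuid


def submit_once(submit, payload, max_wait_s=30 * 60, delay_s=180,
                sleep=time.sleep, clock=time.monotonic):
    """Retry one logical submission while the API is unavailable.

    The same idempotency token is sent on every attempt, so a retry after an
    ambiguous failure cannot create a second run of the same job.
    """
    request = copy.deepcopy(payload)
    request["idempotency_token"] = str(uuid.uuid4())
    deadline = clock() + max_wait_s
    while True:
        try:
            return submit(request)
        except ConnectionError as exc:  # assumption: how an outage surfaces
            if clock() + delay_s > deadline:
                raise TimeoutError("job submission kept failing") from exc
            sleep(delay_s)


def run_with_zone_fallback(submit, wait_for_result, payload, zones, **retry_kwargs):
    """Try each availability zone in turn until a run succeeds.

    Each zone attempt is a deliberately new run, so it gets its own token.
    A real version would inspect the failure reason and only move to the
    next zone on a capacity error; this sketch treats any failure the same.
    """
    for zone in zones:
        request = copy.deepcopy(payload)
        request["new_cluster"].setdefault("aws_attributes", {})["zone_id"] = zone
        run_id = submit_once(submit, request, **retry_kwargs)
        if wait_for_result(run_id):
            return run_id, zone
    raise RuntimeError("no availability zone could run the job")
```

Keeping the token fixed inside `submit_once` but fresh per zone is the design point: retries of the same request must collapse into one run, while a zone change is an intentional second attempt.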
Now, as people started using these clusters, at some point we had to renew our licenses and figure out what our spend was. And up until that point, I’m a bit embarrassed to admit, we had not been doing a very good job of tracking our spend, but now we’ve learned our lesson. And what we learned was they have these wonderful usage reports where you can use AWS tags. And as long as you tag things correctly, you can actually slice and dice your spend in quite a few ways. And so what we learned to do over time is to apply tags for the different ways we wanted to see our spend. So by the cost center, by what kinds of projects we were using these clusters for, which team was using it, which individual owner was responsible for something, and in which environment and which region the processing was running. And by creating these kinds of AWS tags when launching the clusters, we are now able to go and look at the usage reports and slice and dice them.
The usage reports are available directly through the Databricks account management console, or they also allow you to export the data and build your own custom reports. So we started doing that: exporting the usage data and then doing our own usage reporting in Tableau. So this is very useful. So I highly recommend you plan for this right from inception.
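To make the tagging idea concrete, here is a small sketch of the two halves: enforcing a fixed set of tags when a cluster is launched, and then slicing exported usage by any one of them. The tag names mirror the dimensions listed in the talk, but the exact keys, and the `tags` and `dbus` fields of each usage row, are assumptions made for illustration; the real export has its own column names.

```python
from collections import defaultdict

# Dimensions mentioned in the talk; the exact key names are assumptions.
REQUIRED_TAGS = ("cost_center", "project", "team", "owner", "environment", "region")


def build_tags(**tags):
    """Return a tag dict for a cluster's `custom_tags`, refusing incomplete sets."""
    missing = [key for key in REQUIRED_TAGS if not tags.get(key)]
    if missing:
        raise ValueError(f"missing required tags: {missing}")
    return {key: str(tags[key]) for key in REQUIRED_TAGS}


def dbus_by_tag(rows, tag):
    """Sum usage per value of one tag; rows without the tag go to 'untagged'."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["tags"].get(tag, "untagged")] += float(row["dbus"])
    return dict(totals)


# Hypothetical exported usage rows.
usage = [
    {"dbus": 1.5, "tags": {"team": "analytics", "environment": "prod"}},
    {"dbus": 2.5, "tags": {"team": "analytics", "environment": "dev"}},
    {"dbus": 4.0, "tags": {"team": "ml", "environment": "prod"}},
    {"dbus": 1.0, "tags": {}},
]
per_team = dbus_by_tag(usage, "team")
```

Rejecting a launch with missing tags is what keeps the "untagged" bucket small enough for the report to be trusted.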
All right. So we talked about establishing the clusters and establishing the developer experience. I want to talk about… shift gears and move into metadata management, how we are managing query performance, and how we moved from hierarchy and SQL to the Delta Lake. So we talked earlier on about Glue crawlers, and I’d mentioned that the Glue metadata system keeps the Metastore updated, and a Metastore is essentially a definition of tables and how they map to the underlying file system, okay? Now, when we moved to Databricks, we kind of tried this idea of a hybrid, to say, maybe we can leverage and reuse the existing AWS Glue system, et cetera. But that actually turned out to be quite complicated, and wasn’t giving us a clean separation between these concerns. And so after struggling with that for a while, we decided to just create a clean slate and completely separate what we were doing with AWS Glue and Athena from what we were doing at Databricks.
So we decided to completely embrace the Databricks Metastore for managing metadata for Databricks. And in that, we wanted to start out by using our existing Avro and Parquet files, and simply mapping the metadata into the Databricks Metastore. And we uncovered that that is actually a little complicated, because it requires you to write your own equivalent of crawlers. Because now remember that while I complained about Glue crawlers, the good news was they were at least there to automate the thing for you. But if you want to do your own Parquet and Avro using Databricks, then you have to really work out your own effective metadata management system. And we did some of that because we wanted to shift over into Databricks and then slowly migrate over into Delta. And I’m going to talk about what the Delta migration experience was, but what we uncovered by using Parquet and Avro through Databricks SQL queries is that it really doesn’t work as efficiently and optimally as the original Athena system does, which is based on a different technology than Spark SQL.
So based on those experiences, we basically decided to move to Delta Lake as quickly as possible, because Delta Lake arranges the underlying data in a more optimized manner, and that works much better with Databricks. So if you’re planning to migrate from Legacy Parquet and SQL… Sorry, Parquet and Avro into Databricks, then practically speaking, you should immediately plan for migrating to Delta Lake, because the Legacy underlying Avro and Parquet support, while it exists, is not very good. The SQL query performance and all is not very good over there. And you have to maintain your own metadata, writing your own crawler system. So I would avoid doing that if I had the opportunity. We had to go through that and create crawlers, but I would avoid doing it.
Now, what came to the rescue is that Databricks has this technology called Auto Loader. Our resident solution architect helped us configure the Auto Loader and it worked quite well. And with the Auto Loader, what you can do is you can just write a few lines of streaming code and point it at a source. And essentially in the streaming code, you can convert it… write it as Delta and keep it up to date. So it becomes like our ingestion system to ingest into the Delta system. So that is something I would recommend you give a try. But I want to spend a little bit of time on how we went about migrating our existing Parquet to Delta, because I think that’s a challenge we had to undergo. And some of you who have existing Legacy data and are trying to do this are going to have to do the same.
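For readers who have not seen it, the "few lines of streaming code" for Auto Loader look roughly like the sketch below. It only runs on a Databricks cluster, where `spark` is the session the notebook provides and the `cloudFiles` source is available. The paths are placeholders, and the choice of Parquet as the source format is an assumption, since the talk does not say which source the Auto Loader was pointed at.

```python
def start_autoloader_to_delta(spark, source_path, target_path, checkpoint_path,
                              source_format="parquet"):
    """Start a stream that picks up new files under source_path and appends
    them to a Delta table at target_path."""
    incoming = (
        spark.readStream.format("cloudFiles")          # Auto Loader source
        .option("cloudFiles.format", source_format)    # format of arriving files
        .option("cloudFiles.schemaLocation", checkpoint_path)  # where the inferred schema is tracked
        .load(source_path)
    )
    return (
        incoming.writeStream.format("delta")
        .option("checkpointLocation", checkpoint_path)  # remembers which files were already ingested
        .start(target_path)
    )
```

In a notebook this would be called as `start_autoloader_to_delta(spark, "s3://…/raw", "s3://…/delta", "s3://…/_checkpoints")`, with the bucket paths filled in for the table being ingested.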
So out here, the first thing is that you have to uncover where your Parquet and Avro data is being used. Which means you have to have quite a clean in-house catalog and data lineage understanding: what is the data? Where is the data? Who is using it? And where is it coming from? Right? And for this, you have to have all the underlying metadata. What I mean by metadata is the DDL. For each of these datasets, you have to have the definition language for each table and how it maps to the underlying data. So we started by cataloging all of our paths and all of our schema DDL for the underlying data. And once we did that, we started planning for how to convert the underlying Parquet into Delta. Now realize one thing: there is a quick and easy command for converting Parquet to Delta, because essentially Delta is just Parquet plus; it just writes some metadata on top of Parquet.
So that is quite an easy, convenient command. But if you have Avro, there is no such easy command for Avro. You actually have to do some job ingestion to convert the Avro into Delta, okay? But luckily for us, a lot of our underlying data was Parquet. So we were able to do this a little easier. So for this, what we did is, we had to first plan for which other consuming pipelines… Okay, now realize this, that the first step in this is what I call a big bang, where you have to take all the underlying data and you have to convert it from Parquet to Delta. And once you do that, that solves the problem for the initial data that got converted. But imagine this: on the very next day or the very next hour, some job is going to run and it’s going to write more data, right?
So then you don’t want a situation where the job is writing Delta… Sorry, Parquet data into the underlying data that you converted to Delta. So then what we had to do is we had to figure out which jobs we had reading and writing the underlying Parquet data, and we had to modify these jobs very surgically to change them from read from Parquet, write to Parquet, to read from Delta, write to Delta, as appropriate, okay? So we had to take all of these pipelines and convert all of this code and have it ready. In some cases we also found that there were external consumers of our underlying data who were directly loading the data and stuff using SQL. They were directly loading the data from the underlying file systems. And we had to coordinate with them to say, “Hey, we are converting this into Delta and you guys should be ready, when the time comes, to shift over along with us to this Delta system.”
So after we did that, we then created scripts to automate and convert all of this data. We paused all of our pipelines, converted all of the data, then started the pipelines with the new code, which was reading and writing as Delta, okay? Then we resumed all the pipelines. And next day onward, all of this code is now reading and writing Delta format. And this is how we did our conversion from Parquet to Delta. And in order to maintain backward compatibility with Athena, there is a way to take the Delta data and expose it through the manifest system and make it available to Athena. Now, for that, we also had to create our own MSCK REPAIR mechanisms to let Athena see that the underlying Delta data is updated.
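The "quick and easy command" mentioned earlier is Delta Lake’s `CONVERT TO DELTA`, which converts a Parquet directory in place by writing a transaction log next to the existing files. A conversion script driven by the catalog of paths and DDL might generate its statements along these lines. The catalog structure and the example paths are hypothetical; only the SQL syntax is Delta Lake’s.

```python
def convert_statements(catalog):
    """Build one CONVERT TO DELTA statement per cataloged Parquet location.

    Each entry needs a 'path'; partitioned tables must also give their
    partition columns with types in 'partitioned_by', taken from the DDL.
    """
    statements = []
    for entry in catalog:
        statement = f"CONVERT TO DELTA parquet.`{entry['path']}`"
        if entry.get("partitioned_by"):
            statement += f" PARTITIONED BY ({entry['partitioned_by']})"
        statements.append(statement)
    return statements


# Hypothetical catalog entries, for illustration only.
catalog = [
    {"path": "s3://example-bucket/events", "partitioned_by": "dt STRING"},
    {"path": "s3://example-bucket/devices"},
]
plan = convert_statements(catalog)
# While the pipelines are paused, each statement would be run with spark.sql(statement).
```

Generating the statements first, rather than running them one by one by hand, makes it possible to review the whole plan before the pipelines are paused.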
So you cannot use the existing Glue crawlers anymore. You have to build your own mechanism to take the underlying data and refresh Athena’s metadata system periodically to let it see the updated Delta data changes. So this was our path to conversion from Parquet to Delta. And so at the end of it, we now had effective developer productivity. And the underlying data was converted to Delta, giving us automated metadata management, and also better performance because of the underlying Delta system having knobs for optimizing and consolidating files and so on. So based on this, we are now able to use SQL Analytics, which exposes this underlying data through SQL. And for that, we simply took the existing usages of Athena and started shifting them one application, one user at a time. Now that our Databricks Metastore was implemented and populated, we were able to start shifting these underlying applications from using Athena to using Databricks SQL Analytics.
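The Athena compatibility path described above has two parts: Delta generates a symlink manifest that tells Athena which Parquet files are current, and Athena is then told to re-scan partitions. A periodic refresh job could build the two statements like this. The table and path names are placeholders; the first statement is run in Spark, the second in Athena.

```python
def athena_refresh_statements(delta_path, athena_table, partitioned=True):
    """Return (spark_sql, athena_sql) for making a Delta table visible to Athena.

    athena_sql is None for unpartitioned tables, which have no partitions
    to repair.
    """
    spark_sql = f"GENERATE symlink_format_manifest FOR TABLE delta.`{delta_path}`"
    athena_sql = f"MSCK REPAIR TABLE {athena_table}" if partitioned else None
    return spark_sql, athena_sql


# Placeholder names, for illustration only.
spark_sql, athena_sql = athena_refresh_statements(
    "s3://example-bucket/events", "analytics.events"
)
```

The Athena table itself has to be defined over the `_symlink_format_manifest` directory once, up front; the refresh above only keeps it current.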
And for SQL Analytics, we started by creating a SQL endpoint, which we call general purpose, and started onboarding most of the applications to it. And only over a period of time, when somebody had a specific need for doing more, did we start creating custom endpoints and clusters. Also, note that in the SQL Analytics product, if you are not very excited about the SQLBot that they offer, you are able to use an external SQL IDE like DBeaver. And also SQL Analytics is able to serve web APIs. That was the thing I was complaining about, running out of resources. And we are now able to use SQL Analytics endpoints by scaling up the underlying clusters to meet API performance. So this is the end of my story. I talked about how we overcame a number of issues with workspace management, clusters and notebooks, how we upgraded the metadata and embraced SQL Analytics. I hope you found some of these insights useful. I want to transition and hand this over now to my colleague Raghav. And he’s going to talk about what he’s doing with ML in the company. Over to you, Raghav.
Raghav Karnam: Good morning and hello to everyone. I’m Raghav. And I lead the machine learning and data science team at Plume. I’m here to speak about the machine learning life cycle at Plume, and how we got help from Databricks to automate a lot of things. So Plume on the business front focuses on a lot of machine learning use cases. The first and primary one we started with was being able to deliver wifi to all the homes in a seamless way, okay? To deliver the wifi in the best possible way, and give [inaudible] the best experience of wifi, we need to understand what devices are there in your home and what movements you have in your home. For example, if most of the time you are in your living room, we have to adjust our wifi to your living room. And if most of the time you have a TV which is on five gigahertz, we have to adjust the wifi accordingly.
To do these things, we need to understand the nature of the devices in your home and the movements you have in your home. This is one very big space where machine learning has helped us. The second area was, once you have this wifi, what do you want to do with the wifi? You want to do a lot of things, but you want to be safe. You want to be protected while you’re going on the internet. So that’s where our second use case comes up, cyber security. [inaudible] we digitally protect our customers from external attacks. That’s the cyber security area, and we do lots of machine learning in that space. Then, like the topic indicates, we are in 20 million plus homes, with 500 million plus devices. That’s a lot of data coming in. And how can we help our consumers? Plume is a consumer home services business. Like I said, wifi is our first service.
The second is cyber security. The third is doing proactive maintenance. Is there a way we can know that you’re having a glitch in your wifi experience and fix it even before you realize it? That’s a huge area of interest for us, because we want the wifi experience to be the best for you. That’s another area. Then came another use case: we are already protecting you digitally. Can we do physical protection for you? For example, today when you are away from your home, let’s say you’re using [Ring] or any other product. You have sensors on your door, et cetera. And when you are away from home, if some intruder comes in, you get a popup saying, “Hey, there’s some intruder,” right? But here you need an external sensor. Can we do this without using external sensors, with just your wifi access points?
That was a challenge we wanted to take on. And how we do that is another area of machine learning focus. Can we do motion detection using wifi signals? Yes. The answer is yes, we have solved it to some extent, yeah. And within motion, there are a lot of challenges, like you’re away from your home but your pet is moving in your home. You should not be calling it an intruder. So there are a lot of intricacies where we have evolved and improved ourselves. And then there are a lot of innovation areas within Plume. Those are evolving and we probably cannot discuss them here, with all the machine learning happening within Plume. So these are the machine learning areas of focus for us. And there’s a huge expectation on the machine learning teams, because we have to be very fast in time to market and, at the same time, adapt to the scale at which we are growing.
Now, let’s go back to our day one. What are the challenges we faced in our first generation of ML lifecycle and MLOps, and how did we end up at Databricks, okay? Basically, our evolution started with a single focused question. Initially, when we started doing data science and machine learning, we had a very small team with enthusiastic engineers who wanted to do a lot of cool products which would go to market faster. And we did it like everyone. We had a single machine learning engineer and data scientist responsible for doing the various stages of the life cycle. But at some point of time, you want to increase the rate at which the features go to production, okay? And that’s when we started thinking, what are the various stages and how can we optimize each and every stage? For example, this is a very standard process, like the steps involved in machine learning.
You read the data, you do each and everything, you do a lot of things. How do you do that? The data engineers are supposed to curate the data in the raw format. Then building the model is generally a data scientist’s responsibility. So within that, how do you tune performance and tune your metrics, and work with the thresholds? For deployment of models, we generally have ML engineers doing that. How do you do that when an existing production model is there, and you add an additional model in A/B testing mode? How can you do it seamlessly?
Then the next step is when you’re integrating the models. You’re interacting with a lot of software engineers who are building microservices. How fast can you integrate with these other teams, and how well can they understand the machine learning models you give them, so that they can define a simple pass or fail? That’s another phase.
The last phase is operating the model in production. How do you see the model drift or data drift, et cetera, okay? So these are the areas of concern we had. And like Sameer was pointing out in an initial slide, the data scientists did a lot of these things. And we wanted to break it down in terms of process, in terms of people, in terms of technology and tools, and optimize these things. The people… What you can see here is clearly defining our journey. This is the people part and the teams part, okay? As for the tools and technologies, initially we were a small team, and a couple of us were smart enough to host the notebooks for the entire team, and we kept going. But it was not taking us too far. Then to track the ML experiments and performance metrics, we had an open source MLflow on a specific server for tracking experiments. But all of that burden was on the data scientist’s head. So how do we adapt this? How do we free up the space and the burden on the data scientist? That was the one question to be asked.
So that was the first question, which led us to the discovery of Databricks. Databricks offers an IDE called Databricks notebooks. These notebooks help us to spin up and spin down the Spark clusters. Previously, in the old world, data scientists used to spin up the EMR cluster and host the notebooks and MLflow themselves. That’s where we picked up Databricks, because it helps us to spin the clusters up and down. And the data scientist doesn’t need to worry: the IDE and the notebooks are always hosted by Databricks, so we don’t need to worry about it. These are the cool factors which took us in.
That was the first reason for going to Databricks. And then we truly used it for model performance metrics and model history: MLflow for both tracking the performance metrics and the registry, okay? So I’m walking you through a demo. And I have tried to balance this demo and everything in such a way that it caters to the interests of both beginners and people who are already using it. And I’ve taken the liberty of using some open source data for the sake of the demo, okay? And the general architecture at Plume with MLflow, this is an offline architecture, [inaudible] real time one. This is how it looks. We have various Databricks workspaces.
We have Databricks workspaces for development in general, and we have a production workspace, okay? The way the development workspace and the production workspace interact is through a common repository; that’s where we store our models, okay? And this is how the general architecture looks. Most of the time our data scientists and ML engineers are pulling the data from S3 or [Mongle] or some other third-party data sources. They’re doing the feature engineering on the Spark cluster. And they have their own ML framework. We use Horovod for the deep learning framework. And we use MLflow for tracking our experiments, and then the MLflow repository for registering the models. And the registered models are used by the production Databricks workspace. This is the general architecture which we have built, okay? And one of the good things about it is, for example, in the case of Plume, we have multiple workspaces. We have a Databricks workspace for EMEA, a Databricks workspace for the US, a Databricks workspace for Asia Pacific, okay?
And within that, each and every region, for example EMEA, has its own development workspace. It has its own staging workspace. It has a production workspace. So what we do is all our development happens in something called the development space, where we build our models, and then we push them to staging, to a common shared model repository. So we have reserved one workspace as a shared model registry workspace. That’s where we check in our models. And once the models are registered, they can be accessed from a different workspace. It can be in a different region also. You can have a centralized workspace in one particular region, and the workspaces in a different region can access those models.
So that’s what we have set up. And in the coming slides, I would like to demo all these things for you so that it becomes much more clear. Rather than just an architectural slide, I would like to show you something we’ve done with a small example. In this demo, what we want to show is how MLflow can be used to train, log and track machine learning models.
For this case, I’ll use a deep learning model. It demonstrates using an LSTM autoencoder. But more important is that we want to show you how MLflow helps us, okay? So within MLflow, you can always set an experiment with whatever name you would like. In my case, I’m just giving some random name over here, and you can always view the experiments by clicking over here. So that’s how you create [inaudible] on your experiment. Then start your experiment with a particular run name.
In this example [inaudible] a latest run time as such, but you can use your own custom naming convention for starting your run. Once you do that, the rest of the steps are very regular. You get your train data, you load it up. Then the most important step within each and every machine learning step is, like, what are the parameters you want to log? For example, when using your MLflow experiment, if you observe here, these are the two experiments which I’ve run in the past. If you click on this experiment, you see a lot of parameters which are very necessary for your model tracking. Then there are metrics, like loss or the number of anomalies or any custom metric that you are defining. So these are the metrics. So each experiment will have these parameters and metrics. So based on your needs and your machine learning algorithm, you pick these parameters and metrics. A few of them can be constant parameters. A few of them can be metrics, which will vary [eek] with each and every run.
Then it also has artifacts, if you observe these artifacts. It has an MLmodel file, a conda YAML file and everything. So going back, right? If you go back to the previous slide… Just a second. So once you create your experiment, you run your experiment with that particular name, and you have to define these parameters and metrics. In my case, cost is nothing but each and every business segment. Imagine your model is working with these segments: various countries, various segments of products or various segments of countries. You want to slice and dice and monitor your model performance on those slices. So it’s very important how you slice. You might have specific metrics in mind. You might have some time step or seasonality based ones involved. So whatever it is, you have to set those MLflow log parameters.
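The tracking steps just described (set an experiment, start a named run, log parameters, log metrics) map onto a handful of MLflow calls. The sketch below is not the notebook from the demo; the experiment path, run name, and the particular parameters and metrics are made-up examples. The `mlflow` functions used are the standard tracking API, imported inside the function so that the helper can be defined even where MLflow is not installed.

```python
def log_training_run(experiment_name, run_name, params, metrics):
    """Record one training run: constant parameters plus resulting metrics."""
    import mlflow  # lazy import: only needed when a run is actually logged

    mlflow.set_experiment(experiment_name)           # created if it does not exist
    with mlflow.start_run(run_name=run_name) as run:
        mlflow.log_params(params)                    # e.g. batch size, segment
        for name, value in metrics.items():
            mlflow.log_metric(name, value)           # e.g. loss, anomaly count
        return run.info.run_id


# Example call with hypothetical names and values:
# run_id = log_training_run(
#     "/Shared/lstm-autoencoder-demo",
#     "run-2021-05-01",
#     {"batch_size": 64, "segment": "emea"},
#     {"loss": 0.031, "num_anomalies": 12},
# )
```

Logging the business segment as a parameter is what later allows runs to be filtered and compared per slice in the experiment UI.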
And this is how you set your log parameters. Once you set your log parameters, you can follow the general process in building your deep learning model. I’m going really fast here because these are regular steps. I’ll use the tuner, which is a [inaudible] space tuner. You can use any tuner for your hyperparameter tuning, okay? Then the last thing which I wanted to explain: this is how you’re defining your parameters, where you’re fixing the batch size and everything. And then this will be my architecture, where I have an LSTM layer and dropout, et cetera. You use your own architecture. Once you train your model, okay? And you log your metrics, okay? What you’re supposed to do is register your model. That’s what I wanted to cover. This is all training the model. One of the most important things which I found is how you wrap your pre-processing, the feature engineering and the model together, so that you get a Python class which can be leveraged by inference engines.
So they have a very beautiful thing called the MLflow PyFunc model, where you can add your custom pre-processing as well as the model together so that that can be [inaudible]. So that’s what we are doing here. We have a custom pre-processing module which is able to leverage your model. You’re also going to set your environment variables, and you can set your labels, okay? Once you do it, you [inaudible] Python model, and then we can move on and register the model to a remote repository. I’ll just show you how you register the model to a remote repository.
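The idea of wrapping pre-processing and model into one class can be shown with a small example. This is a minimal sketch, not the demo’s code: the standardization step and the `ThresholdModel` stand-in are invented for illustration, where the talk uses an LSTM autoencoder. When MLflow is installed the class derives from the real `mlflow.pyfunc.PythonModel`; the fallback to `object` is only there so the sketch runs without it.

```python
try:
    from mlflow.pyfunc import PythonModel  # real base class when MLflow is installed
except ImportError:                         # keeps the sketch runnable without MLflow
    PythonModel = object


class PreprocessThenPredict(PythonModel):
    """Bundle feature pre-processing with a trained model behind one predict()."""

    def __init__(self, model, mean, std):
        self.model = model
        self.mean = mean
        self.std = std

    def _preprocess(self, values):
        # Hypothetical pre-processing: standardize with training-time statistics.
        return [(value - self.mean) / self.std for value in values]

    def predict(self, context, model_input, params=None):
        return self.model.predict(self._preprocess(model_input))


class ThresholdModel:
    """Stand-in for a trained model: flags inputs far from zero as anomalies."""

    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, values):
        return [abs(value) > self.threshold for value in values]


wrapped = PreprocessThenPredict(ThresholdModel(2.0), mean=10.0, std=2.0)
flags = wrapped.predict(None, [10.0, 16.0])
```

Because the statistics travel with the model, whoever loads it for inference cannot accidentally apply different pre-processing from what was used in training.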
You want to find your best experiment, based on, in your case, your [inaudible] function or whatever you’re choosing, right? In my case, I’m just setting this experiment. I’m going to use this code to pick the experiment based on the logs, okay? And I run my logs, I get the best experiment. Then comes using the registry, and this is what I wanted to share with you. In the slides I was showing you how there are multiple workspaces and one workspace which has a common registry. So here, for the sake of this demo, there is a particular registry URI, and this is where you can set your registry URI. And then you can push that particular run of your model to that particular registry, okay? This is a little tricky: with the registry URI, this workspace will have access to the common workspace using keys, okay?
And that is nothing but a scope and a set of three keys to configure. Once they are configured, you can access the common repository. This is how you register your model. Once you register your model, I just want to show you how the registered models look. An example of registered models is like, let’s say once this is registered, you can see the various versions of it, each and every version, how it’s performing and everything. So this is version three of it. For the sake of the demo, I did not pull up the runs and everything, but that’s how your registered models look.
How does your inference look? The inference looks like this, just give me a second. For your inference, you’re using your PyFunc model, and you can invoke it on your data. And this is very important because of the registry URI. The registry URI is the one which you have to carry across your workspaces. Based on the registry URI, you pick the latest version of the model, or in my case I’m taking the older model, version two.
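Putting the registration and inference steps together: a workspace reaches the shared registry through a registry URI of the form `databricks://<scope>:<prefix>`, where the scope is a secret scope holding three keys for the remote workspace (to my understanding `<prefix>-host`, `<prefix>-token` and `<prefix>-workspace-id`). The sketch below uses the standard MLflow calls for this; the scope, prefix, and model names in the commented example are placeholders rather than Plume’s, and `mlflow` is imported lazily inside each function.

```python
def registry_uri(scope, prefix):
    """URI of a model registry in another Databricks workspace."""
    return f"databricks://{scope}:{prefix}"


def model_uri(name, version):
    """URI of one specific version of a registered model."""
    return f"models:/{name}/{version}"


def register_run(run_id, name, scope, prefix, artifact_path="model"):
    """Register the model logged by a run into the shared registry."""
    import mlflow

    mlflow.set_registry_uri(registry_uri(scope, prefix))
    return mlflow.register_model(f"runs:/{run_id}/{artifact_path}", name)


def load_for_inference(name, version, scope, prefix):
    """Load a pinned model version from the shared registry as a PyFunc model."""
    import mlflow

    mlflow.set_registry_uri(registry_uri(scope, prefix))
    return mlflow.pyfunc.load_model(model_uri(name, version))


# Example with placeholder names:
# model = load_for_inference("anomaly-detector", 2, "ml-registry", "shared")
# predictions = model.predict(batch_dataframe)
```

Pinning an explicit version, as the demo does with version two, keeps inference reproducible; switching to the newest version then becomes a deliberate change rather than a side effect of someone registering a new model.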
You can automate this: you take a particular model name and a particular version and load that model here. And then you can run your inference on top of it and voila, you have your results. I think this is where I want to conclude. In this case, these are all the regular trends. The red dots are nothing but the anomalies or outliers in the performance. So I hope you got some glimpse. I have tried to give an overview of how you use MLflow for logging your metrics, running and registering your models, and doing your inference. Thank you everyone. Have a nice day.
"Sameer Vaidya leads data architecture at Plume, serving the exponentially growing demands for Analytics and BI/insights from Products, Marketing, NetOps, Sales, Finance/Accounting and 170+ CSP custom...
Raghav Karnam leads ML Engineering at Plume, serving the models which impact 500 Million plus devices on our network. His team is focused on all aspects of ML
1) Infrastructure & Tooling for ...