Comcast is one of the leading providers of communications, entertainment, and cable products and services. At the heart of it is Comcast RDK, which provides the backbone of telemetry for the industry. RDK (Reference Design Kit) is pre-bundled, open-source firmware for a complete home platform covering video, broadband, and IoT devices. The RDK team at Comcast analyzes petabytes of data, collected every 15 minutes from 70 million devices (video, broadband, and IoT) installed in customer homes. They run ETL and aggregation pipelines and publish analytical dashboards on a daily basis to reduce customer calls and guide firmware rollouts. The analysis is also used to calculate the WiFi happiness index, a critical KPI for Comcast customer experience.
In addition to this, the RDK team also does release tracking by analyzing RDK firmware quality. SQL Analytics allows customers to operate a lakehouse architecture that provides data warehousing performance at data lake economics, for up to 4x better price/performance for SQL workloads than traditional cloud data warehouses.
We present the results of the “Test and Learn” with SQL Analytics and Delta Engine that we ran in partnership with the Databricks team: a quick demo introducing the native SQL interface, the challenges we faced with migration, the results of the execution, and our journey of productionizing this at scale.
Molly Nagamuthu: Hey everyone, thank you for being here. Welcome to a talk on SQL Analytics. For the agenda today, we will set the context for SQL Analytics by briefly looking into the Lakehouse Platform: what it is, why we need it, and where SQL Analytics fits in. We will also take a deeper look into SQL Analytics. My co-presenter for today, Suraj Nesamani, will follow it up with all the details of the telemetry use case that we tested it for. Before we dive in, a brief introduction about myself. My name is Molly Nagamuthu. I am a resident solutions architect at Databricks. In this role, I have the amazing opportunity to work with our key strategic customers, helping them with the toughest data and AI implementations by leveraging the Databricks platform. I have 20-plus years of experience in product development, engineering, and professional services. It has been a rewarding journey in diverse industries at some of the wonderful companies I’ve had the opportunity to work with. I’ve been with Databricks for about two years and I’m loving it.
Let’s get started. Databricks’ vision is to enable data-driven innovation for all enterprises. And to make it all possible, we introduced the Lakehouse Platform. Lakehouse is a concept Databricks introduced to make it easier for businesses to combine the best of data warehouses and data lakes into one single platform. And why do we need this? Because despite all the advances in the field, most enterprises still struggle with their data. If you need to build an end-to-end data-driven enterprise, you need to build the four different stacks shown here to handle all of your data needs: data engineering, data warehousing, real-time ingestion, and data science and ML. Typically, as it stands today, these are very different technologies, and they generally don’t work well together. As you can see, there is plenty of tooling available.
The only issue is that they’re disconnected, and in some cases you have to deal with a variety of proprietary formats. When you add personas with varied skill sets, now we are talking siloed data teams which do not communicate effectively and, as a result, decreased productivity. The core problem is the technologies these stacks are built upon. To solve these problems, Databricks offers a complete Lakehouse Platform. It operates across all the clouds, so it is multi-cloud, and it is tightly integrated with the different cloud-native capabilities. Your data can remain in an open data lake.
Databricks with Delta delivers the data management and governance to ensure security, reliability, and performance. It supports all workloads on one platform, from data engineering and BI to data science and real-time applications. Keeping all of this work in one platform simplifies the work of the data team and allows for easy collaboration. Now let’s focus in on Databricks SQL Analytics.
So far in this journey with our 5,000-plus customers, and as the original creators of Spark, Delta, and MLflow, we have gathered a lot of experience working with enterprises in every industry. We have established a proven track record for data ingest, ETL processing, and machine learning. The only remaining gap was to provide a robust mechanism for BI and SQL analytics.
And here we have the newest addition to our suite of tools: Databricks SQL Analytics. It offers a home for SQL analysts within Databricks and provides the ability to query your lakehouse with better price/performance, simplify discovery and sharing of new insights, connect to familiar tools like Tableau or Power BI, and simplify administration and governance. In addition to the native query and visualization tools, SQL Analytics provides support for all of your existing BI applications. There is a pretty broad set of BI tools, including Tableau and Power BI, that you can connect to SQL Analytics endpoints for reliable data and fast query performance.
Now let’s look at some of the broad use cases it supports. They’re mainly threefold. The first one is to maximize existing investments by connecting your preferred BI tools, such as Tableau or Power BI, to your data lake with SQL Analytics endpoints. The connectors are completely re-engineered and optimized to ensure fast performance, low latency, and high user concurrency to your data lake.
The second one is to respond to business needs faster with a self-serve experience designed for the analysts in your organization. Databricks SQL Analytics provides simple and secure access to data. You have the ability to create or reuse SQL queries and quickly mock up and iterate on the visualizations and dashboards that fit the business best. And last but not least, you can build rich, custom data-enhanced applications for your own organization or your customers. Now, I would like to point out that the key here is that you’re doing all of this on your data lake and not moving data to any tool to get these insights. Data is always available on your data lake.
Now let’s see how it works and what components make it possible. We first need to make sure we have curated data in the data lake in an open format; it does not matter whether it is structured or unstructured data. The data lake becomes the foundational piece of the architecture. Of course, we need a native SQL interface to query the entire data lake and be able to visualize the results. If we are connecting existing BI tools, then we need optimized connectors to connect to your data, and these are provided by Databricks. The next biggest thing is the compute itself. If you have used Databricks in the past, you know how easy it is to set up your compute by creating and configuring clusters, but SQL Analytics takes it a step further with SQL endpoints, abstracting a lot of that complexity and making it a lot simpler.
The most important part of this is a powerful, completely rewritten, vectorized SQL engine, with all the performance boost you can get. To see all of this in a Databricks workspace, let’s break into a quick demo. Just to clarify, this is not a demo of the use case that we are going to present; this is a continuation of the previous slide, to see all the components we just discussed and what they look like in a Databricks workspace. Let’s go ahead and log into a Databricks workspace. Once you log in, you will see your regular data engineering workloads and your clusters, the same as you’re used to. You will also be able to see your end-to-end ML pipelines and all your ML models. And if SQL Analytics is enabled in the workspace, you can access it from here.
Let’s go ahead and click that. This is the SQL Analytics home screen. Let’s look at the exploratory data analysis use case first. We can go ahead and create a query, a dashboard, or an alert. Once we click on create query, you see the SQL editor here. The editor comes enabled with autocomplete, so as you type, you get to see what options are available, and you can always disable it by clicking here. You can write and try out queries very easily if you need to. And this is the data browser for all the databases you have access to; access can be controlled pretty easily using GRANT statements. You can see the schemas for the tables in the databases that you have access to.
And here is the endpoint; we will talk about this in a bit. This is the main compute for SQL Analytics. Let’s load a query that we already have. We are just exploring the TPC-DS database, firing a simple query to look at what call centers are available. Let me go ahead and execute this; it won’t take much time, and there we have it. For the results, we can add a visualization very easily: you can pick the x-axis, the y-axis, and any aggregations by columns, and create whatever you want. We just fired a very simple query, so I’m going to use the simplest visualization possible: I’m just looking at how many employees there are per call center in each of these cities.
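The employees-per-city aggregation shown in the demo can be sketched as follows. This is a minimal illustration using an in-memory SQLite table as a stand-in for the TPC-DS call_center table; the column names cc_name, cc_city, and cc_employees follow the TPC-DS schema, but the rows here are invented.

```python
import sqlite3

# In-memory stand-in for the TPC-DS call_center table (toy data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE call_center (cc_name TEXT, cc_city TEXT, cc_employees INTEGER)"
)
conn.executemany(
    "INSERT INTO call_center VALUES (?, ?, ?)",
    [("NY_1", "New York", 120), ("NY_2", "New York", 80), ("SEA_1", "Seattle", 95)],
)

# Employees per city -- the same shape of aggregation built in the demo,
# which then feeds the bar-chart visualization.
rows = conn.execute(
    "SELECT cc_city, SUM(cc_employees) FROM call_center "
    "GROUP BY cc_city ORDER BY cc_city"
).fetchall()
print(rows)  # [('New York', 200), ('Seattle', 95)]
```

In the actual demo the same GROUP BY runs against the SQL endpoint instead of SQLite, and the result set is attached directly to a visualization.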
Once you have a visualization for the query, you can easily add it to a dashboard. Now let’s see how we can pull up these dashboards. We have a dashboard where we added that query, and another query that we added to the dashboard, which is just the aggregation I had run. It’s also possible to set up alerts on these queries for some condition. Here I have an alert set on the sum aggregate being greater than a million dollars, and I can add any number of users here as email recipients for the alerts, so it’s very easy to set up. You can also put the dashboard on a regular refresh schedule; I don’t have that set up here, I’m just showing you the capability.
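The alert logic described here boils down to re-evaluating a threshold condition against the latest query result on each refresh. A minimal sketch of that evaluation, with the one-million-dollar threshold from the demo (the function name and shape are illustrative, not the product’s internals):

```python
# Illustrative alert check: fire when the monitored aggregate crosses
# the configured threshold ($1M, as in the demo).
THRESHOLD = 1_000_000

def should_alert(aggregate_value: float, threshold: float = THRESHOLD) -> bool:
    """Return True when a dashboard refresh should trigger the alert."""
    return aggregate_value > threshold

# On each scheduled refresh, the latest query result is re-evaluated
# and recipients are emailed only when the condition holds.
print(should_alert(1_250_000))  # True
print(should_alert(900_000))    # False
```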
Once you have it on a refresh schedule, the alerts will be triggered based on the refreshes happening to the dashboard. So that was how we can use SQL Analytics for exploration. Now let’s look at the endpoints. Here there are two endpoints running, and as I mentioned before, these are t-shirt sized: large and medium. Creating one is very simple; there’s not a lot you have to do. Just the name, the t-shirt size of the endpoint, the number of minutes of inactivity before auto-stop, the load-balancing minimum and maximum, and whether we need Photon on or off; Photon is the vectorized query engine that we just talked about. And of course the spot instance policies, et cetera.
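The handful of settings involved in creating an endpoint can be summarized as a small config sketch. The field names below are descriptive stand-ins for the settings listed above, not the exact Databricks API payload, and the endpoint name is hypothetical.

```python
# Illustrative endpoint settings -- field names are descriptive
# stand-ins, not the exact Databricks API schema.
endpoint_config = {
    "name": "demo-endpoint",          # hypothetical name
    "size": "Large",                  # t-shirt size
    "auto_stop_minutes": 30,          # stop after 30 idle minutes
    "min_clusters": 1,                # load-balancing lower bound
    "max_clusters": 4,                # load-balancing upper bound
    "photon_enabled": True,           # the vectorized query engine
    "spot_instance_policy": "cost_optimized",
}

# Sanity-check the load-balancing bounds before "creating" the endpoint.
assert endpoint_config["min_clusters"] <= endpoint_config["max_clusters"]
```

The min/max cluster bounds map to the behavior shown later in the demo: the endpoint was allowed up to four clusters but only ever used one under light load.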
Once we have the endpoint, let’s look at its connection details. It comes with this URL, which is what you would use to connect from Tableau, Power BI, or any BI tool you need. There is also a SQL endpoint API that you can use to automate the endpoint and execute queries from your applications. The endpoint also comes with monitoring metrics: the number of queries, the peak running queries, and the peak queued queries. Here we do not have a lot going on; it is a shared cluster, so it’s not just the load that we just ran. It also shows the number of clusters the endpoint is using. Here it’s not more than one: we had it set to a min of one and a max of four, but it looks like it’s just using one cluster.
So these are some of the features that we just discussed about SQL Analytics. Throughout the summit there are talks on each of these; we’re not going to dive into them now, because we still have to discuss the use case that we tested, but there are plenty of talks that do a deep dive on each of these components, so please take a look at those. And this is the cluster-level mapping for each t-shirt-sized endpoint.
Let’s look at some additional resources. There are a lot of them available on the Databricks website. We can also accelerate your BI use case on the Databricks Lakehouse Platform with our SQL Analytics MVP customer success offering, which is coming in Q2. For each of the components I mentioned and showed, there are related talks happening at the summit. If you have missed them, they will be available on our virtual platform, I think for a month or so. And now, to tell us all about the telemetry use case that we worked on at Comcast, I hand off to my co-presenter for today, Suraj Nesamani. After you, Suraj. Thank you, everyone.
Suraj Nesamani: Thanks, Molly. Hello everyone, thanks for being here. Before I start, let me introduce myself. My name is Suraj Nesamani. I’m a principal engineer working on RDK at Comcast. I have 15 years of experience in engineering, mostly specializing in RDK telemetry and big data analytics. I lead a team of analysts and engineers who are responsible for building the RDK analytics platform. I specialize in RDK telemetry analytics, the WiFi happiness index, recommendation engines, and of course handling petabyte-scale datasets. So let’s now talk about RDK [Inaudible]. RDK stands for the Reference Design Kit, and it is open-source firmware that is pre-bundled with the devices in your home that come with Comcast [Inaudible] in it, be it a video device, a broadband device, or IoT devices. To give you an example: you’re on the phone with a service tech because your set-top box or your WiFi gateway is not working, and they are sending a signal to your set-top box.
That is the firmware they are interacting with. In the early days, Comcast created this specification, but now it’s open source and very popular in the media industry as well as the telecom industry, and almost all devices that have RDK have this capability of doing telemetry. We have close to 70 million devices installed across North America. We do a lot of ETL aggregations on the telemetry data from these RDK devices, and we develop dashboards with this data, which are used for business justification, decision making, and rolling out firmware. Now let’s talk about RDK telemetry and the data pipeline architecture. Any device that comes pre-built with RDK firmware, we call an RDK device. We have close to 70 million devices distributed across the country, and we collect close to 24 terabytes of data per day from them.
As the word says, telemetry is nothing but data collected by the device itself. An RDK device sends telemetry events every 15 minutes, and the data is collected up in the cloud. Now let’s see how the data is collected in the cloud. RDK telemetry data is transmitted over HTTP. We use Apache Flume to collect the incoming data, and once the data is collected on the Flume node, we send it to Apache Kafka for real-time streaming, as well as to Splunk to do some [inaudible] queries, to see how the data looks or whatever we want to see on the data. Now let’s see how we process the data. We have this data in real time on Kafka, so we stream this data, process it with Spark on EMR, and generate formatted Parquet files. These files are stored in S3, which is our storage layer.
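Since devices report every 15 minutes, a natural first-stage rollup in pipelines like this is bucketing events into 15-minute windows. A minimal standard-library sketch of that windowing (the event shape here is invented for illustration; the real payloads and the Spark-on-EMR implementation are far richer):

```python
from collections import Counter
from datetime import datetime, timedelta

def window_start(ts: datetime, minutes: int = 15) -> datetime:
    """Floor a timestamp to the start of its 15-minute window."""
    return ts - timedelta(
        minutes=ts.minute % minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )

# Toy telemetry events: (device_id, timestamp).
events = [
    ("dev-1", datetime(2021, 5, 26, 10, 3)),
    ("dev-2", datetime(2021, 5, 26, 10, 14)),
    ("dev-1", datetime(2021, 5, 26, 10, 17)),
]

# Events per 15-minute window -- the shape of a first-stage aggregation.
counts = Counter(window_start(ts) for _, ts in events)
print(counts[datetime(2021, 5, 26, 10, 0)])   # 2 events in the 10:00 window
print(counts[datetime(2021, 5, 26, 10, 15)])  # 1 event in the 10:15 window
```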
Then we push this data from S3 to Redshift, and we run aggregate queries on it to get the aggregated data that is used for business decisions on dashboards, or for any ML or analytics. Now let’s see how our Redshift cluster looks. Our active cluster is a 12-node cluster with dc2.8xlarge nodes. We execute over a thousand queries a day, and the CPU usage is heavy most of the time, given the amount of data we have and the very heavy joins we use to get the aggregated data. To give an example, say call volumes start spiking: customers started to call, we started getting a lot of calls, and we wanted to see which area or region the calls were coming from.
So now you can imagine: I have this call data, I have this telemetry data, then I have this account data, and I have to join the whole thing and see where the problem is lying. This keeps the cluster busy, always. But we built this pipeline five years ago, and in those days it was cutting-edge technology, the best we could do for our data. It has been working well, but we felt it was time to explore new ideas and see how much better we could do with the new technologies out in the market. Now, let’s see the pros and cons of the Redshift cluster. Ingesting and storing the data was easy: just add nodes. And Redshift is very good at handling complex queries; it’s robust. But we had challenges storing data for longer durations, because then you have to spin up more clusters, which was expensive.
And price was another main concern when we started spinning up more clusters. The fact that the compute and storage are tightly coupled gave rise to some challenges: we had to provision and pay for peak loads, and that became expensive as the datasets grew. And these are static clusters that just sit there. Every morning we need certain business dashboards populated so teams can make decisions for the whole day, so we have to have the data populated no matter what. When the CPU spikes beyond the workload management limits we set, queries get aborted, and that makes the dashboards look empty and leaves the teams stuck.
So no matter what, we have to populate this data, and we are always fighting against the CPU and the storage, deciding which query to run first, which to give more preference, and so on. So we were exploring alternatives and looking at various other solutions, and that’s when we started working with Databricks a couple of years ago. It’s been a very progressive relationship so far. We migrated a few of the complex Redshift pipelines using Spark 3, Delta, and Databricks, which works really well. We also migrated some of the EMR workloads to the Databricks platform, worked with Databricks on some query optimizations, and took some Databricks training, which was very useful. We upgraded to a new version of the Databricks platform, which was more secure, scalable, and simpler to manage. As described earlier, our request for a faster, cheaper, and more efficient way to handle analytics pipelines was answered.
When I saw this picture in the Lakehouse architecture slides, it made a lot of sense for our platform, and we wanted to give it a try. Databricks enabled the SQL Analytics private preview in our workspace; why not give it a shot, since it’s in private preview anyway and we were privileged to try it? To set the scope for the Test and Learn, we took the ten slowest, heaviest queries that we always get into problems with in Redshift: timing out, spiking the CPU usage. The ten worst queries we have in Redshift, you could say. We thought, okay, we’ll execute the same queries in SQL Analytics and see how it goes. It was pretty hard to test queries individually in a production environment.
Let me set the expectations. Since we have other queries running all the time in the production environment, mimicking that environment to see how fast a query runs is very expensive; you would have to spin up the same kind of cluster. As we ran these queries, we tried our best to keep the CPU idle, but it’s still not an apples-to-apples comparison, so to say. The closest we could do was get a general ballpark number by running notebook jobs on Databricks. We also wanted a comparison of running it on Databricks as-is, on the Parquet format, without converting to Delta.
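A "general ballpark number" of this kind usually comes from a simple wall-clock harness around each workload. A minimal sketch of that idea (run_query here is a placeholder for whatever submits the query, such as a notebook job; it is not a real Databricks API):

```python
import statistics
import time

def ballpark(run_query, repeats: int = 3) -> float:
    """Run a workload a few times and return the median wall-clock seconds.

    `run_query` is a placeholder callable for whatever executes the
    query (a notebook job, a JDBC call, etc.).
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

# Toy workload standing in for one of the ten heavy queries.
median_s = ballpark(lambda: sum(range(100_000)))
print(f"median: {median_s:.4f}s")
```

Using the median rather than a single run smooths out one-off noise, which matters when, as here, the environments being compared cannot be made identical.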
We had allocated two to three weeks for the Test and Learn, and we decided to use the Databricks internal metastore, as most of our data is in S3. We also had time constraints for testing this out. We needed admin access to create data [inaudible] analytics endpoints. We did not dwell too much on ACLs for the various queries; we wanted to get it up and running fast and see how everything worked and performed. The first task was converting the queries to Spark SQL: 90 percent converted straightforwardly, and we had to come up with creative workarounds for the rest. We had a general idea of the node size because we had run the notebook jobs, and we tested a SQL Analytics extra-large endpoint. We have seen the benefits of Delta in the past, plus we wanted to see how it performed with Photon, so we tested with Photon as well.
On the next slide you can see the results, and it was pretty amazing to see them. The workloads we tested were broad-ranging, and the ones that performed really well were the ones with aggregations and complex queries. For example, if you see query two, we are joining a dataset with 52 billion rows against 3 million rows. The one with a lot of regular expressions showed only a marginal difference with or without Photon. As for our observations: SQL Analytics was in private preview when we ran our POC. The native SQL interface was very intuitive and easy to use. Creating an endpoint is extremely simple; it helps SQL analysts to a great extent. As of now, SQL Analytics does not have support for UDFs. We did not test ACLs too much, but they seemed simple enough.
A centralized catalog would be really nice. We were pretty excited about the results of the POC, but we also have multiple Databricks workspaces that we work with. Most of our data is in S3, and we currently do not use a catalog, but it looks like Databricks has a solution for that too, called Unity Catalog. This feature looks very promising, and we are looking forward to using it and seeing how it goes. We are still at the early stage of experimenting with this Lakehouse and how it fits our needs, but this project was exciting enough to share with you all. Thank you.
Suraj Nesamani is a Principal Engineer at Comcast specializing in RDK Analytics. He is the creator and lead developer of the RDK dashboards used for making business decisions such as firmware...
Molly Nagamuthu is a Senior Resident Solutions Architect at Databricks. She has been working with their top-tier strategic customers solving some of the toughest Big Data problems at scale in both Healthca...