Comcast is the largest cable and internet provider in the US, reaching more than 30 million customers, and continues to grow its presence in the EU with the acquisition of Sky. Over the last couple of years, Comcast has shifted its focus to the customer experience. For example, Comcast has rolled out the Flex device, which allows customers to stream content directly to their TVs without needing an additional cable subscription. With this shift in focus, Comcast has made a concerted effort to keep making data-driven decisions to understand how customers interact with its products while continuing to innovate with new products and subscriptions. The Product Analytics & Behavior Science (PABS) team plays a crucial role as an interpreter, transforming data into consumable insights and providing these insights to the broader product teams within Comcast. The PABS team does this across the entire product ecosystem, including X1, xFi, and the brand new Flex devices. This ecosystem, one of the largest streaming platforms in the world, generates data at a rate of more than 25 TB per day, with over 3 PB of data being used for consumable insights. To keep driving consumable insights on massive data sets while controlling the amount of data being stored, the PABS team has been using Databricks and Databricks Delta Lake to do highly concurrent, low-latency reads and writes, building reliable real-time data pipelines that deliver insights while also performing efficient deletes in a timely manner. This session covers some of the Delta features the team took advantage of to achieve the desired levels of efficiency, optimization, and cost savings.
– Hi, everybody, I’m Jim Forsythe from Comcast and today I’m here to talk to you about data driven decisions at scale.
Our journey over the past 10 years has been filled with digital innovation, focused on making amazing experiences our customers love. And if you look from left to right, you can visibly see the changes we’ve made in our hardware and also in the design of our software. As this COVID pandemic has forced us into social distancing, our internet service has become more essential than ever, and we depend on these services not only for entertainment, but now also for our professions. And many of you, as well as myself, are Comcast customers, and we’re leveraging this internet service to attend this digital conference right now. Late last year, we launched a brand new service called Xfinity Flex as an extension of our entertainment services, included in our internet service at no charge, allowing you to watch all your favorite content, wherever that content may be, giving you a better way to stream. We add new apps to Flex all the time, making sure that all of your favorite content is available for you to watch, and, of course, you can spend less time searching and more time watching with our amazing voice remote.
At Comcast, we truly believe data is essential to creating simple, easy, and awesome customer experiences. We use data to develop amazing products and inform our everyday decisions. Data represents the voice of the customer at scale, and we use this data to translate these bits of information into rich events and insights.
This data powers not only our AI services; the same data allows our organization to make more objective decisions based on facts and not opinions.
This is why data driven decisions at scale are so critical. And many of you are data people watching this conference right now. And you hold the power to enable your companies and organizations to make data driven decisions. If you don’t empower these organizations and companies to make decisions based on facts, they will default back to opinions. And with decisions based on opinions, we move away from the scientific principles we once instilled in our companies and organizations. And today, I’m gonna walk you through two major challenges that come up with making data driven decisions at scale.
The first is processing. Having and processing all this data at really big scale is quite a challenge, and making sure that you continue to keep up with the data and the processing enables you to maintain and continue to keep data available to your organization. And then, subsequently, this data needs to be made actionable. You need to increase the value of this data set by leveraging analytics, insights, and models to develop new products and services, and then evaluate those products and services by running A/B tests. So, let’s dive right in.
So, processing data at scale. This seems relatively simple, right? From left to right, we have all these raw events, we process them, and we store them in a data lake. But, as many of you know, there’s a lot of complexity behind these arrows and for us it really starts at the first step.
We have millions and millions of transactions every second. And all these data services really combine to create a really complex system that our customers engage with heavily. These millions and millions of transactions, we need to process, which are terabytes of information, and then also store intelligently within our data lake, enabling petabytes of data to be used. This is truly massive scale.
Looking at all of our services, our entertainment services create and generate a ton of data. Our X1 platform alone is used very heavily, and now adding an additional service, Flex, just creates more customers and more data. These services depend on our internet service, and as many of you connected at home on that service right now know, you have tons and tons of devices, all of which you simply expect to be connected.
And many of our customers use these services, also along with our mobile apps, concurrently. This generates lots of data and we need a place to tell our engineers to put this data to get it into our data platform.
So, the first step is, we need to collect all these events, and we standardize on two different methodologies: streaming and batching. For streaming, we use Kinesis and Kafka because they provide us a highly durable and scalable way to stream our data into our platform. We also use S3, as it’s a great place to store data that we don’t need in near real time, which we can then batch into our system. A good tip for using S3 is setting up event notifications. You can then listen to these notifications to know when new files have arrived, or how much data has been loaded since your last batch. You can then use this to evolve away from more traditional processes, like cron jobs, that are dependent purely on time and add complexity, and instead set better expectations of when you should kick off your next process, based not only on time but also on the number of files.
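The idea of triggering a batch on file count rather than a cron schedule can be sketched in a few lines of Python. This is a minimal, illustrative sketch: the payload shape follows the standard S3 event notification format, but the threshold and function names are hypothetical, not Comcast's actual pipeline code.

```python
import json

# Assumed, illustrative trigger threshold: kick off the next batch
# once this many new files have landed since the last run.
FILE_COUNT_THRESHOLD = 100

def count_new_objects(notification_json: str) -> int:
    """Count ObjectCreated events in one S3 event notification payload."""
    payload = json.loads(notification_json)
    return sum(
        1
        for record in payload.get("Records", [])
        if record.get("eventName", "").startswith("ObjectCreated")
    )

def should_run_batch(files_since_last_batch: int) -> bool:
    """Trigger on accumulated file count instead of wall-clock time alone."""
    return files_since_last_batch >= FILE_COUNT_THRESHOLD

# Example payload shaped like an S3 event notification with two new files.
event = json.dumps({"Records": [
    {"eventName": "ObjectCreated:Put", "s3": {"object": {"key": "a.parquet"}}},
    {"eventName": "ObjectCreated:Put", "s3": {"object": {"key": "b.parquet"}}},
]})
```

In practice the notification would arrive via SQS, SNS, or Lambda; the point is that the batch scheduler reacts to what has actually landed instead of firing blindly on a timer.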
And the key to doing this is having a consistent methodology of processing. And in my mind, this is the most important part. We use Databricks to build our pipelines, which many people would consider ETL functions, but we do not. We call these processes pipelines because we continuously enrich and optimize them to build better data products and services. In enriching all of these events that we process, and being a product team, many of these data elements are explicit events. And we work to develop these events into implicit context so that our teams and machines can leverage these events to make intelligent decisions. Many of these pipelines involve complex processes like sessionization. When you combine that with streaming, batching, joins, and multi-step enrichment processes, this turns into a very complex problem. But with the power of Databricks, the problem is simpler. We can focus on code and the way that we orchestrate the events and jobs together, and focus less on operations. This enables us to focus on delivering value to our organization and our customers and spend less time doing DevOps. But all this data needs to land somewhere. So, we move and land all this data into our data lake, using AWS Glue to manage our meta store and writing all this data out to S3.
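To make sessionization concrete, here is a minimal sketch of the core idea in plain Python: group a user's events into sessions whenever the gap between consecutive events exceeds an inactivity threshold. In production this runs as a Spark pipeline on Databricks; the 30-minute gap here is an assumed, illustrative value, not necessarily the one the PABS team uses.

```python
from datetime import datetime, timedelta

# Assumed inactivity gap: events separated by more than this start a new session.
SESSION_GAP = timedelta(minutes=30)

def sessionize(timestamps):
    """Group a single user's event timestamps into sessions.

    A new session starts whenever the gap since the previous event
    exceeds SESSION_GAP; otherwise the event extends the current session.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= SESSION_GAP:
            sessions[-1].append(ts)   # within the gap: extend current session
        else:
            sessions.append([ts])     # gap too large: start a new session
    return sessions

events = [
    datetime(2020, 6, 1, 9, 0),
    datetime(2020, 6, 1, 9, 10),
    datetime(2020, 6, 1, 11, 0),  # more than 30 minutes later: new session
]
```

The hard part at scale is exactly what the talk describes: doing this per user across streaming and batch inputs, with joins and multi-step enrichment, which is where Delta and Databricks carry the operational load.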
The formats that we standardize on are Parquet for our standard data sets that don’t have a lot of complexity, or may not be that big in size, and Delta when things get complex and really big. Delta has empowered us to make really complex sessionization pipelines simpler. Much of the work that we had to do in the past that was custom, like partitioning buckets and optimizing jobs that run concurrently, all kind of goes away with simple configuration within Databricks. Let’s walk through one of these pipelines right now.
As we move left to right in building the sessionization pipeline, it starts with an ingestion job, where we’re doing a streaming join, and we process it out to Delta and write it to S3. In doing that, we turn these small files into bigger files with OPTIMIZE, allowing us to set up a more efficient process for job two, where the really complex sort-order functions happen. In the past when we’ve done this, we had a lot of complexity: we had to break this job up into concurrent runs. Delta eliminated those workarounds, so that we could have one consistent processing job that runs through and writes this data out to S3, and then we run enrichment processes and optimizations again to write this data out to our final S3 bucket, allowing us to empower our organizations to use this data to run analytics, develop new insights, and build new models. This is truly data processing at scale.
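The compaction step above, Delta's OPTIMIZE, turns many small files into fewer large ones so downstream jobs read less metadata and fewer objects. The concept can be sketched as simple greedy bin-packing; this is only an illustration of the idea, not Delta's actual algorithm, and the 1 GB target is an assumed figure.

```python
# Assumed target output file size, in MB (Delta's default target is ~1 GB).
TARGET_FILE_MB = 1024

def compact(file_sizes_mb):
    """Greedily group small files into output bins of at most TARGET_FILE_MB.

    Returns a list of bins, each bin being the list of input file sizes
    that would be rewritten together into one larger file.
    """
    bins, current, current_size = [], [], 0
    for size in file_sizes_mb:
        if current and current_size + size > TARGET_FILE_MB:
            bins.append(current)          # bin is full: start a new one
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        bins.append(current)              # flush the last partial bin
    return bins

small_files = [100] * 20   # twenty 100 MB files from a streaming ingestion job
compacted = compact(small_files)
```

Twenty 100 MB files collapse into two roughly 1 GB outputs, which is why running OPTIMIZE between the ingestion job and the sort-heavy second job pays off.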
So, looking at the foundation of our data platform, we have our data lake, which is truly the foundation. This allows all of the teams to leverage data in a way that’s truly differentiated and empowers them to solve more interesting problems. But without this foundational layer, none of this is really possible.
Our analytics teams, our data science teams, and our A/B testing platforms all depend on this data. And we orchestrate this in a way where all of this data is common to all of these systems. Analytics and consistent measurement are critical to a data driven journey. These metrics need to be aligned to goals. And we set up these goals and metrics as targets for our organizations to drive continuous improvement and change. Making sure the people with the question are empowered to gain insights and answers independently with self-service tools is critical to data driven decisions at scale. Otherwise, you rely on traditional analytical models, which are overly dependent on people and ticketing queues to build metrics and reports.
In these models, you spend little time analyzing. And we want to shift to a more democratized model that focuses on self-service. This optimizes for time spent analyzing and gaining new insight, which drives requests for new data to be integrated into our system. The less time we spend building reports and analytics, the more time people have to create them themselves. This is a really big shift, and it’s really important to empowering your organization to be data driven, as you don’t want to be dependent on people to scale your ability to gain answers. And this really creates opportunities to discover and create new ideas and evaluate them.
And as we move into evaluation, we wanna think about data science teams and the problems that they focus on.
Something that we talk about when we talk about data science, we typically talk about the amount of time that data scientists have to spend on finding, cleaning, and prepping data.
And when you think about data science, you really put them right in between the problem set, and the data.
And what you really wanna do is empower the data scientists to be working on continuously evaluating data and the problem at hand. This enables us to solve customer problems and improve their experiences, giving them a massive data lake with incredibly rich data sets that can be used independently, or combined, to better understand the context of the problem and work with teams to build new solutions. The Databricks environment allows data scientists to do their job and interact with data more efficiently. The platform enables them to collaborate in a shared work environment and also allows them to perform analytics using their favorite languages and libraries and run against small and enormous scale. Connecting data scientists with this vast array of data and powerful systems is really the key to solving some of our toughest problems. Running experiments is the best way to get causation with high probability, assuming a properly designed experiment. Running online experiments is really powerful. And it’s a practice that we’ve been performing for centuries and that we can now take advantage of with the advancements of technology. I have a quote from Thomas Edison that I’d like to read: “This invention, like most inventions, was accomplished with men guided largely by their common sense and their past experiences, taking advantage of whatever knowledge and news should come their way, willing to try many things that didn’t work, but knowing just how to learn from failures and build up gradually a basis of facts, observations, and insights that allow the occasional lucky guess. Some would call it inspiration to affect success.” It’s so amazing and so simple, and I wonder why not many people do it today. It’s something that we as a company are really leaning into and doing more of. What’s critical to running experiments, which many people call A/B tests, is making sure that we have a common data platform to evaluate the experiments that we’re running.
These are simply set up as a split into control and treatment environments. And when they get to our analytical engine that analyzes the results, we wanna make sure that this data is common: that it’s the same data that we use for all of our analytics, that our data scientists are using to develop their models, and, lastly, that we use to evaluate our A/B tests. Looking at this all together, we have a super rich data platform where our data engineers are using this data to build really rich data sets, our analytics teams and tools are leveraging this data to gain new insights, and our data scientists are developing new models, gaining complex insights, and running different analyses. And then, lastly, we can use this same data to run A/B tests to get consistent evaluation and understand the value of our ideas before we launch them. And moving from data, to insights, to actions, our product organization can make big bets based on data. Our data science and engineering teams can use this data to improve the products and run A/B tests, and our applied AI team can use this same data to improve their models and make better products for our customers.
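A common way to implement the control/treatment split is deterministic hashing: hash the user ID together with the experiment name and bucket on the result, so the same user always lands in the same variant and the analytical engine can evaluate results consistently. This is a generic sketch of that standard technique, not Comcast's actual assignment code; the experiment name and 50/50 split below are hypothetical.

```python
import hashlib

def assign_variant(user_id: str, experiment: str, treatment_pct: int = 50) -> str:
    """Deterministically assign a user to 'control' or 'treatment'.

    Hashing experiment name + user ID means assignment is stable across
    sessions and independent across experiments, with no lookup table.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100          # uniform bucket in [0, 100)
    return "treatment" if bucket < treatment_pct else "control"

# Hypothetical experiment name, for illustration only.
variant = assign_variant("user-12345", "flex-home-screen-v2")
```

Because assignment is a pure function of the inputs, every downstream system on the common data platform can recompute which variant any event belonged to, instead of joining against a separately maintained assignment log.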
Wrapping all this up, it allows us to unleash innovation with data. Data helps build the basis of facts, observations, and understandings, which help us improve the decisions we make and measure the impact of our decisions. In this feedback loop, using data yields feedback and provides a direct line to the customer’s voice. And if you invest the time to listen, this information will help you create amazing products and customer experiences.
Jim Forsythe leads the Product Analytics & Behavior Science (PABS) team for the Technology, Product and Xperience organization at Comcast, where he is responsible for transforming bits of data into consumable, productive insights. Jim's days are spent building data pipelines, researching new ideas, developing key metrics, and informing data-driven decision making. Prior to Comcast, Jim led data science teams for a Fortune 500 management consulting firm. He specialized in large-scale product analytics, cloud platforms, user behavior research, and retention modeling for new product initiatives.