Takeda’s Plasma-Derived Therapies (PDT) business unit has recently embarked on a project to use Spark Streaming on Databricks to empower how it delivers value to its plasma donation centers. As patients come in and interface with our clinics, we store and track all of the patient interactions in real time and deliver outputs and results based on those interactions. The problem with our existing architecture is that it is very expensive to maintain and has an unsustainable number of failure points. Spark Streaming is essential for this use case because it allows for a more robust ETL pipeline: with it, we are able to replace our existing ETL processes (based on Lambdas, Step Functions, triggered jobs, etc.) with a purely stream-driven architecture.
Data is brought into our S3 raw layer as large sets of CSV files through AWS DMS and Informatica IICS, which move data from on-prem systems into our cloud layer. A stream currently running picks these raw files up and merges them into Delta tables established in the bronze/stage layer, with AWS Glue serving as the metadata provider for all of these operations. From the stage layer, another set of streams uses the stage Delta tables as their source, transforming and conducting stream-to-stream lookups before writing the enriched records into RDS (silver/prod layer). Once the data has been merged into RDS, a DMS task lifts it back into S3 as CSV files. A small intermediary stream merges these CSV files into corresponding Delta tables, which feed our gold/analytic streams. The on-prem systems are able to speak to the silver layer, allowing for the near real-time latency that our patient care centers require.
Arnav Chaudhary: Hi all. It’s a pleasure to be presenting in front of this audience virtually today. I’m Arnav Chaudhary and I work as a Data Science Product Manager at Takeda Pharmaceuticals.
I’m joined here today with my colleagues, Jonathan Yee and Jeff Cubeta, to describe how Takeda has been able to empower data and analytics within our plasma-derived therapies business unit through the use of Databricks and specifically Spark Structured Streaming on Databricks.
For background, Takeda is a global research and development biopharma leader, based out of Japan, with a significant presence across the United States and Europe. Takeda has its enterprise rooted in a deep commitment to delivering life-transforming treatments, with a specific focus on oncology, hematology and rare diseases, neuroscience, gastroenterology, vaccines, and plasma-derived therapies.
One of the core initiatives at an enterprise level for Takeda is digital transformation with a strong focus on migration to Cloud, combined with fostering a solid foundation for data and analytics acceleration.
So, to kind of achieve this goal, Takeda has built a global data platform known as the Enterprise Data Backbone or the EDB for short. This cloud native platform exists to combine Takeda’s global data assets into a single source of truth that enables our analytics community to tackle the difficult problems that they face through a suite of advanced tools.
One of the core tools that Takeda has deployed is Databricks on AWS, and it’s used heavily at Takeda across a number of domains, including data ingestion, data processing, and advanced analytics. Databricks as a tool enables users to use Python, R, and Spark for standard machine learning use cases. Databricks also enables rapid business intelligence use cases through robust connections to BI tools like Tableau, Power BI, and Qlik. Moreover, Databricks allows cloud-native ETL at scale through our integration with Informatica and, third, through the Spark-native platform that Databricks offers. Specifically for this use case, we’ll be covering how we were able to use Spark Structured Streaming on Databricks to achieve our needs.
We have deployed Databricks across three primary regions globally, the U.S., Europe, and Japan, where we host a dev/test and a production workspace in each of these regions. We also maintain a specialized Databricks instance for our ongoing collaboration with MIT to enable their research teams to work on bioinformatic use cases against [inaudible]. The EDB has over 600 monthly active users leveraging over 200,000 DBU hours of compute per month. We have over 50 validated schemas exposed in our data catalog, with hundreds of qualified tables for our 15+ analytics teams across our global data community.
With this quick overview of Databricks at Takeda complete, I’ll pass the mic over to Jonathan Yee to discuss how the use case at hand specifically empowered analytics acceleration within our PDT business unit.
Jonathan E. Yee: Great, thanks Arnav.
Hello everyone. My name is Jonathan, from EY; I’m an executive in our data and analytics practice. Next I’d like to provide you with some context around our PDT program. PDT stands for Takeda’s plasma-derived therapies business. On this slide, I’m going to give a high-level review of how EY, Algernon, and Takeda have collaborated to drive business insights through our PDT analytics program.
As you can see on the left side, we have four high-level business objectives that we set out to solve for. One, increase access to a greater volume of plasma donors. Two, drive improved plasma yield. Three, reduce cost per liter. And four, harvest the value of PDT’s data assets.
Shown on the right are the expected business outcomes. One is to increase share of the donor market to reduce cost per liter. Two is to increase yield by improving retention and the conversion funnel. Three is to reduce manual processes through automation to improve efficiency. And last, to improve upon the data analytics and process layers for the PDT organization.
On the next slide, I’ll be providing an overview of the challenges and where we are headed in the PDT program. As you can see on the left side here, the previous challenges with PDT data included over 150 disparate data sources, lots of manual work, a lack of real-time information, and reactive decision-making.
So what did we set out to build with our program? We built upon an AWS cloud foundation powered by Databricks for pipelines, with Takeda’s EDB or Enterprise Data Backbone, which Arnav was just telling us about. We developed key business insights using Power BI, and our solution supports analytics, operational use, and other product data needs.
On the right side here, you can see our future-state benefits, and those include consolidated data, near real-time data, and a data lake to store large volumes of information. It enables data science capabilities and it reduces manual effort. So next I’m going to pass the mic over to Jeff Cubeta from Algernon, who will lead us through the technical portion of our content.
Jeff Cubeta: Thanks very much, John and Arnav.
As was described, it’s my opportunity now to describe the technical grit of how we accomplished what we’ve described earlier. The first key thing we did when we came to the project was sit with the users of the current Takeda data structure. We wanted to know what the pain points were, what hurt and what was not good in their day-to-day operations.
We immediately identified three key things. First, data isolation: data siloing, as well as data fragmentation, with the same data entities stored under multiple schemas with multiple keying structures. Second, the latency of the analytics: the data sat in a traditional data warehouse and was processed by nightly batch jobs, which simply was not fast enough to fuel the operational needs of such a diverse and fast-moving operational environment. The final one was the narrow audience.
So while there were beautiful and deep insights and really powerful analytic capacities, the ability to deliver those to the broadest possible audience was narrow, and that was basically limiting the availability of the data and limiting the ability of others to operationalize and act on the data.
So with these three pain points in place, we said, we need to find a way that’s going to holistically address all of them. Specifically, we needed to find a way that was foundational, a way that could lay a ground base upon which others could build. We knew that the ground base had to be extensible and it had to be accessible.
In pursuit of that, we accomplished the bottom three things [inaudible], what we consider our key design objectives, goals, and ultimately products: the unified data schema, the lakehouse model, and the use of real-time data. We’ll step through each of these in turn, but these three components are what we feel powered this engine, and what powers the PDT analytics program as it moves forward into a new generation of product and drive.
The first thing we want to talk about is the data schema. You can see here that we’ve got this mock system that we’ve set up. In truth, there are four enterprise data systems at Takeda spread out across, as we’ve mentioned, over 150 operational collection centers. In our example here, we have the collection center system, an individual operational database at the local clinic, as well as the donor registration portal, a website and online presence that everyone can log in to, to register themselves or to show their interest: name, address, et cetera.
The problem though is that you now have the donor, a single person, present in two different places. You have them when they’re registered and you have them when they show up at the center for their donations, which means now that you’ve got a personality crisis, identity crisis. You have two representations of the same person, and this is disruptive to analytics. It’s disruptive to your ability to keep track of people and it’s disruptive to your ability to drive behavior and operations out of that data.
So we needed to splice these rows together, and not just splice in a one-to-one relationship; we had to splice in a one-to-many, in fact a many-to-many, relationship. As you can see here, the donor demographic table is derived from the TBL donor table, as is the donor table in our hub layer. That is because, as these applications grew organically and naturally, data entities began to co-mingle in SQL tables.
When we dug in and really understood the business use case for the data, we realized that we needed to split them apart, that we needed to have a fully structured schema that allowed us to identify all these entities, whether they were business related, operationally related, or even analytically related, and that’s what we’ve done here. You see that we fork and join the tables, and in that way we can create an arbitrary number of future-state objects from an arbitrary number of original-state objects.
Looking at this, you may be inclined to think: good gracious, the code base here must be gargantuan! You have 200+ source tables, and each one is in a many-to-many relationship to other places. The code itself must be incredibly complicated.
In fact it’s not, and that’s because at the very beginning we decided that we were going to have a configuration-driven process. That is, the ETL pipe would remain constant, but we would modify the config objects to cause the engine to behave differently based on what data it is processing. So you see here, these are some snippets out of an example config object. We’ve made it up, it’s about office chairs, but the concept remains the same.
We have our stereotyped transformations, the merging, how we persist this forward into our data structure, and our enrichments, our lookups, how we find data from other parts of our structure to bring it all together.
This is really cool because you don’t need to know Python to write this. You just need to be able to type JSON into a schema-enforced editor, which means that we’ve been able to effectively decouple the data exploration and data development from the pipeline development. We can keep them separate, which means that both groups can move faster. It also accomplishes one of our main objectives of [inaudible]. From a foundational perspective, this is easy to extend and to grow on.
You can see now how this would apply, each one of these arrows representing one discrete config, which allows us to process this data from arbitrary sources into arbitrary targets.
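The configuration-driven engine described here can be sketched in plain Python. This is an illustrative mock, not the actual pipeline code: field names such as merge_keys and enrichments are invented for the office-chair example, but the idea is the same, one constant engine whose merge and lookup behavior is driven entirely by a config object.

```python
# A minimal pure-Python sketch of the config-driven idea: the engine code stays
# constant, and a JSON-style config object tells it how to merge and enrich.
# All field names (merge_keys, enrichments, etc.) are illustrative, not the
# actual schema.

CONFIG = {
    "target": "office_chairs",
    "merge_keys": ["chair_id"],           # columns that identify a row in the target
    "enrichments": [                      # lookups against other tables
        {"lookup_table": "vendors", "on": "vendor_id", "bring": ["vendor_name"]},
    ],
}

def apply_config(batch, target, lookups, config):
    """Merge a batch of records into `target` and enrich via `lookups`,
    driven entirely by `config` -- no per-table code."""
    key = tuple(config["merge_keys"])
    index = {tuple(row[k] for k in key): i for i, row in enumerate(target)}
    for row in batch:
        row = dict(row)
        # Enrichment: pull named columns from each configured lookup table.
        for e in config["enrichments"]:
            match = lookups[e["lookup_table"]].get(row[e["on"]], {})
            for col in e["bring"]:
                row[col] = match.get(col)
        # Merge: update in place if the key exists, otherwise insert.
        k = tuple(row[c] for c in key)
        if k in index:
            target[index[k]] = row
        else:
            index[k] = len(target)
            target.append(row)
    return target
```

Adding a new source then means authoring another config object rather than writing new pipeline code, which is what keeps the code base small despite the many-to-many table mappings.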
We talked about the lakehouse model. For those who are not familiar with the term, it is a data warehouse, with its repeatability, structure, and querying capacity, but run on a data lake, allowing us to leverage the power, low cost, and extensibility of a traditional data lake.
You see here that this follows a pattern that is probably familiar to most of our audience. We have our various layers moving through, we have our sources on the left-hand side there, and our consumers here on the right. A couple of interesting points that I feel are worth bringing up: our uniform ingestion platform. Again, wanting to decouple and make our process extensible, we decided that we were going to remove the streams’ dependency on any specific ETL or extraction technology.
So whether it’s Informatica, whether it’s DMS, whether you’re making HTTP calls through Postman, whether it’s manual file drops.
So long as it can get into the landing zone, the config objects will find the files and transmit them through the pipe. Again: decoupling, allowing for extensibility. By putting it in S3, we improve the accessibility of the data; S3 is sort of a universal zone, everyone can get there. Structured streams between the layers allow us to implement this in real time. A neat feature here is that we’re serving the data in a couple of different places, because we have different use cases and different needs. So for example, our data scientists, our model developers up there at the top, they need to have the data in bulk. They need the full terabyte of data and they want it in S3; they want to know that Spark clusters can hit it repeatedly and predictably to drive these advanced analytics.
But at the same time, for the exact same data, we have people who don’t care about the whole show; they want row-level picks. They want the ability to take a single entry and act on it. You see those actors along the bottom here, the operational executives, the center managers; they only care about the data relative to their operational center.
Marketing operators are concerned about very specific parts of the data. And the donors, this is really cool: we have these consumers ringing around our data because we’ve improved its accessibility. Everyone at Takeda is able to gain these insights, including the customer, the client; the donors themselves are able to get to their data and to make good, informed decisions based on it.
“When is the best time for me to come in for an appointment?” “Are there promotions I need to be taking care of?” These are things that weren’t possible before; by standardizing and improving the accessibility of our data, they are now possible.
We’re able to drive dashboards like you see in the corner, drive web applications, chatbots, blockchain. By making the platform open and accessible, we’ve been able to accomplish these things. We’ve also been able to retain legacy tools, such as SQL-based queries and SQL-based connectors for Power BI or other BI tools. And it’s extensible. Want to serve it out of a different layer, into a different serving technology? “The database was great, but now we are all about NoSQL.”
That’s fine, we can serve into NoSQL the exact same way we can serve into RDS, and that means that other developers can take what we have and run with it, which is the whole point of laying a solid foundation.
So this is the part where we talk about how we got this data to be real time, how we got from batch jobs all the way to actual real-time data. Here we have an example. This is not a true case, but it’s representative of something we do have: our payment systems. The donors are compensated for their time and for their donation, and that’s a proper enterprise-grade system.
You have users interacting with it, making payments, filling debit cards, transactions processing at the bank, and all of that is live in the payment application, in its data stores. Through a combination of ETL tools and our Spark ingestion into the landing zone, we get the CDC rows. Now this is really cool, because we actually get the full CDC: we get the transaction, the whole row, and all the things that happened to it.
You’ll notice that we have this main line through the middle, which is append-only. That’s because we’re preserving the full transactional history of each row, so that we can replay it and so that we can very quickly roll back if we need to.
But we don’t want to, for example, join against the entire transactional history of every row, as we see down in the bottom right; that’s not performant. Consumers of the data, when they query it, don’t want to see the full history of the data. Donors don’t care what the payment looked like in the past; it’s what is the state of it now? And so we apply the CDC into what we’ve tagged here as our serving layers, into the SQL server and the Delta tables. This allows us to preserve, with ease, the full transactional history while also presenting the data in the way that we want it.
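The append-only history and the applied serving view can be illustrated with a few lines of plain Python. The operation codes (I/U/D) and field names below are stand-ins, not the actual DMS payload format; the point is only the replay semantics: the latest operation per key wins, and deletes drop the row from the serving view.

```python
# A pure-Python illustration of collapsing an append-only CDC log into the
# "current state" view we serve. Op codes I (insert), U (update), D (delete)
# and the field names are illustrative stand-ins.

def current_state(cdc_log):
    """Replay CDC events in order; the last operation per key wins,
    and deletes remove the row from the serving view."""
    state = {}
    for event in cdc_log:                 # events are ordered oldest -> newest
        key, op = event["id"], event["op"]
        if op == "D":
            state.pop(key, None)          # delete: drop the row
        else:                             # insert or update: overwrite the row
            state[key] = {k: v for k, v in event.items() if k != "op"}
    return state

log = [
    {"op": "I", "id": 1, "amount": 25},
    {"op": "U", "id": 1, "amount": 40},   # donor payment corrected
    {"op": "I", "id": 2, "amount": 30},
    {"op": "D", "id": 2},                 # payment voided
]
# current_state(log) keeps only id 1, with the corrected amount
```

The full log stays in the append-only middle line for replay and rollback; only the collapsed view lands in the serving layers.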
We also have this ability to serve into arbitrary locations, as we’ve mentioned before, and that’s because we have, in a sort of agnostic way, the full transaction. So let’s dive into some code and we can show you exactly how we accomplish this sort of forking of the streams.
So we use this foreachBatch method to fork and serve off of Delta tables. What foreachBatch does is, on a data stream, which is unbounded both forwards and backwards, it micro-batches the stream as it processes it, and each of those micro-batches gets handed to our discrete function. Within that function, we can then diversify and sink to wherever we want to go. Because we are applying CDC, we’re functionally upserting as well as deleting.
We can perform these upserts against our target Delta table. You’ll see on the right-hand side how we build this merge object, and we are able to say: if it is a [inaudible], we want to do this; if it’s not, we want to do that.
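In Spark, the forking happens inside the function handed to DataStreamWriter.foreachBatch. A runnable Spark job needs a cluster, so here is a pure-Python sketch of just the control flow, with stand-in sinks in place of the real Delta MERGE and JDBC upsert; all names are illustrative.

```python
# Sketch of the foreachBatch forking pattern: one callback receives each
# micro-batch and fans it out to several sinks. In the real pipeline the sinks
# would be a Delta MERGE, a JDBC upsert, and an EventBridge publish; here they
# are stand-in callables so the control flow is visible without a cluster.

def make_batch_handler(sinks):
    """Return a foreachBatch-style callback (rows, batch_id) that forks
    each micro-batch to every registered sink."""
    def handle(batch_rows, batch_id):
        for sink in sinks:
            sink(batch_rows, batch_id)    # each sink decides how to apply the CDC
    return handle

# Stand-in sinks that just record what they were given.
delta_writes, jdbc_writes = [], []
handler = make_batch_handler([
    lambda rows, bid: delta_writes.append((bid, rows)),
    lambda rows, bid: jdbc_writes.append((bid, rows)),
])

# In Spark this would be: stream.writeStream.foreachBatch(handler).start()
handler([{"id": 1, "op": "U"}], 0)
```

The real callback receives a DataFrame and an epoch id rather than a list of dicts, but the shape is the same: one function, many sinks, each applying the micro-batch in its own way.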
We can also do the same thing for our SQL database. We can execute JDBC commands against it, and we can upsert these micro-batches. We can also drive this to EventBridge, which all of a sudden opens up whole avenues of opportunity, because now the data is jumping out of our data lake; it’s actually available for consumption. We have event streams pouring off of our data lake and we can do a whole bunch of stuff with them. We can send them to [inaudible] topics, we can send them to SQS queues, we can drop them into rules, we can configure listeners for them.
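Pushing micro-batches onto EventBridge can be sketched as follows. The source and detail-type strings are invented for illustration; the boto3 put_events call is the real API, but it is guarded behind an optional client since it needs AWS credentials, and entries are chunked because PutEvents accepts at most ten entries per request.

```python
import json

# Hedged sketch of publishing micro-batch rows to EventBridge. "pdt.pipeline"
# and "donor-change" are made-up labels; the boto3 events client and its
# put_events(Entries=...) call are the real API.

def to_entries(rows, source="pdt.pipeline", detail_type="donor-change"):
    """Convert micro-batch rows into EventBridge PutEvents entries."""
    return [
        {
            "Source": source,
            "DetailType": detail_type,
            "Detail": json.dumps(row),    # the full CDC row as the event payload
        }
        for row in rows
    ]

def publish(rows, client=None):
    """Send rows to EventBridge in chunks of 10 (the PutEvents limit).
    Pass client=boto3.client("events") to actually publish."""
    entries = to_entries(rows)
    chunks = [entries[i:i + 10] for i in range(0, len(entries), 10)]
    for chunk in chunks:
        if client is not None:
            client.put_events(Entries=chunk)
    return chunks
```

From there, EventBridge rules can route the events to SQS queues, topics, or listeners, which is what powers the downstream automation like the welcome SMS described next.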
You get a new donor record, a donor inserted for the first time: SMS. We want to send a message out to them: “Hey, we’re so glad you’re here with us. How can we help? We want to make your engagement with our product easy and simple.”
We have the extensibility to make that happen now, because we have these batches streaming out to essentially arbitrary sinks. And I think that’s awesome, because it allows us to take the data from where it was: kind of sequestered, kind of cryptic, not available. We’ve now effectively broken down all those silos and dismantled those barriers, and being able to serve the data through these means is fundamental to making it accessible.
So we can step back here. Now you can see, and you can imagine quite easily, where you would go from here. We have our serving layers, but the development that’s going to happen next is the really exciting stuff. Do we want to serve into new platforms? Do we want to break out of AWS and go other places? We can do that: the data is available, the data is accessible, and the data is operationally relevant, and these are all things that we didn’t have before. We wanted a system that was truly foundational, and we feel, based on what we’ve seen here, that we’ve accomplished that.
I’d like to thank you very much for the opportunity to speak with you. It’s been my honor and privilege. We’ll do our best to answer any questions that you may have at this time, but thank you for this opportunity.