Accelerate Your Data Engineering Pipeline Development on Databricks and Govern Your Delta Lakehouse

May 26, 2021 03:50 PM (PT)

Modern enterprises depend on trusted data for analytics, AI and data science to drive deeper insights and business value. Now with a cloud-native approach for building data pipelines into and within Delta, customers can use Informatica’s leading Intelligent Data Management Cloud platform to leverage Databricks’ performance, scalability and reliability to accelerate innovation for their business. In this session we will discuss how customers can modernize their on-premises Data Lake and Data Warehouse to a modern architecture centered on Databricks Delta that not only increases developer productivity but also enables data governance for data science and analytics.

In this session watch:
Rodrigo Sanchez Bredee, Sr. Director, Strategic Ecosystems, Informatica



Rodrigo Sanchez…: Hello, and welcome to today’s session. My name is Rodrigo Sanchez Bredee. I am the Senior Director of Strategic Ecosystems at Informatica, and I manage our global partnership with Databricks. Today, we will spend some time discussing how Informatica and Databricks complement each other to help enterprises accelerate the development of data engineering pipelines, as well as provide a highly governed set of solutions that can dramatically improve business outcomes while also reducing risk and increasing agility. In other words, we will talk about how you unlock the potential of the Databricks Lakehouse by leveraging Informatica data governance. One minor housekeeping item before we dive in: please do type your questions into the Q&A box provided. We will either review them in writing as we go along or answer them at the end of the session. Today’s session covers quite a bit of content, so please make sure that you enter your questions as they arise in the presentation, and we’ll try to respond to as many of them as we can.
For the agenda today, we’re going to talk a little bit about some of the barriers that we’re seeing in our joint customer base to adopting AI, machine learning, and analytics projects, discuss what we mean by modernization towards cloud, and discuss our partnership with Databricks. We’ll then talk about some of the critical success factors for your data science and AI and ML projects, and finally discuss a specific customer story. One last thing before I jump into it is that I will be talking about things that we plan to do in the future, releases and things like that. And of course, the hardest thing about the future is to predict it.
And so just bear in mind that things may change as circumstances change, et cetera. This is just meant to be an overview for informational purposes only. So why are we all here? We’re all here because we know very well that the lifeblood of analytics, of AI and ML, is data. This is the Data + AI Summit, so I’m covering hallowed ground. Except that, as you guys know, there are some interesting figures that you see here in the chart. The volumes of data, of course, are growing dramatically; a 60% [inaudible] growth rate is an incredible number. We also see, of course, that with that massive fire hose of data coming your way, a lot of it just goes unused, right?
It’s what Splunk calls dark data, which is data that’s being produced, there are some insights in it somewhere, but it’s just going unused because of the sheer volume of it, because it’s hard to get at, because it’s siloed, or because it’s in different formats, et cetera. And that also points to a slightly more insidious problem, which is that you may not know where your sensitive data is actually stored, which itself can be a humongous problem. If you think about it, half of the companies surveyed are saying, “I don’t even know where my sensitive data is.” That’s actually a staggering statistic. And also, a lot of companies report that they are confronted by issues of just data quality and complexity, right? So yes, I have all this data. I really want to use it, but it turns out that when I set my analysts on it, when I set my data scientists on it, they actually spend the majority of their time combing through it, aligning it to the use case that I need, making sure that it’s of the proper quality, that it’s coherent, et cetera. And that’s a lot of wasted time.
And finally, at the same time, you have this sort of countervailing force where almost every modern company out there wants to make sure that they have the ability to inject more and more data into how they make decisions. So, 80% of corporations, as I mentioned before, are saying, “Hey, we actually want to enable as many employees and stakeholders of our company as possible to have access to this data, to inject the data into how they make decisions, how they do their daily work.” When you pair that with all the other challenges here, it creates a very interesting, very difficult [inaudible] to solve.
And so, a lot of companies, when they are faced with these problems, are deciding to go to cloud. Essentially they want to modernize their infrastructure to enable them to resolve all the issues that we talked about before. Right? So you want to have the ability to do your data engineering, manage your data warehouse workloads, move all your data into cloud-based data lakes, integrate your applications with cloud data, and have them run in the cloud. And ultimately, also make sure that you have your data science programs running in the cloud. The common thread across all of these, of course, is integration. And together, Informatica and Databricks provide you best-in-class capabilities to service all these different workloads, in the cloud, across all the [inaudible 00:04:40] infrastructure providers. So you want to have the ability to very quickly and easily ingest data into the Delta Lake, from as many sources as possible.
Informatica has made significant investments over the last few years to make sure that you have the ability to consume data from things like streaming sources, so your IoT clouds, your weblogs, your REST APIs. You also have the ability to consume data from databases: you want to be able to create a snapshot of the source database, load it into Delta, and then drive incremental changes and have those reflected in your logical landing areas inside Delta. And also, of course, the ability to move files into Delta. We see a lot of legacy applications that throw off things like CSV files that you want to be able to very quickly and reliably load into Delta, again into these logical landing areas that a lot of customers use for data loading. And once the data is in Delta, of course, you want to accelerate your analytics.
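The file-loading and incremental patterns just described can be sketched in Databricks SQL. The table names, paths, and columns below are hypothetical placeholders; in practice the mass ingestion service generates equivalent work for you, but the underlying Delta operations look roughly like this:

```sql
-- Bulk-load CSV files from a landing area into a Delta table.
COPY INTO sales.orders_bronze
FROM '/mnt/landing/orders/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true');

-- Apply incremental changes captured from the source database
-- (the staged_order_changes table stands in for a CDC feed).
MERGE INTO sales.orders_bronze AS t
USING staged_order_changes AS s
  ON t.order_id = s.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

The `COPY INTO` statement is idempotent, so re-running it only picks up files that have not already been loaded, which is what makes the landing-area pattern reliable for repeated loads.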
So you want to leverage the fact that you have high assurance of the consistency of the data through Delta Lake’s ACID capabilities, as well as schema enforcement and other enhancements, to really drive better analytics on the data that you’ve been loading. And you can also choose to use Informatica as you improve the quality of the data set, as you prepare the data for either BI analytics or for machine learning type workloads, or to make it available for an application to use as reference, et cetera. Both Informatica and Databricks together help you achieve these things. And ultimately, all of this is built on trust. So what you want to have, of course, is a bird’s-eye view, if you will, of everything that is going into your Delta Lake and how the data is actually flowing inside of it.
And we have a very deep integration with Delta through our data catalog, which I’ll talk about in a minute. So what are some of those critical success factors for AI and analytics? You want to start out, of course, with the cloud. Everything that you do has to be cloud ready. Very few companies are investing in on-premises infrastructure; I don’t have to tell you the benefits of cloud, we spoke about that before. But you also want to make sure that you have the ability to do things in a way that doesn’t carry some of the disadvantages of the on-premises world. For example, you want to have the ability to build your data engineering pipelines with as little coding as possible. You also want to make sure that you have as little operational overhead as possible, and more importantly, you want to have no limits on data.
So with Informatica and Databricks together, of course, we provide the codeless UI for you to build your data engineering pipelines. We provide you with the ability to manage the lifecycle of Databricks jobs clusters, and of the upcoming SQL Analytics clusters, so that you don’t have to worry about operating anything in terms of the infrastructure. And ultimately, Databricks brings you no limits on data, right? The wonderful thing about the Databricks Delta Engine, of course, is the ability to have essentially infinite elasticity, the ability to scale your jobs up and down as much as you need, et cetera. And all of that essentially makes you ready for AI and ML workloads as well as advanced analytics. So let’s talk about each one of these elements in detail. At Informatica, we’ve actually been in the cloud for many years; a lot of customers don’t know this. We have a cloud-native data integration product called Cloud Data Integration, which is part of our integration platform as a service, the Informatica Intelligent Data Management Cloud.
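As one sketch of that ephemeral job-cluster pattern, a run submitted through the Databricks Jobs API (`runs/submit`) carries a cluster specification that exists only for the duration of the job; the cluster sizing, Spark version, and notebook path below are placeholder values, not anything specific to the Informatica integration:

```json
{
  "run_name": "mapping-pipeline-run",
  "new_cluster": {
    "spark_version": "9.1.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 4
  },
  "notebook_task": {
    "notebook_path": "/pipelines/load_orders"
  }
}
```

Because the cluster is created for the run and torn down afterwards, there is no standing infrastructure to operate, which is the point being made above.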
And in there, for those of you familiar with Informatica already, you have the ability to design your data flows in the form of mappings, which is the picture you see at the upper right. You build your logic in the form of mappings. Out of the box, we have connectivity to over 200 sources that you can connect to and then move the data into Delta. And one of the future releases is going to include the ability to create your pipelines in that type of visualization and then push them down into data [inaudible]. The other image that you see here is the ability to create workflows, where essentially you run jobs sequentially and you can spin the [inaudible] job clusters up and down as necessary. Of course, none of that applies as much with SQL Analytics.
And what we do essentially is we apply a lot of AI to the problem of how you convert all this business logic into code or into SQL expressions that can be consumed either by the SQL Analytics service in the future, or today via jobs clusters running Scala. And the way our cloud product works, again for those of you not familiar with it, is that we host a multi-tenant environment that acts as the control plane. This is where you do your designs, the box that you see here marked as IICS. This is where we store all your metadata, we have a repository, and we manage all these things on your behalf. And you essentially federate the runtime to whatever makes the most sense for the use case, right?
So say you have some legacy on-premises infrastructure that you’re trying to connect to, and you want to move that data up into the Delta Lake. That runtime happens over there, in the secure agent that you see listed here in the lower left, which itself is a microservices architecture. So all the different Informatica services run there locally, depending on the use case. If all you’re trying to do is move the data from on premises to the Delta Lake, we run what we call the mass ingestion service. If you’re trying to do some more traditional ETL and maybe move the data [inaudible] factors into Delta, we can do that as well and have the jobs run in the secure agent. But your data may also be residing in the upper left on some cloud infrastructure, Salesforce, or maybe you have a [inaudible], and you want to have the agent live locally next to those applications that are throwing off all this data and then move the data into the Delta Lake using native connectivity.
And again, what we are announcing, and will have available for customers in the second half of the year, is the ability to also exercise pushdown optimization. The idea is that you build your data pipelines using Informatica, and then you push the work down: you essentially execute the workloads using the Delta Engine if the data is already in Delta. And from there, of course, the data can go to either BI type use cases or machine learning, et cetera. The foundation for all of this is our Enterprise Data Catalog, which gives you discoverability and visualization of the data as it’s flowing, and which of course also enables things like data governance, data protection, and security for your data.
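To make the pushdown idea concrete, a mapping that joins two sources and aggregates might compile down to a single SQL statement executed by the Delta Engine. The tables and columns here are hypothetical, purely to illustrate the shape of the generated work:

```sql
-- Hypothetical pushdown of a two-source mapping:
-- join orders to customers, then aggregate revenue per region.
INSERT INTO sales.revenue_by_region
SELECT c.region,
       SUM(o.amount) AS total_revenue
FROM sales.orders    AS o
JOIN sales.customers AS c
  ON o.customer_id = c.customer_id
GROUP BY c.region;
```

The point of pushdown is that the data never leaves Delta: the join and aggregation run where the data lives, and only the mapping metadata stays with Informatica.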
So, one of the advantages of Informatica, of course, is that you don’t have to code anything. What this chart at the upper left shows is a very typical use case. You can see two sources flowing to a target, with some joining and some aggregation. And you can compare that to the equivalent SQL query below, or to writing the equivalent Scala code there on the right. For one, the mapping is a lot easier to debug. It’s a lot easier to visualize. And more importantly, just as Databricks is expanding their clients from Scala into SQL, guess what? Your mappings will work with both, depending on what you’re trying to achieve, right? That’s the vision. The next step, of course, is metadata management. As I mentioned before, that’s the foundation. It’s what allows you to discover the data end to end: what’s inside of Delta, as well as upstream and downstream of Delta. Think of it like a Google search for your data.
It’s not quite that; it’s actually quite a bit more than that. Essentially, we go out there and pre-index all the data that you have, not just the technical metadata, but also things like affinity or similarity between datasets. We also profile the data and try to give you quality scores on the data. All this information becomes available in the Catalog, and there your different data consumers can collaborate and help each other in terms of understanding, for example, the suitability of a certain data set for a certain use case, et cetera. So it’s both a way to have visibility as well as a way to enable collaboration between data teams.
The next step after that, of course, is data governance. On top of this data catalog, your data stewards, your data officers, the people in the company that are worried about your data strategy, can define their data management policies, can define the business glossaries for the company, and make sure that they establish how they want the data to look. They can then convert that into data quality rules that also execute locally on Databricks, so you have both the bottom-up view provided by the catalog, as well as the top-down mechanism of making sure that you have your data governance practices in place.
And the other element that glues it all together, of course, is our data quality. You apply your data quality rules consistently: you can essentially visualize the quality of your data and then implement rules, et cetera. And guess what? With our latest release, all of this can actually run natively inside Databricks. So again, your data doesn’t have to exit the data lake; the rules can just run locally as necessary. And so we differentiate across all stages, right? First, find and discover the data that you need using the catalog, et cetera. Then move the data, no matter the source; out of the box we have [inaudible], native connectivity to over 200 sources. You have the ability to do both models: mass ingestion, where you’re just moving the data in high volumes from A to B, but also potentially transforming the data in a more traditional ETL flow and then loading that data into Databricks. And you can do that now in a cloud-native form.
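To give a feel for what a data quality rule checks, here is a minimal standalone Python sketch, not Informatica’s engine; the rule names, thresholds, and sample records are all hypothetical. Two common rule types are completeness (is the field populated?) and validity (does the value match an expected pattern?):

```python
import re

def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def validity(records, field, pattern):
    """Fraction of records whose `field` fully matches `pattern`."""
    rx = re.compile(pattern)
    return sum(1 for r in records if rx.fullmatch(str(r.get(field, "")))) / len(records)

# Hypothetical sample batch of customer records.
customers = [
    {"id": "C001", "email": "a@example.com"},
    {"id": "C002", "email": ""},
    {"id": "C003", "email": "not-an-email"},
    {"id": "C004", "email": "b@example.com"},
]

scores = {
    "email_completeness": completeness(customers, "email"),  # 3 of 4 populated -> 0.75
    "email_validity": validity(customers, "email", r"[^@\s]+@[^@\s]+\.[^@\s]+"),  # 2 of 4 valid -> 0.5
}
print(scores)
```

In a governed setup, scores like these feed back into the catalog so consumers can judge a dataset’s fitness for a use case before building on it.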
And that essentially allows you to also prepare and enrich the data before you start modeling, without having to move the data outside of Delta. So you can keep the data and the work inside of Delta, but still, your designs and the intelligence, the business models, they all stay in your metadata layer with Informatica. And ultimately, this results in massive productivity. Because if you don’t have to reinvent the wheel every time, and you have this very scalable mechanism of building your data pipelines and then having them execute on the massively scalable engine of Databricks, you really get the best of both worlds. And ultimately, again, you’re able to go fully serverless by using the data pipeline processing from Databricks.
We do have examples of customers that have talked publicly about their ability to do this. This is a company that actually started their journey on premises: [inaudible], Takeda Pharmaceuticals. They had gone pretty deep down the journey, and they realized that, “You know what, there’s a lot of management overhead for us in this. There’s a lot to be won from parallel processing and the ELT model, and from having a data lake, but it’s really too complex. We need to move one step up.” So they decided to implement Databricks. And they have been successfully using Databricks at very high volumes as a unified data analytics platform. They do use notebooks, particularly for data science type use cases, but they really use Informatica when it comes to building those data engineering pipelines that feed into the Delta Lake.
And so you get the best of both worlds: the robustness, the repeatability, the auditability of building data pipelines using Informatica, as well as the high agility of building data science use cases using notebooks. And all of that, of course, is visible through the data catalog. So what we want to do, just to close, is again for Informatica and Databricks to help you accelerate your development of data engineering pipelines, as well as have complete data governance, so that ultimately you have the best of both worlds: high agility as well as high control. I would invite you to get started and find out more about how we’re going to help accelerate your initiatives. And I would love to get your feedback. Please, don’t forget to rate and review the session. And let’s see if I have any leftover questions from the Q&A panel that I can answer for you. Thank you very much for your attention, and have a good rest of your day. Bye.

Rodrigo Sanchez Bredee

Currently the Sr. Director of Strategic Ecosystems, Big Data, he is responsible for driving strategic alignment between Informatica and Databricks, among others. Previously Sr. Director, Product Manag...