Decoupling Your Legacy On-Premise Data into Operationalized Lakehouses

Qlik’s solutions include Data Integration functionality for ingesting, applying, cataloging, and analyzing data. During this session Qlik will showcase the solutions being leveraged by our mutual client J.B. Hunt to load data into Delta, including Replicate and Compose for Data Lakes, Delta merge statements, cataloging the system, and querying data in Qlik Sense for analytics.

Speakers: Jordan Martz and Ted Orme


– [Ted] Good morning everybody, and welcome to this session sponsored by Qlik. What we’re looking to cover today is how Qlik can help your organization decouple your legacy on-premise data and bring it into an operationalized lakehouse. My name is Ted Orme, I head up the Data Integration Strategy here within Qlik. And I’m also joined by my colleague, Jordan. Jordan, welcome.

– [Jordan] Thanks Ted. My name is Jordan Martz. I’m a Principal Solutions Architect for Qlik.

– [Ted] So what we’re gonna cover today comes in two parts, a head-to-head. I’m gonna go first and give a little bit of an overview of Qlik, how we’re helping organizations, and the vision of bringing their data into modern platforms today. We’ll look at some of the trends and themes that we see in the market and how we bring that together into one unified platform helping our customers today. And then I’m gonna hand over to Jordan, who’s gonna look in more detail at a specific customer example and then at the technology, and show that in more detail. Jordan, do you wanna give a brief summary of what you’re gonna be showing today?

– [Jordan] Yes, we have a customer in the trucking space. An incredible use case around how they’ve taken their mainframe and optimized that source. And as we talk about the different sources and targets that we have transformed, how we can integrate tools like Databricks or Spark-based data lakes is one of the key tenets of our success stories. This success story in particular was unique in the level of automation it was able to take on. I think it was about three months in totality to success, which was incredible. And the overall experience of a platform to support that was another part of that story. So Ted, why don’t we get started now and keep going? Perfect.

– [Ted] No worries. So just a little bit of housekeeping from our side on what we’re showing here today, and then I’d really like to look in a little more detail at who Qlik are and how we’re helping organizations today. So, Qlik have been around a long time, with a large number of customers and global partners delivering on our solutions and offerings today. We’ve been recognized as an industry leader for many years, both in the analytics and the data integration space, and we’re really driving innovation in our solutions today, helping our customers around data and all components of that. You may be aware of Qlik as just a visualization and analytics tool. But really, through the acquisition of Attunity last year and many other components in the data integration space, we’ve brought together these pillars of the organization. So what is it that we do and how do we help organizations? It’s all around the challenge. The challenge around data. Organizations struggle to take what we call actionable data and drive insight from it. The different silos of data that organizations have often make it difficult for business decision makers to feel as though they’ve got the right data to make the decisions that they need. Or often the analytics platforms are not really part of their day-to-day business. And this is where we can embed the right analytics into the business decision-making. So how we’re helping organizations is bringing three components together. These three components are data integration, data analytics, and data literacy, turning that data into business value. And how do we do this? We do this by closing gaps, by taking raw data and making it available. Freeing the data that is locked within silos. And freeing data is only half the story.
Then we need to be able to use that data. Find that data, gain insight into that data, understand the data, before then doing what we call data literacy. That is being able to have a conversation, being able to have an argument with that data to gain insight into it. So data integration, data analytics, and data literacy, bringing all of that together. And what we call this, the drive behind it, is often the phrase digital decoupling. Digital decoupling is a great phrase in that it separates out your legacy platforms, those systems that run the business today, from the modern platforms that need access to that data. And it all starts with a simple building block we call CDC, standing for change data capture. This is where we’ve been called out as the leaders of CDC, the leaders of independent CDC. CDC unlocks your SAP systems. It unlocks your mainframe systems, your Oracle, your SQL Server, your most valuable data that you have today. It unlocks that data through the mechanism of real-time capture. Real-time capture from those legacy platforms to bring the data into the modern analytics platforms that you’re building today. To drive that action, to drive that insight into the data that you have. And here we see three main trends in the industry that are driving the adoption of these newer analytics platforms. The first one is around cloud application development. This has been ongoing for many years. But really it’s about, say you’re a bank and you’ve got your main application locked within a mainframe. You want to be able to build cloud-based applications, so that you and I, and everybody today, have banking applications on our mobile phones. And you want to be able to access real-time data to understand: have I been paid today? Can I pay this bill? Can I run this?
The open banking platforms that have been driving digital transformation for many years rely upon having real-time data out of that legacy on-premise platform. We see this also in the drive with SAP. SAP is holding the most valuable data that organizations have, but you need to be able to build modern applications on that data in cloud infrastructure. The second big trend that we see is the rise of the cloud data warehouse. Again, this methodology is often around failing fast, around being more agile, and about driving different types of business applications on the data, which the traditional on-premise data warehouses just couldn’t solve. Traditional on-premise ETL batch loading, built to solve a single business use case, wasn’t really agile enough to meet the modern cloud analytics that people are looking for today: to just consume the data that they need to answer those business questions faster. And again, real-time data out of your operational systems into these modern cloud applications is critical to drive the value that these platforms are being built for today.
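The CDC pattern described here, capturing row-level changes at the source and replaying them in order on a target, can be illustrated with a toy sketch. This is a minimal illustration of the general technique, with hypothetical event shapes; it is not Qlik Replicate's actual wire format or implementation.

```python
# Toy CDC apply loop: each change event carries an operation code,
# a primary key, and (for inserts/updates) the changed column values.
# Replaying events in commit order keeps the target in sync.
def apply_cdc_events(target, events):
    """Apply INSERT/UPDATE/DELETE events to a dict keyed by primary key."""
    for ev in events:
        op, key, row = ev["op"], ev["key"], ev.get("row")
        if op == "INSERT":
            target[key] = row
        elif op == "UPDATE":
            # Merge changed columns over the existing row.
            target[key] = {**target.get(key, {}), **row}
        elif op == "DELETE":
            target.pop(key, None)
    return target

# The banking example from above: a payment updates a balance in
# real time, a new account appears, and the target reflects both.
accounts = {1: {"balance": 100}}
events = [
    {"op": "UPDATE", "key": 1, "row": {"balance": 80}},   # bill paid
    {"op": "INSERT", "key": 2, "row": {"balance": 500}},  # new account
]
apply_cdc_events(accounts, events)
```

The point of the sketch is the contrast with nightly batch ETL: the target is correct after every event, not once per reload cycle.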

– [Jordan] Ted, do you mind if I hop in here and maybe add some key components that I hear? With data warehouse modernization you often hear components of streaming associated with terms like microservices, and also continuous integration and continuous deployment. As those schemas and sources change, or as the systems and the code change, automation is incredible for continually integrating and deploying that. And that’s a key consideration here. I wanna pass it back to you, Ted, but I thought it was worth noting how that incredibly powerful component only enhances the partnerships between Qlik and vendors like Databricks with other lakehouse technologies.

– [Ted] Exactly, and this is where just delivering the data isn’t enough. You need to be able to deliver the data in a usable format based on the platforms that people are building today. This is where, again, traditional ETL tools often just batched up and landed data, but it wasn’t usable for the people building on it. Even more so in what we call the next generation of data lakes, data lakes as a service. All three of these environments are driven by the adoption of cloud, driven by the abstractions of scalability the cloud can give organizations today. These modern cloud-based data lake environments traditionally have a file-based system underneath. But again, if you’re just delivering a file, landing file after file, you can quickly end up with a swamp in these environments. Whereas bringing the data in, capturing change and making that real-time data usable to the business, whatever the use case may be, is where we’re seeing a lot of innovation today. And really we see these three main use cases blending as well. What I mean by that is: what is your most valuable data within an organization? That mainframe data, that SAP data, that ERP or CRM data is critical. And you don’t want to then just build another silo of data, another silo for a single use case. You want to be able to bring that data to answer multiple use cases. Make data a service across platforms. Bringing together these three main trends, these three main use cases, is where we see organizations working today. They don’t want to create another silo for an individual use case, but they know what the data is that they want. And that is exactly where Qlik Data Integration brings the data, using CDC, change data capture.
Capturing the data out of Oracle and SQL Server and DB2 and the mainframes and SAP, unlocking it and bringing it into the use case. So streaming data in real time out of operational systems, and then making it usable based on the platforms that they’re building today. Streaming, data warehousing, lakes; we’re seeing the combination of this on multiple clouds, or any cloud, before then driving analytics on top of that. Finding the data, using the data with our Qlik Catalog, gaining insight in the analytics across that. So this is the platform. This is how we’re helping organizations, but truly it’s the integration. What we see working with partners like Databricks, who own the infrastructure and the platform itself, is that we can stream that data and land it today. So the combination of these three pillars is where our customers are going to that next level. Bringing together data integration, data analytics, and then data literacy. Being able to have that conversation, to argue with the data, to embed the data into the business processes that organizations are running today. Freeing, finding, understanding, and taking action upon their data. This really is the uniqueness that Qlik have to offer today. And that’s really the bottom line of that nice quote from our work with IDC. The organizations that can get the benefit of bringing this together, and understand the integration of both streaming real-time data and the power of Qlik analytics, are the ones that are gonna be driving more value in their organizations today. So that was an overview of Qlik, of how we’re helping, and of the trends we see as organizations choose best-of-breed products to help them on their journey to cloud, to help them get value out of Spark and Databricks today.
And with that I’d like to pass over to Jordan to give you an understanding of how this is being used in real life. He’ll do a deep dive into a customer, a shared customer of ours today, and then look at the products so you can actually see and feel and understand the benefits we’re bringing to organizations today. So Jordan, over to you.

– [Jordan] Thanks, Ted. When you think about transformations, J.B. Hunt was due for a transformation. They’re one of the largest trucking industry partners on the entire North American continent. When you think about how product gets shipped and managed, there are real-time requirements around the shipping location and locale, the partners, and the products that need to be sourced from different locations and brought together across the supply line. And also the communication of maintaining the maintenance of that system. That system is not only logging individual trucks and how to maintain those trucks, but infrastructure across rail and air and the trucks themselves, in an ecosystem that spans thousands of kilometers. When you’re looking at the overall ecosystem of J.B. Hunt, there’s a focus from their side on those operational concerns. So J.B. Hunt partnered with Qlik, they partnered with Databricks, and they partnered with Microsoft. In this use case, they had a number of different databases. One specifically was the legacy platform of their mainframe that has been running their business for a very long time; those operational systems. When you look at those, they usually had a nightly operation that would then drive activity. That nightly reload cycle then became part of the process of accumulating what would be a standard legacy data warehouse environment. The operational intelligence that came from this type of scenario was that they ended up having an insights team focused really on prediction, not on insight: alerting on actions that were real-time from the data. Those are the aspects of real-time infrastructure that are enabled through a tool like Qlik Replicate. But when you’re supporting this bolted-on reporting, some of the time-intensive tasks have requirements, and the results may not always be accurate or relevant.
So the restrictions of the technologies were often based around the compute time, the sourcing of the data, and manual changes that then changed the behavior of the overall system. What happened was the leadership came to a realization that, now that the tools are available, we have six key components along the supply line that we need to consider in our overall vision. For one, a lot of EDI created some of the visibility, and lack of visibility, that needed to be addressed. Sourcing that from the mainframe, gaining insight in real time, and using data science immediately started a value chain that grew the organizational needs and requirements over time. As you look at the insights of each one of the assets that were then directed, whether shipment requirements or location requirements, telemetry along that supply chain became not only data science enabled, but also integrated into applications. Now analytics are becoming the actionable changes that they’re using to build their business for the 21st century and beyond. J.B. Hunt has been around, I think, over a hundred years at least. And as those businesses change, for many of the different suppliers shipping components across the world, supply chains are changing. Even with the challenges of this year, supply chains have had to continue. And that means managing what is important: not just the supply line physically, but the data supply line. So how do these four components that you’re gonna see today become very relevant for their overall experience? When you think about ingestion flexibility, the types of data that come from the different sources are paramount. The repositories that they write into need to be cost-effective.
But if those systems change the streaming requirements, as Ted noted, the overall warehouse needs to handle it, whether data lake, lakehouse, or warehouse; all of those types of changes are part of that cloud transformation, warehouse transformation, and then data lake transformation. That’s where automation is required for managing the loads, as you monitor for those changes and monitor the overall system. Let’s go to an architectural diagram and talk about what moved and how it operated. There were certain structures for how the location data for shipped materials lived and where it lived. You had cloud-hosted applications in SQL Server, you had on-prem SQL Servers, and you had the mainframe. Replicate unlocked it, supporting that data science ecosystem, and partnered with Databricks structured streaming, one of the core requirements for building a real-time and continuously updated scenario, which then combined the resources from Synapse Analytics with Databricks. They could then serve models that they consume as applications, and also serve their analytics team. The security requirements that they had on the mainframe had to be replicated; that was a part of the system enabled through Microsoft’s powerful cloud components. Further, Databricks enabled the overall transition of streaming data and gave them the scoring that they needed to change their overall pattern and evolve, Synapse…
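Folding the change stream that Replicate lands into a continuously updated Delta table is typically done with a `MERGE INTO` statement against the bronze layer. As a sketch of the kind of statement such automation generates (table, column, and operation-code names here are hypothetical, not J.B. Hunt's actual schema or Compose's exact output):

```python
# Build a Delta Lake MERGE statement that folds a CDC change table
# into a bronze target: deletes remove rows, updates overwrite them,
# and inserts add rows that are not already present.
def build_delta_merge(target, changes, key_cols, value_cols):
    on = " AND ".join(f"t.{c} = s.{c}" for c in key_cols)
    sets = ", ".join(f"t.{c} = s.{c}" for c in value_cols)
    insert_cols = ", ".join(key_cols + value_cols)
    insert_vals = ", ".join(f"s.{c}" for c in key_cols + value_cols)
    return (
        f"MERGE INTO {target} t USING {changes} s ON {on} "
        f"WHEN MATCHED AND s.op = 'D' THEN DELETE "
        f"WHEN MATCHED THEN UPDATE SET {sets} "
        f"WHEN NOT MATCHED AND s.op != 'D' THEN INSERT ({insert_cols}) "
        f"VALUES ({insert_vals})"
    )

# Hypothetical bronze table plus its change (__ct) companion.
sql = build_delta_merge(
    "bronze.orders", "bronze.orders__ct",
    key_cols=["order_id"], value_cols=["status", "updated_at"],
)
```

Generating this per table from the captured metadata, rather than hand-writing it 74 times, is the automation point being made here.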

– [Ted] This is the part where there’s just that fundamental value around what decoupling is all about. The DB2, the mainframe, the production systems are locked in, as they may have their own life cycle. There may be a two, five, ten year plan for migrating or moving on from those platforms. But the value of that data in those modern platforms is critical. And understanding the why of what people are doing and building out those applications really does highlight, on the architectural side, the value of bringing real-time data out of production and building up the ecosystem around Databricks to prove value.

– [Jordan] I just love how you mark that. Because when we talk about storing and processing and maintaining the management of that over time, it’s not as easy as it looks. It’s the automation that becomes really paramount. I think you nailed those points. And what’s important next is we look at the complexity. When you talk about complexity, the first thing you do when you solve any problem is you break it down into smaller pieces, right? Decoupling. And the components that you highlighted, Ted, were these three: the catalog, the Qlik analytics platform, and the Qlik data integration platform. All of that works together. J.B. Hunt was focused on this classic slide you’ve seen from Google: the amount of infrastructure it takes to put together a data lake. Data collection is a big aspect of it. Machine resource management, Databricks is a big part of that. Process and automation, there’s lots of that; that was one of the key tenets that Ted talked about, warehouse automation and system automation. Spark gives you analysis. It gives you feature management. It gives you verification. But between the Qlik components, the Databricks components, the Microsoft infrastructure and monitoring overall, and then getting to machine learning, it took quite a bit to bring that together. The Qlik data integration platform, and Ted, I’ll definitely ask you to chime in here. When we are loading this overall ecosystem, this is what we call our core pillar slide: talking about those key components and servicing the consumption from all of the partners that Qlik works with, and also the other partners on the machine learning side as well as data science.

– [Ted] It really does; this one slide tells the whole story about Qlik data integration. On the left-hand side there are the operational systems that run the organizations, whether J.B. Hunt or banking or finance or insurance, whatever they may be. Oracle, SQL Server, DB2, the mainframe, SAP: unlocking that data, but then delivering it based on the use case. We can read once and write many. So if you’re building out applications in streaming and Kafka, we’ll stream that data to those environments. If you’re building out modern warehousing on Snowflake or Synapse or Google and the like, we can land that data and make it usable. On the modern data lake environment, whether that be Databricks or S3 on AWS and the like, the storage devices, we will land that data and make it usable. It’s critical then to understand the use case. So understanding the drivers here, what are we trying to achieve? To repeat that part: we’re not looking to just create another single silo of data to answer one line of business. Those days have gone. You want that data usable and findable, and that’s where the catalog sits, as a catalog can help the business find the right data for a use case. And what is right for one use case is gonna be right for another. So being able to reuse that data for multiple business lines, that’s where the analytics, AI, machine learning, and data science can all consume that data. This is the end-to-end flow. That really highlights the overall independence of the platform, and that data is an asset. An asset can be consumed by multiple business units within the organization.

– [Jordan] Awesome, Ted. Let’s go show them how it works. In this sequence, what we’re gonna discuss are the change data capture components that Replicate provides: thinking about the batch initial load, the change data capture, and changes to the DDL. When you store this you want to be able to automatically merge to Delta, transform that Delta, and create an operational data store. That data store can then be used within a Databricks notebook to build Spark ML and MLflow integrations, which we can then render back through Qlik Sense. What this product does is take the ingestion, whether it’s a real-time stream or the bulk load, into the bronze layer of Delta tables: both notifying on the actual files, the tables, and the metadata associated with DBFS, and merging into that component. In this section, we’re gonna be talking about how to create a connection to Databricks. We’re gonna be loading into ADLS and we’re also hooking into the metastore. As we configure that operation we’re going to work within the Replicate console, as you can see here. In the Replicate console, we’ll start with a new task, which we’ll call sparksofademia. I’m gonna incorporate some of the store changes functionality to carry all the history of the transactions that have occurred prior. In that operation, we’re gonna configure some of the endpoints, such as the SQL Server. But we can connect to a bunch of different platforms, whether they’re on-prem or in the cloud. We’re then gonna target the Delta component from our Replicate console and connect to it. So we bring over our two sources and targets. And what we’re gonna do is tune that, both for the full load, to optimize the loading functionality, as well as the number of tables that we’re gonna pull from. Some of them are smaller, so we’re just gonna extract them very quickly. I’m also gonna incorporate some of the store changes functionality.
Which is aware of the partitioning requirements on the system. As you bring these data sources over, now we’re selecting multiple schemas and multiple sets of data. I can also adjust my columns as I need to, such as maybe doing some math functions, concatenating first name and last name, or putting in datetimes and inputting some new records as you go. What I’m gonna do now is kick this off to do some bulk loading via Replicate. As we’re loading these records, we’re gonna be loading about 7.8 million records in this operation. Let’s get this started. We’re gonna click start processing, which begins a loop over all the tables, 74 tables. We’ve loaded that into what we call the perf bronze schema inside Delta. And what we’re gonna do is bulk load all of these over. As you can see on the screen, there are quite a few of the smaller ones, in the thousands and couple hundred records each: often lookup tables or dimension tables, the dimensions in the data warehouse. But as we’re loading these operations, what you’re gonna see is we start to hit into the millions of records, such as the fact and sales information records. You see right here you’ve got sales order detail at about 4.8 million, and another one, order header, which is about 2.2 million. These bulk loads are really bringing over a lot of records. You’re seeing 2,000 records a second, and it’s moving across. As we process this operation, most of the data comes over in a compressed format. Those are the two components of a Replicate task: there’s the integration to the source database, which works as part of the native database, and then there’s the in-memory task of transferring the data in a compressed and encrypted format.
Some of this data, as it’s almost completely finished now, will be both mapped for common data types between the two systems, and handled in batch with the generation of code native to a bulk copy, often like the BCP bulk copy command. And as we’re almost done with this one, I think this is sales order header, and it completes, we’re gonna discuss creating a replication task. Now that we’ve got all that over, and that lasted all of about five minutes, we’re gonna go into part three, which talks about capturing, merging, and managing those real-time changes. As we capture and manage these, what we’re gonna be looking at is the deletion of millions of records that you can see right here. Secondly, we’re also gonna run a bunch of inserts. I think on one of the tables I put a million and on another one, a hundred thousand. And in this utility, we’re going to kick off and manage multiple tables that are listening for inserts, updates, and deletes: CRUD operations. When it’s running these operations, what Replicate does is become a native deployment of the source system. For instance, we’re hooking into a SQL Server, so it’s using its own inherent ability to understand the SQL engine. Not the replication API, but a very core integration to the SQL Server engine. It’s like the binary log reader, which is one of the key components of reading the Oracle database, or certain messages inside of a mainframe. This listening component then gives you that really powerful extractor. And when we’re generating a bunch of the code and extraction components, you’re seeing millions of rows of truncation. It already captured what was in memory, so now it’s cached the change records to disk. It’s read them out of the database and put them into the transfer cache of the Replicate engine, with the command criteria that it will then generate when it loads into the target Delta. What we’re looking at now is the optimizations.
You can see that the cluster’s moving along really, really well. And I’m gonna go into the database and run a count, and I’ve got 4.8 million records in there. Or I go to the address table and I can show that we’ve collected sample data from those routes. Thanks everybody for joining us. We’d love to hear your stories on how you’re doing the digital transformation and decoupling your data.
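The "store changes" option mentioned in the demo keeps a history of every change alongside the merged target, so prior states remain queryable. A toy sketch of that idea, with illustrative field names (these are not Replicate's actual change-table header columns):

```python
from datetime import datetime, timezone

# Toy "store changes" pattern: in addition to applying each change to
# the live target, append it to a history list with operation metadata.
# The history then supports point-in-time and audit-style queries.
def store_change(history, op, row):
    """Append one change record, tagged with its operation and a timestamp."""
    history.append({
        "change_oper": op,  # 'I' insert, 'U' update, 'D' delete
        "change_ts": datetime.now(timezone.utc).isoformat(),
        **row,
    })

history = []
store_change(history, "I", {"order_id": 1, "status": "NEW"})
store_change(history, "U", {"order_id": 1, "status": "SHIPPED"})

# The live target holds only the latest state; the history holds both.
latest = {h["order_id"]: h for h in history}
```

The merged table answers "what is true now", while the change history answers "what happened and when", which is what makes the time-based analysis in the demo possible.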

About Jordan Martz


Jordan Martz works at Qlik as a Principal Solutions Architect within the Partner Engineering team, focusing on data lake and data warehouse automation, specifically on SAP systems, designing real-time streaming applications, and supporting analytic roadmaps for consulting partners. Previously, he was a Sr. Solutions Architect for Databricks, on their partner engineering team. Ironically, before Databricks he worked for Attunity, which Qlik acquired, as the Director of Technology Solutions.

About Ted Orme


Ted is Head of Data Integration Strategy for EMEA at Qlik, supporting all parts of the business, working directly with sales, marketing, alliances, and product management to take forward the vision of accelerating business value through data. With over 20 years of industry experience, he is a leading spokesperson within the data integration space.