Decoupling Your Legacy On-Premise Data into Operationalized Lakehouses

May 26, 2021 03:50 PM (PT)

Qlik’s solutions include Data Integration functionality for ingesting, applying, cataloging, and analytics. During this session Qlik will showcase the solutions being leveraged by our mutual client JB Hunt to load data into Delta, which included Replicate and Compose for Data Lake generation, Delta Merge statements, Cataloging the system, and querying data in Qlik Sense for Analytics.

In this session watch:
Jordan Martz, Principal Solutions Architect, Qlik
Ted Orme, Head of Data Integration Strategy, Qlik



Ted Orme: Hi everyone. Welcome to this session, where we’re going to be talking about Decoupling Your Legacy On-Premise Systems, your Oracle sequel, SAP mainframe, and the like, and enabling real-time data into the Lakehouse. My name is Ted Orme, I’m Head of Data Integration Strategy here at Qlik and I’m joined by my colleague, Jordan. Hi Jordan.

Jordan Martz: Hey Ted. Good to see you, and a lot of old friends.

Ted Orme: What we’re going to cover today. I’m going to do a few slides at the beginning. Just talking around the industry as a whole, Qlik, the journey that we’ve been going on. Jordan’s going to do a deep dive into the technology, show a live demo, and we’re going to have some time for Q & A at the end.
First just a legal disclaimer. We’re Qlik. We’re a software company. We’ve been around for a very long time now, helping organizations run analytics. We’re a market leader with our analytics and visualization tools. That’s really what you may know us best for. What has changed is through acquisition. We now take a holistic view of an end-to-end journey of data. I come from the Attunity side of the business, now rebranded as being Qlik Data Integration.
What are the challenges we’re helping organizations on? Helping organizations truly understand their data better. Get the value out of their data, the business value. The challenges we see is these organizations often don’t really have access to the right data. They don’t trust the data that they have. They don’t have the tools or the skills in order to get the value out of the data.
How are we helping organizations? We’re helping organizations turn that data into business value with three core business components from Qlik. Data integration, data analytics, and data literacy as a service. Being able to have a conversation and actually then argue with the data.
How do we do this in the flow? By overcoming those challenges. Closing the gaps by the integration lab, that’s where the Qlik Data Integration is streaming real-time data out of operational systems. Bringing that data to the business. Bringing the right data, in the right format, to the business where the business can then drive action and insight out of that, because they have the skills and the tools associated. These three pillars are the flow that we can help organizations on, get the value out of the data. Freeing the data. Finding it. Running the analytics against that, then having that conversation with the data.
What we’re going to do, we’re going to look in a little bit more detail now around Qlik Data Integration, specifically around what we call Digital Decoupling. What does that mean? Digital Decoupling is where you’re building out applications. Building out other systems in order to better leverage that the core legacy platforms that you have today. Legacy platforms, which ultimately are there to run your business, sell more widgets, sell more policies, drive more revenue out of the organization.
CDC is the key component for Change Data Capture. Out-streams data out of those systems into those secondary, into real time, into the Lakehouse.
We see three types of organizations, and how they’re using their legacy platforms. What we call here, the Laggards. This is what people used to do, in that they have their core systems and they’d build more applications on that. You’d have a lot of costs associated with that. Building more applications directly on your source systems. The first stage of decoupling was using ETL tools, outputs to a warehouse, [inaudible] environments. They’re seen as being batch-orientated and quite separate from the core legacy systems.
What we’re really seeing is the innovators are the ones that are using real-time data. Streaming real-time data out of their Oracle, SQL, SAP, mainframes [inaudible] into a unified Cloud elastic structure. Then running the transformation there. This then actually bring the data closer. Brings the business closer to those legacy platforms. By the business coming closer to those legacy platforms, that is really then driving innovation and change where they can actually use the data better. Real-time is the key for that. Real-time then brings that data to the business, so they can drive innovation and change.
If we look at the benefits these three environments have, typically the costs. If you’re just building more applications on your legacy platforms, you’ll start getting a bit of benefit. But over time, the complexity, the business value will decline.They’re difficult to maintain. You’ve got to maintain all of those applications that are running directly against the mainframe, or against the on-premise environments. The early adopters, they exposed the data. They got the data into their Cloud environments faster. They got to see some value out of that, but it was still only batch. It was then still complex, is actually the true decoupled environments. The true innovators. stream real-time data into their Cloud environments. Then they can start building applications that are driving business value.
This is really where I love the quote that the CEO of MasterCard said a few years ago. “Now it’s not the big fish eating the small fish, the fast fish eating the slow fish.” You can do better things faster with your data. You’ll see the benefits, environments, and be more competitive in the market.
Now we’re going to quickly look at the three trends we see in the industry about, “Why?” What I mean by here is, “Why are organizations going through this journey?” The first one is around the application modernization. Here, everyone knows it’s this isn’t it? The mobile phone. Everybody has the mobile phone with banking applications on there. You need real-time data for that. The open-banking platforms that are being built now are streaming. You need real-time data. Streaming real-time data out of mainframes, out of Oracle, those billing systems, InstaCloud, building out applications. That’s been one of the real big drivers.
Second trend that we’ve seen, is around the modern data warehousing platforms. Whether or not this is Redshift, this is Synapse, this is Snowflake, and [inaudible], but the warehouse environments where also complex. Those Cloud Data Warehouses are driving innovation and change, but you don’t want to reuse your traditional ETL platforms.
If you’re going to reuse ETL, there wasn’t any benefit with going to the Cloud, those Cloud warehouse environments. Here, this is where we offer automation. They’re streaming real-time data, and then automating the value out of the data that you’re delivering. The data is being streamed and delivered, then made available to the business in a usable format.
That third trend is around the modern Cloud Data Lake. These are the lessons learned of some of the first generation of Hadoop and on-premise environments, but actually Cloud-based environments in [inaudible] here. This is the world of data breaks in the Cloud in order to really truly use the data better.
These three trends, application, modernization, data warehouse modernization, and the next generation of DLaaS all come together. What we used to see in [inaudible] used to come to IT and say, “Can I have this SAP data?” IT would go to an ETL vendor, move that data, and create another silo of data. Line of business would then have this line from source to target for that individual line of business. Then the next line of business would come and ask for data, and the next, and the next. What we now see is all of these applications, all of these use-cases coming together, in this Cloud-based environment for the Lakehouse on any Cloud on multiple Clouds.
This is where Qlik Data Integration are seen as being the leading independent vendor. Data Integration. Streaming out your most valuable data. You know what is your most valuable data. The transactional systems. The Oracle. The secret SAP. Stream directly into any Cloud environment, and then run the analytics at the end of that. This is the journey that Qlik are helping organizations on. Best of breed solutions, both for integration, analytics, and being independent to wherever the data lives today. Whatever type of data. Our uniqueness about bringing out the mainframe and SAP.
We’ll look at a little bit more detail in the demos. [inaudible] here about what the data is, and how we do that. We can deliver that data in a usable format to the business, whatever platform you’re building today, and then on the analytics against that. That really is how Qlik are helping organizations. Really showing the value of that end-to-end, the change in the way people think about this, as opposed to not just an IT challenge, not just a business challenge. Overall, “How best are you using data?” Enabling you to have better access to that.
Now we’re going to have a look in a little bit more detail, and look at some of the underlying of our products, technologies and all. I’ll hand over to Jordan to a deep dive into the products.

Jordan Martz: Perfect, Ted, thank you. What we’re here to talk about today is how Qlik & Databricks have a deep integration. When you think about the data challenges that Ted noted, what we’re looking at are a couple of different scenarios. The ingestion. The sequencing of the data. How you manage that. Where Delta really shines. The management. The governance. The integrity. Type of sequencing that you would consider.
Then the consumption layers. Our QDA analytics platform. When we look at our approach itself to a Lakehouse, how we take agile data operations. Being able to consume, whether they’re the incrementally loaded-in batch scenarios, that load into Delta files, they may have CDC integrated into them, and they should. When you’re looking at the raw Deltas, trying to consume them, and bring them back together into an arena like table, being a full history, evolved and merged to DLT, and then managed to handle for schema evolution history. That history may have certain considerations.
We described them as operational data stores, or historical data stores. That’s what you see right here. What that covers is the Type 2 history. The actual snapshots that allow you to fully understand the life cycle of that dataset.
When we look at this at our Databricks platform, when we’re taking in all of our sources and we’re loading from that same system… We’re going to load today into Delta. That’s what I’m going to demo. We’re going to load into the Delta actual Spark operations, and into the Delta tables. Updating the metadata. The underlying file system from one of the three Clouds. As well, we do this, we’re all about sequencing it and enabling it, so that you can take over with notebooks and do a ton of powerful machine-learning, scoring, and consumption of BI for that Lakehouse.
If needed, we integrate to the system via our catalog. It hooks in via JDBC to different systems. That also helps you to understand the full lineage of those tools. When we talk about CDC, this is a staple of our platform. When we talk about that replicate product, there are a couple of key sequences that are involved that make it unique. There’s the handling for schemas and data types. It will enumerate a list on the source to us, a list of corresponding target data types on that target. It has that very paralyzed and optimized, full bulk load.
At the bottom, where you really talk about the transformations, the filters, and the ability to handle for changes in DDL. That’s where the sequencing takes off. When the sequencing takes off, what we’re able to do is transform and change with the data. That time stamping, that leverage of what happened at the source, and the ability to replay that, evolves the current data architectures that we’re used to.
As Ted noted, when we talked about the Hadoop markets, now the Delta lake, and now even to the Lakehouse, that overall sequencing is incredible for that transformation. What we’re going to look at are what are these next generation lake houses? That acceleration to Cloud. Getting rid of components in the data warehouse for the object store. Processing it using Delta. Then merging those data architectures into that Lakehouse that we noted.
We’re going to talk about a customer J.B. Hunt. Joe was one of the ones that helped us through this journey. In this example, what they were looking for was their mainframe migration. They were talking about how they were going to load it previously, and incrementally. A daily load to a staged system that took… Once a day, loaded data, 30 minutes to refresh even the BI reports. What that included was a lot of time doing bolt-on reporting, manual processes, and configuring the systems for change consistently.
Overall, their scenario was changing their EDI. Their formats for how they collected the information. Their machine learning models, how they manage them overall. But really from a business standpoint, it was the assets. They’re a trucking company. The telemetry of those assets across the ecosystem, whether geospatial applications, or just the applications that run their business and the analytics within. There’s so much more to that journey.
Their overall journey, when you look at it, takes them into this pathway. A flexible ingestion pattern that’s maintained within their Delta lake, including mainframe sequel and the rest. A repository, whether they’re hitting their event hubs or going directly into ADLS and blob store, or their object store that they were using. They have their automation. Efficient, replication. And the monitoring they use. The ML Ops they use with MLFlow.
When you look at this… Let’s take it to an architectural diagram. Their current platform. Enriched data, pulled from DB2, using SQL servers replicated using ADF. A collaborative platform of multiple ETL and streaming tools. All loading to a actual Delta lake and Lakehouse. As it went through their bronze, silver, gold, they had a merging of a data warehouse. They had a model trading experience. They had containers serving that. A very rich and powerful experience, across a multitude of tools.
So, Ted, I just wanted to say, thanks for joining us. I’m going to hop right into the demo. So [crosstalk]

Ted Orme: Cool. One of the things I would just like… This really is that modern architecture that we see. Rather than having one use-case, having one way says, bringing the data in to the Lakehouse, and then bringing best data out of that. That’s your golden source. Golden source that sits there, that then serves as the business. In a trusted format. This really shines that whatever industry in whatever environment, whatever vertical you’re in, this is still the right architecture to meet that business case for each individual line of business, coming to try to get access to the data.

Jordan Martz: Perfect. This demo, what we’re going to do is… and I’ll go into it in a much deeper fashion next, but covering the CDC, the batch covering how DDL changes occur, and we’re going from bronze to silver. So here’s this merging functionality we’re going to generate that. All right. Well, thank you for today, Ted, going to the demo.
In this sequence, what we’re going to discuss are the change data capture components that replicate provides. Thinking about the batch, initial load, the change data capture, and the change of the DDL. When you’re going to store this, you want to be able to automatically merge it to Delta, transform that Delta, and create a real-time operational data store.
That data store can then be used within a Databricks [inaudible] to build Spark ML and MLFlow integrations, where we then can render them back through Qlik Sense. What this product does is it takes in the ingestion, whether it’s a real-time stream, or the bulk load, but it takes it into the bronze layer of Delta tables, both notifying the actual files, the tables, and the metadata associated DBFS. And as well merging into that component.
In this section, we’re going to be talking about how to create a connection to Databricks. We’re going to be loading into ADLS, and we’ll also be hooking into the meta store. As we configure that operation, we’re going to walk within the replicate console, as you can see here. The replicate console… We’ll start with the new task, which we’ll call Sparks Summit AMEA. I’m going to incorporate some of the store changes functionality, to carry all the histories of the transactions that have occurred prior.
In that operation, we’re going to configure some of the end-points such as this SQL server, but we can connect to a bunch of different platforms, whether they’re on prem or in the Cloud. We then are going to target the Delta component from our replicate console, and connect to it. So we bring over our two sources and targets. What we’re going to do, is we’re going to tune that both for the full load and optimize the loading functionality, as well as then the number of tables that we’re going to pull from. Some of them being smaller.
We’re just going to extract them very quickly. I’m also going to incorporate some of the store changes functionality, which is aware of the partitioning requirements on the system. As you bring these data sources over, now we’re selecting multiple schemas and multiple sets of data. I can also adjust my columns as I need to. Let’s maybe do some math functions, add first name, last name, or put in date times and imputed some no records as you’re going. What I’m going to do now is we’re going to kick this off to do some bulk loading via replicate.
As we’re loading these records… We’re going to be loading about 7.8 million records in this operation. Let’s get this started. We’re going to click this Start Processing, which begins creating a loop of all the tables. 74 tables. We’ve loaded that into what we call the perf bronze schema that’s inside Delta. What we’re going to do is we’re going to bulk load all of these over. As you can see on the screen, there’s quite a few of the smaller ones in the thousands and couple hundred records at each table. Often look up tables or dimension tables in the data warehouse. As we’re loading these operations, what you’re going to see is we’re going to start to hit in the millions of records, such as the fact and sales information records.
You see right here, you’ve got sales, order, and large, with about 4.8 million. Another one on Order Header, which is about 2.2 million. These bulk loads roads are really bringing over a lot of records. You’re seeing 2000 records a second, and it’s moving across.
As we process this operation, most of the data comes over in a compressed format. Those are the two components of a replicate task. There’s the integration to the source database and how it’s part of the native database. Then there’s also the in-memory task of transferring the data in a compressed and an encrypted format. Some of this data, as it’s almost completely finished now, will be both mapped for common data types between two systems, as well as then the handling of the batch and the generation of the types of code native to a bulk copy. Often like the BC Bulk Copy command. As we’re almost done with this one… I think this is sales order header, and it completes. We’re going to discuss creating a replication task.
Now that we’ve got all that over, and that lasted all of about five minutes, we’re going to go into part three, which talks about capturing, merging and managing those real time changes. As we capture and manage these what we’re going to be looking at is the deletion of millions of records that you can see right here. Secondarily, we’re also going to run a bunch of inserts. I think I got one of the tables, I put a million, and another one, a hundred thousand. In this utility, we’re going to kick off and manage both multiple tables, that are listing for inserts, updates, and deletes, crowd operations.
When it’s running these operations, what replicate does, is it becomes a native deployment of the source system. For instance, we’re hooking into a SQL server, so it’s using its own inherent ability to understand the SQL engine. None of the replication API, but a very cool or integration to the SQL server engine. They have binary log reader, which is one of the key components of reading the Oracle database, or certain messages inside of a mainframe.
This listening component then gives you that really powerful extractor. When we’re generating a bunch of the code and extraction components, you’re seeing millions of rows of truncation. It already captured what was in memory. Now it’s cached the change records into disc. It’s read them out of the database, and put them into the transfer cache of the replicated engine with the command criteria that it will then generate when it loads into the target Delta.
What we’re looking at now is the optimizations. You can see that the clusters moving along really, really well. I’m going to go into the database, and I’m going to run a count. I’ve got 4.8 million records in there. Or I go to the address and I can show that we’ve collected sample data from those reps.
Thanks everybody for joining us. We’d love to hear your stories on how you’re doing your digital transformation and decoupling your data.

Jordan Martz

Jordan Martz works at Qlik, as a Principal Solutions Architect within the Partner Engineering team, focusing on Data Lakes and Data Warehouses Automation, specifically on SAP system, how to design ...
Read more

Ted Orme

Ted is Head of Data Integration Strategy for EMEA at Qlik, supporting all parts of the business. Directly working with sales, marketing, alliances, and product management to take our vision to help...
Read more