How to Extract Deeper Value from Data in Legacy Applications with Analytics in a Cloud Data Lake

May 27, 2021 03:50 PM (PT)

Because of the cost-saving benefits, competitive advantages, and new market opportunities, many data leaders feel the pressure to accelerate cloud adoption. But implementing hybrid and multi-cloud strategies is daunting—incompatibility between diverse systems introduces too much complexity, and traditional ETL tools are slow and resource-intensive.


While you gradually adopt the cloud, you can still deliver immense business value and actionable insights without impacting existing on-premises SAP, ERP, CRM, HCM, SCM, and other enterprise applications. By building real-time data pipelines, you can meet current demands while setting the foundation for the future. 


During this session, you will learn:

  • How to implement a continuous data streaming method that securely and efficiently leverages data residing in SAP and other on-premises enterprise applications without impacting them
  • How HVR’s advanced log-based CDC technology securely accelerates high volume data delivery in complex environments, so it’s available in real-time for immediate decision making, advanced analytics, and ML/AI
  • Real-world examples of how hundreds of organizations leverage HVR to extract value from legacy systems to achieve a more dynamic and agile business
  • How to leverage Databricks’ open and unified platform to transform massive amounts of data for advanced analytics, AI, and machine learning
In this session watch:
Andy Kazmaier, Sales Engineer, Partners & Channels, HVR



Andy Kazmaier: Thanks very much for joining us. I’m so happy to be here at the Databricks Summit. My name is Andy Kazmaier, and I am a partner sales engineer with HVR. We are experts in log-based CDC replication. So today, we’re going to dive into how you can get more value from your own data that’s sitting in those legacy apps by leveraging modern real-time analytics in the cloud. So here is a quick agenda. First, we will address what the compelling reasons are, and there are quite a few of them, that encourage this adoption. Then, what are the major challenges you might face? What will your analytics store look like? And we’re going to take a look at some use cases or case studies, if you will, that will apply to your world. And then we’ll wrap up with an introduction to HVR.
So like I said, there are a lot of compelling reasons. You have new security frameworks that allow for easier administration. There’s better scaling, especially with a tool like HVR that offers a framework that distributes workload and reduces complexity, it gives you finer control. So that flexibility, you have more options for consumption. And that facilitates that self-service culture you’re looking for. And as many of you know, those legacy systems can be very expensive when you want to expand storage. There’s also differentiation. You can set yourself apart from your competition by embracing the cloud as a target for that data. So then, you can start to ask some questions, like, can I monetize this data? Possibly now that it’s more accessible, maybe you can use it in ways you couldn’t previously to generate those new revenue streams.
So there are a lot of different common questions, if you will, about those challenges. They all come down to these six, if you will. And obviously the big one, the one we’re going to start with is, what pieces do we move first? And maybe in a past migration, you move a Dev or a QA environment, or maybe you migrated some web servers from on-prem to Elastic, for example. But now, we’re talking about your analytics, your operational systems, and what we see is that the warehouse gets moved first or the analytics platform, if you will. And then, the operational systems come in a bit later. People also want to know what provider is right? What platform provider? Is it Google? Is it AWS? Is it Azure? And believe it or not, those are not the only ones. Those are just the big three, but there’s more. How do I keep that flexibility and maybe spread risk across multiple vendors?
I need to ensure that I can get my data back once I’ve migrated it. Also, what will that connectivity look like between the cloud deployment and those remaining on-prem systems? That’s an important consideration. How do I technically enable my teams? What does the budgeting look like? We’re not looking at those gigantic capital expenditures. We’re looking more at a subscription-based operational budgeting model, which could be different for people now. So we’re going to focus on what pieces to migrate first. I’ve already mentioned the warehouse, but now we’re going to add a twist and ask ourselves, do we go with that Warehouse? Or a Data Lake? Or a Lakehouse maybe? Before you can make that determination, it’s going to be helpful to understand the history behind these implementations. So, many of you have massive amounts of data measured in the terabytes. It could be dozens of terabytes or more. How do you move that much data?
It could be daunting. How do you keep that data synchronized without affecting those highly transactional systems that you want to protect from performance degradation? How can I verify that synchronization and know that my data is right? And then lastly, can I do this in a phased approach? Maybe you work in an agile environment. So the concept of sprints or incremental progress might be very familiar to you. That’s where we’re going to go with this. First, let’s talk about the operational systems where we’re getting this data from. So very commonly, it could be an ERP or a CRM system, and HVR sits in the space very well. It could be logistics systems or work order systems like Workday. And when the warehouses initially began, this was the kind of data that you included.
And you had those early leaders like Teradata and Netezza, and they were very purpose-built to provide reporting capabilities in a very efficient and specific manner. There were other ways as well: you might repurpose Oracle or SQL Server as a backend for a Warehouse. And they served those purposes very well. They were indexed very specifically for this purpose, and you would use them to build very structured reports. And then you might even break up that Warehouse into departmental silos that would become your data marts.
So this was really a great solution for many years, until someone asked, how can they bring in the rest of their data? Maybe you have log data or semi-structured or unstructured data. And now, you have access to social media data, and you want to be able to paint a fuller picture of your customer so that you can understand better how to appeal to them and make better informed decisions. So, as a result of this need, you saw the development of the Data Lake. They used relatively inexpensive commodity hardware. You had Linux servers that you clustered, and you could bring nodes up and down. You had resiliency and high availability, all with relatively inexpensive storage.
So forget the Warehouse, just throw all of your data into the Lake. And in that rush to consolidate all the data in one platform, we forgot about that primary use case where the business users are relying on very structured reports. So it wasn’t necessarily a great solution for them, and many of them started to see it as more of a data swamp. But the data scientists on the other hand, and those who are interested in machine learning and artificial intelligence, they were served very well by this development.
Another development right around this time, around the time of the genesis of the Lake, was the advent of Amazon’s S3 storage. And that facilitated the Lake, but in the cloud. So that drove Data Lake adoption, because you didn’t have to manage these clusters anymore. They’ll do it for you. So not only would they manage storage, but now you have even more powerful redundancies. You’ve got multiple availability zones. You’ve got backups that are being done even without your knowledge. And most of these features apply not just to AWS, but to all of the major cloud storage providers.
Not long after that, like I was saying, Azure followed up, you had ADLS and then Gen2, and Google has their Cloud Storage as well. So they’re all playing in this space. Now we have the latest entry, which is the Data Lakehouse. And this was pioneered by the team out of Spark who then created Databricks, and their implementation is called Delta Lake. And the Lakehouse is really giving you the best of both worlds to some degree. You can take your structured and unstructured data and put Spark processing and execution on top of it. So if you look at the evolution of Spark, you’ll know that they place a heavy importance on SQL and it really is a major differentiator.
If you know that you can get your data out of that lake by using SQL, then you know your reports are going to work, and they’re going to be much more efficient than running that SQL on top of Hive, which wasn’t optimized for that purpose. So the Lakehouse solves the problem and it even provides ACID-compliant transactions, which means that you can be sure about your data. And you’re still getting that advantage of being able to leverage it for machine learning and artificial intelligence.
So let’s take a look at some of the players in these spaces. And if you just look at the Warehouse, Redshift was really the first one in the cloud. Azure Synapse is relatively new. It plays really nicely with Databricks. You see lots of Data Lakes out there. S3 is obviously very popular. They’re an easy way to store lots of data in one place, but not necessarily easy to get access to that data in a way that covers multiple use cases. So these lake implementations also provide underlying storage for your Databricks Lakehouse. So they’re not going anywhere. HDFS is an outlier, it’s still around. I would say mostly in on-prem installations.
And obviously lastly, here we have on the right, you’ve got your Data Lakehouse that was pioneered by Databricks, like I said. They have a great white paper available on their website. I encourage you to read up on it. And like I said earlier, they were the creators of Spark and they are providing schema enforcement and full ACID compliance with SQL access to your data. And they have a lot of different deployment options, architecturally speaking. And lastly, I just want to remind you that the audience I’m speaking to is those enterprise organizations with terabytes and terabytes of data.
All right, we’re going to move into a case study. We’re going to talk about one of our customers who took their first journey into the cloud. Once you start moving to the cloud, it’s not normally a one-time lift and shift. It’s going to be an ongoing process. So you have all these different pieces out there, and you can see all these different apps, these arrows, these point-to-point connections, we’re moving SAP out to SAP and then into S3. Multiple tools, multiple extraction techniques, or integration patterns. You can see there’s trigger-based, there’s batch ETL, et cetera. So you have all these different pieces out there, right? But there’s also the human element. So people are expecting to be able to keep doing their jobs while you’re doing these migrations, right? Your businesses can’t afford long downtimes that cause you to lose customers or maybe lose traction to your competitors.
So let’s take a look at what the landscape looked like after we finished. So they obviously had a lot of pieces out there. We were able to help simplify that. Some of the tools they had, like I showed you with SAP, had a really complicated data path that required us to move data to intermediary sources and then to Redshift, just so it could be consumed by the BI users. And we were able to reduce this complexity through our extensive source and target data source support. So it’s still difficult to have that one tool that will do everything for you, but we reduced their reliance on multiple tools. And that improves speed. It improves the administration because there’s less configuration. It improves reliability. There are fewer layers, fewer systems having to talk to each other, and just overall user satisfaction is going to be improved.
So just to recap at a very high level, HVR as a company, high volume replication, that’s what we do. We deal with massive, massive amounts of data. We are a very specific pattern of data integration, which is log-based CDC replication. And we obviously feel that we do it better than anyone else. So back to our original challenge, how do I get that data into the cloud? Really, it’s a step-by-step process. So the first step invariably is to get a copy of that data into the cloud, right? If you’ve done this, you’ve been in data for a long time, you might be familiar with the term refresh. We also call it an initial load.
We do this very efficiently, including partitioning or slicing the data if necessary. We can move tables in parallel. We can move pieces of a single table in parallel, depending on resource availability. So for example, if you have a one terabyte table, we can slice it into 10 slices and move ten 100-gigabyte slices simultaneously to that target, reducing the time it takes and the network traffic. We also have some compression capabilities I want to talk about in a few minutes. So once you’ve done that initial load, now you can actually start to consume that data with your tool of choice. We have freed your data, but you are the experts in how you want to consume it. So maybe you have multiple consumption tools in your enterprise, like a Tableau, ThoughtSpot, Looker, Power BI, or maybe your data scientists use notebooks like Zeppelin.
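To make the slicing idea concrete, here is a minimal Python sketch of a parallel initial load. It is an illustration only, not HVR’s implementation: the names `slice_ranges`, `copy_slice`, and `parallel_initial_load` are hypothetical, the "table" is an in-memory list of rows, and the "target" is a plain dictionary.

```python
from concurrent.futures import ThreadPoolExecutor


def slice_ranges(total_rows: int, num_slices: int) -> list[tuple[int, int]]:
    """Split row positions [0, total_rows) into contiguous half-open ranges."""
    base, extra = divmod(total_rows, num_slices)
    ranges, start = [], 0
    for i in range(num_slices):
        # The first `extra` slices take one additional row each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def copy_slice(source: list, target: dict, bounds: tuple[int, int]) -> int:
    """Copy one slice of (key, value) rows into the target; return rows copied."""
    lo, hi = bounds
    for pos in range(lo, hi):
        key, value = source[pos]
        target[key] = value
    return hi - lo


def parallel_initial_load(source: list, target: dict,
                          num_slices: int = 10, workers: int = 4) -> int:
    """Move all slices concurrently; each slice touches a disjoint set of keys."""
    ranges = slice_ranges(len(source), num_slices)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        copied = pool.map(lambda bounds: copy_slice(source, target, bounds), ranges)
        return sum(copied)
```

With 1,000 rows and 10 slices, `slice_ranges(1000, 10)` yields ten ranges of 100 rows each, which mirrors the one-terabyte table split into ten 100-gigabyte slices described above. How many slices actually run at once is bounded by `workers`, which stands in for "depending on resource availability."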
Once you’ve started to consume that data, obviously that copy is going to start to lose its value very quickly if you don’t keep it up to date with the source. So we grab those changes, or capture them, if you will, in a completely unobtrusive way that does not affect the source systems, and we stream and integrate those changes with the target. If you are not sure of the quality of your data, that’s obviously not a good thing. So we include built-in validation and repair capabilities to give you that confidence that your data is, in fact, a bit-for-bit copy, and it’s intact.
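The capture-and-integrate step can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not HVR’s actual mechanism: the `Change` record, its `seq` ordering field, and the `apply_changes` function are all hypothetical, and the target is a dictionary keyed by primary key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Change:
    seq: int                    # position of the change in the source log (assumed)
    op: str                     # "insert", "update", or "delete"
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # new row image; None for deletes


def apply_changes(target: dict, changes: list, last_applied: int = 0) -> int:
    """Integrate captured changes into the target in log order.

    Returns the highest sequence number applied, so a later call can
    skip anything that was already integrated.
    """
    for change in sorted(changes, key=lambda c: c.seq):
        if change.seq <= last_applied:
            continue  # already integrated; replaying the batch is harmless
        if change.op in ("insert", "update"):
            target[change.key] = change.row
        elif change.op == "delete":
            target.pop(change.key, None)
        else:
            raise ValueError(f"unknown operation: {change.op}")
        last_applied = change.seq
    return last_applied
```

Only the changed rows travel to the target, and they are applied in the order they occurred on the source, which is what keeps the copy consistent without re-reading the source tables.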
Here’s another case study. One of our customers is an auction technology provider. So as you can see, they had 230 sources of data you can see here. Now, they didn’t move all of those at once. We prioritized them first, and then we moved the highest priority sources first, and then continued to move them incrementally like we’ve talked about before. When they were done, they had this new centralized reporting. They could easily feed back out to their customers as a great new value add for them. So now they have this new reporting capability. They make a change in one location and it propagates and shows up for all of their customers. And there was another added benefit that they derived from this migration. Their customers are largely not IT experts. And so now, their customers had a redundant data store in case they suffered some kind of catastrophic loss. They were able to offer that as a benefit where they could re-instantiate their customer’s data and resurrect their data. So it was a pretty cool added benefit.
Now, here is another customer who ended up using our solution to offload their data to the cloud, and then brought it back to a separate instance on-prem. So this architecture gave them a lot of flexibility with regard to how the replicated data could be consumed. That cloud instance on the right provides ease of maintenance and easily managed access for a variety of consumers, while that replication to a separate instance on-prem gave them the SLAs that they were requiring for their internal reporting. So this is just an example or illustration that architectures can be very wide-ranging, maybe even a little surprising, but they’re flexible enough to support a variety of use cases.
All right. So can we be a little more specific about moving that data? Yes, let’s do that. So let’s jump in and now I’m going to focus on HVR. These are really the three features, or not features, but main aspects of HVR. If you have needs that require a low impact on your source, require security, or require real-time access to data, then we are a good candidate for solving your problem. So that low impact reference means that only the changes are moved. So we look directly into those log files. We’re not querying your database, and we get those changes and, through the use of an agent on that source, we can encrypt and compress that data that’s going to be replicated. And we see compression rates of 10 to 20 times. So we can lower your egress costs, lower the time it takes to replicate, and lower your network traffic overall. And that plays into that middle point, being secure. That agent is also going to encrypt it. So not only can you pass this data secured by TLS, but we will also encrypt the traffic within that tunnel.
And lastly, we provide real-time access. By having an agent running on that source, we’re constantly looking at changes. Latencies in the seconds are possible. All right. This is a great slide that shows you just the versatility of HVR. So on the left side, you can see different categories of sources that we support. The databases are probably the most common use case for us, grabbing data from those legacy systems, whether it’s just a straight up Oracle system or maybe Oracle that’s underneath SAP. So any type of those relational traditional systems, we absolutely support. SAP is also a huge plus for us. People want to free their data from SAP, not be locked into their own analytics tools. We can export data to a Delta Lake, and now you can run analytics on your SAP data without going through the SAP ecosystem.
This is a basic architecture for HVR. It’s multi-tiered, if you will. So the hub is essentially sitting on its own infrastructure and is infrastructure agnostic, so it could be Windows or Linux in any of the cloud providers. It is essentially your orchestration layer. It has your repository, and it communicates with the sources and captures the data. You can see over on the left side, this little HVR icon represents an agent that sits on that source, so that it’s peeking into those log files. It’s compressing the data, encrypting it, sending it to the hub, and then the hub can send it to a variety of targets. We capture once, but we can write many times. So you could write that one set of data that you captured to two, three, or four targets.
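The "capture once, write many" flow, together with compression at the agent, can be sketched as follows. This is a toy model with assumptions stated plainly: `agent_package`, `hub_deliver`, and `compression_ratio` are hypothetical names, the batch is serialized as JSON and compressed with Python’s standard `zlib` module, targets are plain lists, and encryption is deliberately left out to keep the sketch short.

```python
import json
import zlib


def agent_package(changes: list[dict]) -> bytes:
    """Serialize and compress one captured batch before it leaves the source."""
    raw = json.dumps(changes, sort_keys=True).encode("utf-8")
    return zlib.compress(raw, level=6)


def hub_deliver(payload: bytes, targets: list[list]) -> int:
    """Decompress a batch once on the hub, then write it to every target."""
    changes = json.loads(zlib.decompress(payload).decode("utf-8"))
    for target in targets:
        target.extend(changes)
    return len(changes)


def compression_ratio(changes: list[dict]) -> float:
    """Uncompressed size divided by compressed size for one batch."""
    raw = json.dumps(changes, sort_keys=True).encode("utf-8")
    return len(raw) / len(agent_package(changes))
```

The source is read a single time in `agent_package`; adding a third or fourth target only adds another entry to the `targets` list passed to `hub_deliver`. The ratio this toy code achieves depends entirely on how repetitive the sample data is and says nothing about the 10 to 20 times figure quoted in the talk.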
Now, this next slide is a little repetitive, but it gives me a chance to review the step-by-step process. You start with the bulk copy of the data, the refresh, if you will, that initial load. And we can do that in parallel or not. It all depends on your architecture and your needs. And then, we begin to capture those changes from the source, and then we integrate them into the target, continuously streaming those changes. We validate, we have built-in validation. We do that based on a CRC check, and we can do it at the table level and say, these tables are all the same. If you find a table that’s different, then you could drill down and do it on a row-by-row level and find out where that issue is. And we can repair that for you. And then lastly, the port refers to upgrading your reports, porting them from the source location to the target. Really, just pointing them at the new location of the data, and then innovate analytics. You’ve freed your data, you have options now that you didn’t necessarily have before you took this journey.
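The table-level check followed by a row-by-row drill-down can be illustrated with Python’s standard `zlib.crc32`. This is a sketch of the general technique only; the talk does not describe how HVR computes or compares its checksums, and the function names below (`row_crc`, `table_crc`, `differing_keys`, `repair`) are hypothetical. Tables are modeled as dictionaries mapping a primary key to a row tuple.

```python
import zlib


def row_crc(row: tuple) -> int:
    """CRC-32 of one row's textual representation."""
    return zlib.crc32(repr(row).encode("utf-8"))


def table_crc(table: dict) -> int:
    """Running CRC-32 over all rows in key order: one number per table."""
    crc = 0
    for key in sorted(table):
        crc = zlib.crc32(repr((key, table[key])).encode("utf-8"), crc)
    return crc


def differing_keys(source: dict, target: dict) -> list:
    """Row-level drill-down: keys missing on either side or with unequal CRCs."""
    keys = sorted(set(source) | set(target))
    return [
        k for k in keys
        if k not in source or k not in target
        or row_crc(source[k]) != row_crc(target[k])
    ]


def repair(source: dict, target: dict) -> list:
    """Compare at table level first; drill down and fix rows only on mismatch."""
    if table_crc(source) == table_crc(target):
        return []  # cheap path: the tables look identical
    bad = differing_keys(source, target)
    for k in bad:
        if k in source:
            target[k] = source[k]  # re-copy a missing or changed row
        else:
            del target[k]          # remove a row the source no longer has
    return bad
```

A CRC is a compact fingerprint, so equal checksums indicate a match with very high probability rather than certainty. The design point is the same one made above: the cheap table-level comparison runs everywhere, and the more expensive row-level comparison runs only on the tables that disagree.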
So here’s the last slide. Just one more recap. We’re going to reduce any kind of overload on your systems by accessing that data from a different source. We move terabytes and terabytes of data, and we do it every day. You can accelerate those queries, because now you’re not limited by that production source. So that’s it in a nutshell. Thank you so much for your time. I encourage you to visit our friends at Databricks, check out their white papers, their primers, their webinars. Please email us at HVR for more information, check out our website. We have tons of webinars, demos, test drives, where you can experience our product for yourself. Stop by our booth to learn more. Thank you and take care.

Andy Kazmaier

As a seasoned technology professional with 20+ years of experience as an architect and engineer, Andy enjoys problem solving and working closely with partners to identify and solve technical challenge...