Accelerate Spark-Based Analytics by Minimizing the Barriers of Large, Complex Hadoop Migration

May 27, 2021 11:35 AM (PT)

Enterprise organizations recognize that on-premises Hadoop environments have become overly expensive to run and maintain and cannot effectively meet today’s Data + Analytics team demands. Business-critical initiatives require moving petabytes of data to the cloud in pursuit of digital transformation, cost optimization, IT agility, and implementation of AI and machine learning. However, migrating data from Hadoop to Azure to begin data modernization and Spark-based analytics is often stalled by the complexity of migrations. WANdisco and Avanade have developed a methodology to minimize this migration risk and begin petabyte-scale data migration with zero downtime, zero data loss, and faster time-to-value.

In this session watch:
Timur Mehmedbasic, North America Data Platform Modernization Lead, Avanade, Inc.
Tony Velcich, Senior Director, Product Marketing, WANdisco



Tony Velcich: Hello everyone and thanks for joining us today. My name is Tony Velcich. I’m Senior Director of Product Marketing at WANdisco. With me is Timur Mehmedbasic. He is the North America Data Platform Modernization Lead at Avanade. So, we’re all familiar with the importance of data and how leading organizations are data-driven. As a result, we’ve really seen a greater number of data modernization initiatives so that organizations are better able to take advantage of all of this data that’s available to them. This usually means migrating the data to the cloud. However, these cloud migrations can be complex. So, during our session today we’ll provide you with insights and some important considerations for de-risking and accelerating your Hadoop to cloud migrations.
In terms of the agenda, Timur will begin by covering the first three points on this slide: the business challenges, the Hadoop migration drivers, and the business approach to Hadoop migration. I’m then going to cover the next three points, which drill down into the Hadoop data migration challenges, what you should consider in order to address those challenges, and the new capabilities that we’ve recently introduced in our product to enable a direct data migration to Databricks. With that, I want to turn it over to Timur to cover the first portion.

Timur Mehmedbasic: Thank you, Tony. As you said, let’s start with sharing the context: what are the business challenges that we’re seeing with our clients, what are some of the Hadoop migration drivers, what is our approach to Hadoop migration, and what are some of the key things to consider? As you shared, I’m with Avanade and I lead our data modernization business in North America. In terms of business challenges today, what we are seeing is an unprecedented range of concerns emerging with clients across the business landscape. A couple of things I want to bring into focus are specific to on-premises Hadoop workloads and the migration and modernization topics specific to Hadoop clusters.
One, starting with operational risk, across performance, scale, complexity, development cycles, and legacy risk, we see these topics emerging in our discussions. Whether it’s an IT risk, a vendor risk associated with existing Hadoop cluster deployments, or a people risk associated with skilling, managing the skills, managing the performance, and addressing the complexities of deployment and ongoing operation needed to sustain these commitments. That extends into the financial risks: the financial risks of legacy models that are primarily tied to CapEx and large upfront investments, versus the business cycles and operating models our clients operate in today, which are predicated on cloud-native capabilities and OpEx, a model that reflects the priority on innovation and the ability to quickly pivot and address emerging business needs. So, with that business context in mind, what are we seeing from leaders across different clients?
Whether those are CDOs, CIOs, or leaders across business operations, et cetera, there are really three key themes that we see emerging. One, time to value is critical. Two, refined operations and processes, enabled by operating in a new DataOps and DevOps model. Three, improved platform-centric business models, informed by data, with AI and applied intelligence applied across the business value chains to inform and power next-generation workforce and customer experiences. In building that with new capabilities in place, in building the organizational muscle for maturing AI capabilities across the enterprise, we really see five categories emerge.
It starts with strategy, and that strategy includes recognizing how to unwind the existing platform commitments, understanding where the data gravity is and what the key sources powering the existing capabilities are, and then how to carry those over into the new environment. Next, talent and culture: hiring and reskilling talent, and evaluating the work in change management. Then new skills and new IT: DataOps, DevOps, digital ethics. Then technology and process: AI platforms, machine automation, and business process applications. And underpinning all of these new models across customer and workplace experiences are the data foundations.
Those data foundations are predicated on systems like Databricks, working in concert with adjacent applications and others, really to provide the foundations that support innovation and growth in these new emerging models we’ve talked about. So the ability to manage, use, integrate, and secure data and analytics is really foundational to supporting new business models and change. We recently did a survey across a number of our clients, with 1,700 clients participating across 15 countries and 19 industries, and there are some common themes that we see emerging. 94% of the respondents see the data supply chain and analytics as the most critical factor in scaling AI within their organization. 65% of the participants reported that their organization’s data quality needs improvement when it comes to supporting AI. And over 90% of the organizations see the data supply chain as very important when working with AI.
Meaning, in order to enable AI, the data foundations of these new industrialized data pipelines are predicated on a new set of capabilities. Enabling those typically starts with a migration, migrating the existing legacy commitments, and on-premises Hadoop clusters are part of that migration landscape. So in terms of how we approach Hadoop migrations, what is the business approach? It really starts with introducing an Azure domain-agnostic data platform as an extension. We think in terms of shared data infrastructure in coexistence with on-premises capabilities. It also includes aligning data and modeling, or grouping that into business domains that reflect specific data flows. It also covers modernizing tools and platforms for the consumers of the data from the cloud, meaning: what are the industrialized data pipelines, and where do business users participate in low-code authoring capabilities as part of that new data supply chain?
Then, with that foundation and the shared infrastructure in place, standing side by side with the existing commitments, we start reverse engineering the logic from the on-premises legacy integrated data stores to the cloud, to incrementally redevelop those commitments so we can get to the data from the data sources. There’s a really big focus on the business domains. Prioritize those based on the strategy that’s informing your overall migration, and incrementally stack them. De-commit from them: what are the consumption experiences that support the insight and action capabilities tied to these domains, and how do you rationalize the data infrastructure and organization, from users and experiences down to data sources, in order to eventually retire the legacy Hadoop systems and commitments? At Avanade, we call this the cloud data coexistent extension pattern, a means of de-committing incrementally from the legacy Hadoop systems and clusters that are governing our client landscapes today. So what’s the value?
Well, the business transformations are dependent on cloud adoption, and it really comes down to, or reflects, what we’re hearing from business and IT leaders: it’s time to value. It’s refining the operations and processes, meaning applying data and incorporating DevOps and DataOps into these new cloud-enabled operating models. And it’s improved platform-based business models, these platform-centric models that enable new experiences and new business models. Tying back to the Gartner conference last year: the business value of AI for your organization will be proportional to how thoroughly you reinvent your business in light of AI capabilities. That innovation, growth, and reinvention typically starts with a migration. Aligning the migration priorities with the capabilities enabled by WANdisco, to incrementally realize that business value by tying the new technology commitments to business objectives, is going to be the difference between success and failure. That’s consistent with what we are seeing with our clients. Tony, I’m going to hand it back off to you.

Tony Velcich: Great. Thanks, Timur. So you’ve heard Timur talk about the business challenges. I’m going to drill down specifically into the Hadoop data migration challenges: what makes these migrations so complex, and what important things you should consider when undertaking a Hadoop data migration project. The first consideration is really with regards to the scale of the data. These Hadoop implementations are typically very large: from hundreds of terabytes on the small end, to multiple petabytes or even tens of petabytes. While there are various options for migrating small data volumes, the same approaches don’t work well for large-scale migrations. But regardless of the approach, there are really two ways to transfer the actual data. You can do it over the network, and you can calculate how long this will take based on the network bandwidth and the volume of data that needs to be moved.
You can see from the chart on this slide, moving a hundred terabytes of data over, say, a one gigabit per second pipe will take about nine days. Migrating one petabyte will take over 90 days. Then if you’re talking 10 petabytes over that network bandwidth, it actually takes two and a half years: 926 days. These numbers assume that you’re utilizing the network at 100% capacity the entire time. Unless you have a dedicated network for the migration, there are other workloads that also need to use the network, so you’re not going to be able to use it at its full capacity. The alternative method is to use a physical data transfer device. Of course, this takes time as well. So in order to estimate how long that will take, you’d need to consider the transfer rate at which you copy the data onto the physical devices and how many devices you’re going to need.
For example, one petabyte of data usually takes about 25 devices. Then there’s how long it takes to ship those physical devices to the cloud provider’s data center, and again, the transfer rate to move that data off the devices and onto cloud storage. There are things you can do to make this process go faster, such as having a dedicated network with greater bandwidth, or using higher-scale transfer devices. However, eventually the laws of physics do come into play. Migrating large data volumes simply takes time, so you need to be able to plan for this.
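To make the arithmetic above concrete, here is a rough back-of-the-envelope sketch (a simple illustration, not a WANdisco tool). It reproduces the figures from the slides, assuming decimal units and 100% link utilization, and the roughly 40 TB usable capacity per device implied by the "25 devices per petabyte" figure:

```python
# Back-of-the-envelope migration timing, reproducing the figures from the talk.
# Assumes decimal units (1 TB = 10^12 bytes) and 100% link utilization,
# which real migrations rarely achieve.

def network_days(terabytes, gbps=1.0):
    """Days needed to push `terabytes` of data over a `gbps` link."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gbps * 1e9)
    return seconds / 86400

def devices_needed(terabytes, tb_per_device=40):
    """Physical transfer devices needed; ~25 per petabyte implies ~40 TB each."""
    return -(-terabytes // tb_per_device)  # ceiling division

print(round(network_days(100)))     # ~9 days for 100 TB over 1 Gbps
print(round(network_days(10_000)))  # ~926 days (two and a half years) for 10 PB
print(devices_needed(1_000))        # ~25 devices for 1 PB
```

Note that this only covers raw transfer time; for the physical-device route you would still add the copy-on, shipping, and copy-off stages described above.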
The next important consideration is with regards to the amount of ongoing data change. Over the years, these Hadoop environments have become business critical and very active. So it’s not just the volume of data that adds complexity to the migration, but really the amount of data change that’s taking place. This includes new data being ingested, as well as existing data being updated or deleted. The chart on this slide actually comes from one of our customers. We measured this customer’s workloads, and we saw peak loads approaching 100,000 file system events per second, and average loads over a 24-hour period of about 20,000 file system events per second. So, how do you perform data migrations when the data is actively changing like this? This issue was actually cited as the top challenge by respondents to a recent Hadoop migration survey that we conducted, and typically what we’ll see is customers choosing one of the following.
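As an illustration of how you might measure this kind of change rate yourself, here is a small sketch that buckets HDFS audit-log events per second. The log format shown is simplified and hypothetical; real hdfs-audit.log lines vary by Hadoop distribution, so the regex would need adjusting for your environment:

```python
import re
from collections import Counter

# Simplified, illustrative audit-log lines; real hdfs-audit.log entries
# differ by distribution, so adjust the pattern accordingly.
SAMPLE_LOG = """\
2021-05-27 10:00:01,123 INFO FSNamesystem.audit: cmd=create src=/data/a
2021-05-27 10:00:01,456 INFO FSNamesystem.audit: cmd=delete src=/data/b
2021-05-27 10:00:02,001 INFO FSNamesystem.audit: cmd=rename src=/data/c
"""

def events_per_second(log_text):
    """Count audit events per one-second timestamp bucket."""
    counts = Counter()
    for line in log_text.splitlines():
        m = re.match(r"(\S+ \S+?),\d+ .*cmd=\w+", line)
        if m:
            counts[m.group(1)] += 1
    return counts

rates = events_per_second(SAMPLE_LOG)
print(max(rates.values()))  # peak events/sec in this tiny sample: 2
```

Run against a day of real audit logs, the same bucketing gives you the peak and average event rates discussed above, which is the key input for sizing a migration approach.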
One, they simply don’t allow changes to happen. However, this means bringing down the production systems so that no data changes can occur. But like we saw on the previous slide, these data migrations for large volumes will take quite a bit of time. So can you really afford to bring the system down for, say, weeks or longer? For most organizations, that’s really just not acceptable. The next option, then, is to develop custom solutions to manage the data changes: either manually reconciling the changes that have happened, or developing their own custom scripts that re-scan the data and do multiple passes to look for changes since the prior pass.
But depending on the volume and the amount of change, one of the issues we’ve seen is that it may be impossible to ever fully catch up with this approach. Eventually we’ll see organizations again bringing down the system to catch up with the changes, so again causing business disruption. A third option is to leverage tools that are designed specifically to handle these data changes. That’s what WANdisco does. We provide solutions that handle large volumes of data while the data is changing. We’ll talk about this in a little more detail in a moment.
The third important consideration is regarding the number of manual processes or the amount of custom code development that’s required. These legacy migration approaches that we see customers attempting usually result in very significant impacts on their IT resources, and are the reason that most of these migrations go over time, go over budget, and may never even fully complete. In our survey, the complexity of handling the on-premises data changes and the number of IT resources required to perform and manage the data migration were actually cited as the biggest items impacting migration costs. The graph on this slide shows how our respondents indicated that they manage, or plan to manage, these data changes. You can see that over 50% do plan to use software to automate the migration of the data changes. However, even here you have to be careful, because one of the tools that enterprises attempt to use to help them with this is DistCp, or distributed copy.
It’s a free tool that’s included with the Hadoop distribution, which is why many organizations try to use it first. However, while the tool is free, like the slide says, there really is no free lunch. DistCp was designed for inter-cluster copying of static data, really at a specific point in time. It doesn’t support automatic migration of ongoing data changes. So again, handling those changes requires custom code and scripts to be developed around DistCp, to perform multiple passes and figure out what data has changed, and enterprises are often surprised at the complexity of developing these custom programs and the number of resources required to develop and continue to maintain this custom code. So really, automated tools provide a better option.
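To see why rescan-based approaches can fail to converge, here is a toy model (purely illustrative, not DistCp itself): each pass copies up to a fixed number of changed files found by the last scan, while new changes keep landing on the source. The migration only "catches up" if the copy rate outpaces the change rate.

```python
# Toy model of a DistCp-style rescan-and-copy loop. Each pass copies up to
# `copy_rate` changed files found by the last scan, while `change_rate` new
# changes land on the source in the meantime.

def passes_to_catch_up(backlog, copy_rate, change_rate,
                       cutover_window, max_passes=1000):
    """Passes until the remaining diff fits in one final cutover window;
    returns None if the backlog never shrinks enough."""
    for p in range(1, max_passes + 1):
        backlog = max(backlog - copy_rate, 0) + change_rate
        if backlog <= cutover_window:
            return p
    return None

print(passes_to_catch_up(1000, 200, 50, 100))   # copy rate wins: converges
print(passes_to_catch_up(1000, 200, 200, 100))  # change rate matches: never
```

In the second case the backlog holds steady at 1,000 files forever, which is the "never fully catch up" situation described above, and why teams end up freezing the production system for the final cutover.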
So that leads me to this slide, which really summarizes what we’ve been discussing. It shows how WANdisco is uniquely able to handle the two key dimensions we’ve been talking about: any volume of data and any amount of data change. So it really minimizes the costs and risks of these data migrations, which are associated with the other legacy approaches: they either handle only static data, or they require system downtime or multiple passes, or they are very resource intensive, requiring manual processes or large development efforts like we talked about.
The other thing that WANdisco provides with our solution is a freemium approach, or free trial, where organizations can begin to migrate their data for free, with zero risk, and then scale that up, without any new installations, really to the largest data volumes, thereby enabling WANdisco to provide the optimal solution at any point along this chart. Okay. While WANdisco obviously supports all of the major cloud service providers, including Amazon, Microsoft Azure, Google Cloud, and others, we have a very tight integration with Azure. So last year, I believe at this event, we announced that we integrated our LiveData platform with Azure resources such as the Azure Portal and CLI, security, manageability, billing, et cetera. This tight integration means that you can actually deploy and manage WANdisco services as you would really any other native Azure services.
Also, the billing integration means that customers are charged on their Azure bill on a pay-as-you-go model. We provide two core services on top of that platform. LiveData Migrator for Azure is used for the initial migration of data, as well as replication of all ongoing changes as they occur in the on-premises environment, replicating those into Azure Data Lake Storage. The second offering is LiveData Plane for Azure, which provides active-active replication across distributed on-premises and cloud environments. This really enables you to maintain hybrid cloud capabilities for as long as you need, where you can make changes in any environment and have those be replicated to the other participating environments.
By doing that, it leverages our distributed coordination engine to ensure 100% consistency across all of those participating environments. A big focus for us with LiveData Migrator for Azure has really been to make it very easy to use, to make it a self-service solution. This slide summarizes the three steps required to deploy and begin the migrations, all from within the Azure portal; the screenshots here are of the Azure portal. From there, you can get the product, which creates the resource in Azure and provides simple instructions on how to deploy the LiveData Migrator agent on the edge node of a Hadoop cluster. Once deployed, LiveData Migrator automatically discovers the source HDFS cluster. Then you use the portal to define your target environment and define your migrations. You can define exactly what data to migrate and which data to exclude; it’s very flexible that way. Then you start the migration, and basically that begins moving data from on-premises into ADLS.
Okay. Then lastly, I wanted to provide you with information on the direct migration capabilities that we recently introduced in our core LiveData Migrator offering, and which will soon be available in the LiveData Migrator for Azure version as well. This provides what we call the last-mile transformation of data directly into the format required by the Delta Lake tables running on Databricks. This further decreases the time, cost, and complexity of migration by leveraging a single tool to manage both the Hadoop data and Hive data migration all the way into Databricks. And by fully automating that data migration, your resources, or SIs such as Avanade, can focus on the workload migration aspects, as Timur discussed, and on the more strategic AI and ML development, thereby really shortening the overall time to value of the migration.
So in summary, you heard Timur talk about the business challenges organizations are facing, the value that AI can provide, and really the importance of the data supply chain in being able to scale AI within your organization. I then drilled down into the Hadoop data migration aspects. We talked about three important considerations: the volume of data to be migrated, the amount of data change, and the manual processes or custom coding requirements. I went over how automated migration tools, and specifically WANdisco’s products, allow these data migrations to occur while your production data continues to change. So production stays unaffected and continues to operate as normal, with no system downtime and no business disruption, thereby enabling your SIs and your resources to focus on the strategic development and to follow that incremental approach and optimization. Leveraging this approach, Avanade and WANdisco really do help organizations accelerate, excuse me, their move to Databricks. So thank you everyone. With that, I think we have some time for Q&A, so let’s see if there are some questions out there. Thank you.
