Accelerate Spark-Based Analytics by Minimizing the Barriers of Large, Complex Hadoop Migration

Enterprise organizations recognize that on-premises Hadoop environments have become overly expensive to run and maintain and cannot effectively meet the demands of today’s Data + Analytics teams. Business-critical initiatives require moving petabytes of data to the cloud in pursuit of digital transformation, cost optimization, IT agility, and the implementation of AI and machine learning. However, migrating data from Hadoop to Azure to begin data modernization and Spark-based analytics is often stalled by the complexity of migrations. WANdisco and Avanade have developed a methodology to minimize this migration risk and begin petabyte-scale data migration with zero downtime, zero data loss, and faster time-to-value.

Speakers: Paul Scott-Murphy and Alan Grogan


– Hi, this is Paul Scott-Murphy from WANdisco. I’m responsible for WANdisco’s product strategy, and have been working towards the recent announcement of the open preview of our Azure-native service for Hadoop migration, which we call LiveData Migrator for Azure. WANdisco builds technologies that help organizations accelerate their adoption of cloud services and infrastructure at scale. And I’m really pleased to be joined today by Alan Grogan from the data and AI business at Avanade. Alan brings vast experience of working with companies that want to place cloud strategy at the center of their data and AI planning. He’s got a really keen understanding of the need for data teams to combine the right type of support tools and data to power their transformation strategy. Now, the need for solutions that bring big data to the cloud is broad and increasing. So Alan and I will talk in some detail about how and why organizations are moving away from platforms like Hadoop on-premises, and why the cloud is such a fundamental part of that trend. Alan will cover a bit about how Avanade’s clients are moving their critical and non-critical business systems to Azure, and about the decision processes that govern options on how those migrations are run. So Alan will describe the methodology that Avanade brings to their clients to reduce risk and accelerate cloud adoption, and how that can take the organization’s goals into account. So the topic of conversation here is around accelerating the adoption of Spark-based analytics by minimizing the barriers associated with large and complex Hadoop migration. Behind all of that, of course, is the people aspect. It’s not all about IT strategy, risk mitigation, cost reduction and expanding on opportunities, but it can also be about the people involved, right? Their individual goals and drivers, and the opportunities that cloud migration presents them personally. 
So to begin with, WANdisco has taken a unique approach in developing, with Microsoft, a native Azure service to bring on-premises Hadoop data sets to Azure without downtime or disruption. We call this LiveData Migrator for Azure. It can be used in exactly the same way as Microsoft’s own Azure services, providing deep integration with the Azure portal, with Azure security, configuration and management, and indeed with metered billing that makes its use a very standard part of your Azure subscription. So with that in mind, Alan and I will talk jointly about what the barriers are around migrating large and complex Hadoop environments to modern analytics built on Spark in the cloud. I’ll discuss briefly the central importance of eliminating the constraints that come with inadequate tools and technologies for large scale data migration, and some of the innovation that WANdisco has led with Microsoft’s help there. Alan will talk more to the aspects around successful change programs for data modernization that use the cloud, what they look like, some of the driving factors behind cloud migration, and the types of things that can be achieved by reaching that target state. So to start with, the concept of migrating big data to the cloud is increasingly fundamental when planning data modernization. The scale at which modern data analytics and machine learning platforms operate makes the decisions on how to approach that really critical for organizations. The disruption to traditional IT practices that the cloud has introduced has really established two speeds of IT outcomes, with those organizations that can’t or won’t take advantage of the types of gains that modern AI, machine learning, and cloud infrastructure make possible really falling rapidly behind those organizations that do take advantage of the same. Beneath all of that, of course, data is the common foundation that underlies those trends. 
The exponential growth in data, access to it, and the volumes of data with which organizations need to work make possible the gains that come with the benefits of large scale cloud adoption. But the scale of that data can also be the anchor that holds back cloud adoption. So let’s look at some of the key factors that drive that adoption process for organizations with large data sets. With your data in the right place, of course, pretty much anything is possible. That’s normally described in terms of refining the processes that rely on those data sets, and in terms of capitalizing on the data itself, reducing time to insight, reducing time to action. And I’m sure you’ll hear from a broad range of presenters in the conference overall about the advances being made in Spark-based analytics infrastructure with Databricks and other platforms, but our concern today, and the topic of discussion, is the adoption of those platforms. So WANdisco works with organizations that are challenged by the complexity, risk, and cost of cloud migration. A few examples of organizations using WANdisco’s LiveData cloud services include, in the first instance, a fast-moving consumer goods and retail organization founded in the U.S. Northwest. They needed to adopt Azure Databricks on Azure Data Lake Storage Gen2, but had about three petabytes of data held in another storage system. So in addition to the benefits that would come with the cost reduction of that move and the process optimizations they could enable along the way with that move to a more modern data architecture, they also needed to address the ability to maintain continuous access to that data lake, both for their own business and those of their supply chain and other commercial partners throughout the migration process. 
To do that, they took advantage of WANdisco’s LiveData platform to complete that migration without disrupting the use of their existing enterprise data analytics platform, allowing them to continue to ingest data and report and analyze it throughout. A second good example of the types of organizations WANdisco works with is a domain name registrar with customer data spanning about two and a half petabytes of content, residing in an on-premises Hadoop deployment with a need to move to cloud infrastructure. That included a shift from a homegrown approach to operating Spark and Hadoop environments on-premises to public cloud services. Really changing from an organizational ethos of building the infrastructure and processes for themselves to better taking advantage of standard cloud services that operate at scale along the way. Similar to the retailer, the continued availability of those source data sets throughout the migration was paramount. In this case, their source environment consisted of a 700-node cluster, performing on average about 25,000 data changes every second as the source system. Their goals included the initial migration of the most critical elements of those data to the cloud in just a matter of weeks. A third example of an organization using WANdisco’s cloud services is a dominant U.S. telecommunications provider, obviously with access to very large amounts of bandwidth, but faced with the challenge of migrating tens of petabytes of constantly changing Hadoop resident data from an on-premises cluster of about 1,000 nodes. Elements like call data records and the like being updated in real-time with constant feeds of information. So you can understand the common challenges that surround these types of organizations, and the sorts of things that they need to overcome at the technical level are the types of elements that WANdisco’s technology is designed to address. So some of those include the security of data, right? 
How do you ensure that data under migration retain the appropriate security controls around access, encryption, data lineage and the like, even while those data sets are being transferred between platforms that may differ significantly in the mechanisms that enforce those security controls? The timeliness of migration. Cloud migration projects, of course, don’t come without completion goals. While many organizations look to adopt a hybrid architecture, retaining both on-premises systems and cloud in perpetuity, others look to eventually cut off their existing on-premises platforms and are normally faced with timing challenges around that. Perhaps driven by vendor contractual constraints or their strategic business goals as part of a larger transformation outcome. A third key challenge that we see commonly is around business continuity. Of course, at scale, and even with those other goals in place, migration takes time. And during that process, really no organization operating at scale can afford the luxury of business downtime just to service the migration. And that really makes traditional approaches to data migration entirely inadequate, as they can require you to stop the ingest or modification of the data that are under migration. So WANdisco provides technologies to help solve those problems, fitting that in as part of a larger approach and a strategy around migrating to the cloud. So our role in that is the provision of the LiveData platform, including its incarnation as a native Azure service to help eliminate some of the risks and challenges around those key elements, including the business continuity by making it possible to migrate data sets even while they continue to be used. The timeliness of that migration. By being able to take advantage of as much network capacity as you like, online Hadoop migration to the cloud can actually be achieved much more rapidly than approaches that take advantage of physical transfer devices. 
And then thirdly, the security of those systems. By implementing deep integration with the target environments and by supporting the transfer of the metadata that surround Hadoop on-premises, the permissions, the ACLs that sit around data at scale, along with leveraging secure transfer protocols, WANdisco’s technology in the LiveData platform makes it possible to eliminate some of those risks and challenges. So we really see the market for cloud migration evolving over time between three phases. And WANdisco’s technology attaches itself to each of these phases, with cloud migration, a hybrid operation, and multi-cloud being the key stages that we see organizations moving through. But in general, this is very much an emerging market and an emerging challenge for many organizations. Most companies or businesses that operate significant data sets on-premises residing in Hadoop are really only recently facing the challenge of adopting the cloud at scale. But of course, we have to note that there are exponential trends involved on the target side; improvements in cloud scalability, the functionality and services really make the cloud migration increasingly important for every organization. So I’ll hand over to Alan to bring his experience and insight from Avanade to expand on some of the trends that I’ve introduced and describe how minimizing the barriers of cloud migration is more than just the selection of the right technology partner. And really to talk to how Avanade, Microsoft, and WANdisco together can offer some important advantages for a move to the cloud with a methodology that eliminates risk and can help you begin today. Over to you, Alan.
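
Editorial sketch: Paul's points on timeliness and business continuity rest on simple arithmetic. Bulk transfer time scales with data volume divided by usable bandwidth, and a source that keeps changing accumulates new changes for as long as the transfer runs. The Python below is a minimal illustration of that arithmetic only, not a description of how LiveData Migrator works. The 2.5 PB and 25,000 changes-per-second figures come from the registrar example in the talk; the 10 Gbit/s link speed, the 0.8 efficiency factor, the decimal petabyte (10^15 bytes), and both function names are assumptions made for the example.

```python
SECONDS_PER_DAY = 86_400


def transfer_seconds(petabytes: float, gbits_per_sec: float,
                     efficiency: float = 0.8) -> float:
    """Seconds needed to move `petabytes` of data over a network link.

    Assumes decimal petabytes (1 PB = 1e15 bytes) and that only a fraction
    `efficiency` of the nominal link speed is usable (hypothetical value).
    """
    total_bits = petabytes * 1e15 * 8
    usable_bits_per_sec = gbits_per_sec * 1e9 * efficiency
    return total_bits / usable_bits_per_sec


def changes_accumulated(seconds: float, changes_per_sec: float) -> int:
    """Number of source-side changes that occur while a transfer is running."""
    return round(seconds * changes_per_sec)


if __name__ == "__main__":
    # Figures from the talk: ~2.5 PB, ~25,000 changes per second.
    # The 10 Gbit/s link is an assumed value for illustration.
    secs = transfer_seconds(petabytes=2.5, gbits_per_sec=10)
    print(f"bulk transfer: {secs / SECONDS_PER_DAY:.1f} days")
    print(f"changes during transfer: {changes_accumulated(secs, 25_000):,}")
```

Under these assumed inputs, the changes that arrive during the bulk copy number in the tens of billions, which is the reason a one-off copy that requires freezing the source does not fit this kind of workload. Doubling the usable bandwidth halves both figures.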

– Thank you, Paul. Hello, everyone, my name’s Alan Grogan, and I’m here with Avanade. So thank you for joining this presentation. I’m here to just go through a little bit more on the approach methodology, but before I do that, I want to go into some of the background of why Hadoop migration and data modernization is so important today. So let’s start with an icebreaker. Let’s be clear about it. There is really no digital transformation without cloud adoption. Business challenges that we’re facing today are unprecedented. And many of those I’m listing just now, you should be experiencing in your current day-to-day activities, whether you be in the business, back office or middle office of your company or institution. Now, when you look at the area that we’re focusing in on here on Hadoop migration and data modernization, well, Hadoop really hasn’t helped us, has it? Because it’s suffering significantly with performance. It doesn’t scale too well. It’s too complex. Its development cycles are too long. And actually, dare I say, Hadoop now is legacy. So when you stand back and actually look at the Hadoop-based data and analytics platforms that you’re dependent upon, it’s not really helping us. When you then look at other things like financial risk, and then within the area of funding, money’s tight today and money will continue to be as organizations re-pivot their business models to be a lot more platform-centric. Now, your on-premise, the on-premise way of doing things really doesn’t help us when you’re trying to move onto an OPEX funded business or platform model. So Hadoop, again, doesn’t really cut it for us today as cloud companies increasingly look to support funding of OPEX via CAPEX and vice versa, via innovations or investment schemes. Again, I just don’t feel that Hadoop is there to help us. You see, leaders really want to double down on operational excellence. Too many CEOs, CDOs, CDIOs, CTIOs, they all want operational excellence. 
It’s what I see, what I hear right across Europe. What they really want is three things. They want time to value a lot quicker. They want to redefine their operations and their processes end to end. And they really, as I mentioned earlier, they want to pivot from business-based models to more platform-centric models. Again, ask yourself, if you’re working, developing alongside Hadoop, it’s not really helping you. You see, when you think about maturity or AI maturity across five dimensions as we do, you’ve got to have the real vision around taking your tech stack with you. Now, Hadoop, it was great, and we were all drinking, I guess, the Hadoop Kool-Aid six, seven years ago at those great events. But we were acting on this wonderful premise that Hadoop was the be-all and end-all for the foreseeable future. Strategy has changed. Obviously Databricks has revolutionized the market and we see new things like Azure Synapse now coming to build alongside capabilities like Databricks. Hiring and re-skilling. Well, if you’re a Hadoop developer you’ve done quite well, but actually, the world has pivoted because of some of those risks and issues, costs and challenges that I’ve mentioned. So from a maturity perspective, Hadoop isn’t something I’m seeing great investment in. Digital ethics we’ll skip, but this is an area that I’ll be covering in a separate presentation with Databricks in this conference. Technology and process, both in the data supply chain. Well, they’re very critical and very important to me and to Avanade, as clients increasingly look to the next wave of innovation, the next wave of value, and those three things I talked about from a leadership perspective, how you can really truly drive operational excellence. So over the past two years we’ve been doing a huge amount of research with our customers at Avanade. Over 1,700 participants at a leadership level have been interviewed. 
And we’ve got some really interesting insights regarding their thoughts and views on the data supply chain specifically. Now, the first point is 94% of those respondents are seeing the data supply chain as being the most critical aspect for them to scale their AI, their AI needs, AI vision, across their respective organizations. Now, to scale AI, those of you listening in who’ve successfully done it, you know it’s not easy, and there’s clearly a lot to learn along the way. But the complexities and the long development cycles with Hadoop, to name but two issues, mean that Hadoop really doesn’t cut it when organizations look for a dependent or dependable stack to grow, develop, scale, and then evolve their business with. Data quality continues to be an issue when it comes to organizations wanting to scale and develop. So that’s point two. Now, Hadoop, again, it’s not a stack that complements end to end, whether it be a master data manager or a process of master data management. It needs additional components. It’s not a be-all and end-all unified data analysis platform the way, say, Databricks is. So what we say is that organizations who are adopting Hadoop are starting to think more on that unified level as they move forward. And then finally, just over 90% of organizations, they do get that data and AI supply chain will be very important once they have adopted AI and move forward with it. So I’m going to spend a bit of time now just explaining the next piece of what our research tells us. So we spent a bit of time asking the same respondents about where their destination clouds would be for their migrated data supply chains and analytical processing. So as Avanade, we support Microsoft, Microsoft Azure end to end. So we’re very glad actually to see the results that Microsoft was a number one primary destination for all this workload. Secondly, followed by Google. Thirdly, IBM, somewhat surprisingly. And then fourthly, AWS. 
Now, I posted the results of this survey on LinkedIn about one or two months ago. There was a huge discussion about the detail behind it, but it’s true. And so, congratulations to Microsoft for all the great work in this space. And specifically Microsoft with Databricks supported by WANdisco as a partnership to make this a real destination that organizations can trust. Now, the whole approach to Hadoop migration is this. You see, in Avanade, we follow a six-stage process. We like to take a phased approach to assessing, with our clients, the right introduction of, for us, Azure. Because there’s a lot of research and because we’re a Microsoft-centric company, we look to introduce an Azure-dominant agnostic data platform as an extension of the client’s existing ecosystem. So whether you be a hybrid or multi-cloud organization going on that journey, this is something we would encourage you to embrace as the first step. The second part then is about helping you productize your data, your data sets and models, into those business domains that reflect the data flows rather than, shall I say, being business organized where we are today. The third step in that process to Hadoop migration is modernizing your tools and your platforms for the consumers of the data directly from the cloud. Eventually that’ll allow you to reverse engineer the logic from on-premise legacy data warehouses to the cloud. The next step would be incrementally redeveloping using modern data movement methods to directly allow you to get access to those source systems. And then finally you’re ready to retire, to decouple the legacy Hadoop or monolith system, as my colleagues would say. Now, this type of approach Avanade calls, we haven’t got it down to a nice acronym there, so apologies, but it’s the Cloud Data Co-Existence Extension Pattern. It’s what we’re using for some of Europe’s largest companies as they look to decouple themselves, or should I say, unshackle themselves from Hadoop. 
Which has done very good things for them up to now, but really, the next wave is to embrace things like Azure Databricks. See, that’s where WANdisco comes in for us. You see, there are all those silos. We continue to be in this area in 2020 about business or technology silos. Unfortunately, Hadoop is one of them due to the reasons I explained. And not only is it a silo, it’s an expensive, complex, and should I say, difficult position to find yourself in. So as long as you’re not investing in Hadoop, you’ve got options clearly for you while listening to this presentation. But WANdisco for us is a really strong accelerator and we look to help clients drive that cloud adoption. You see, we continue to see and continue to drive the value for our clients as they move across to Azure Databricks or Azure Databricks plus Synapse. You really are getting that time to value, those redefined operations. You’re getting cost savings of 70, 80, 90% over legacy stacks. And they clearly are then being enabled to move to the next level of their transformation. So before I hand you back, I just want to finish on this quote from Gartner back in May this year, which ties to what Paul talked about, the proportionality or at least the exponential dependency of the types of things you want to be doing on your stack today. So what this really ties to is that the business value of AI for your organization will be proportional to how thoroughly you reinvent yourself. Now, to reinvent, you can’t do it with Hadoop, in my personal and my professional opinion. You need now to start to let go and look at those options, whether you want to do it on a two-speed IT basis, or if you’d like to engage in something a bit more bespoke, clearly WANdisco and Avanade can help you on that journey. I’m gonna now hand you back to Paul. Thank you very much.

– Thanks very much, Alan. And thanks too to the audience for your participation today, listening to our presentation on accelerating Spark-based analytics by minimizing the barriers of large, complex Hadoop migration. Hopefully you’ve come away with some insights around how a transformation strategy led at the business level in combination with the right underlying technologies for cloud migration can help your organization better adopt cloud services. We’ll be available, I think, for the next few minutes to help answer any questions that come from the presentation, and we look forward to seeing you. You can contact Avanade or WANdisco at our websites. And of course, WANdisco at the Twitter handle you see here. Thanks very much.

About Paul Scott-Murphy


Paul has overall responsibility for WANdisco’s product strategy, including the delivery of product to market and its success. This includes directing the product management team, product strategy, requirements definitions, feature management and prioritization, roadmaps, coordination of product releases with customer and partner requirements and testing. He was previously Regional Chief Technology Officer for TIBCO Software in Asia Pacific and Japan. Paul has a Bachelor of Science with first class honors and a Bachelor of Engineering with first class honors from the University of Western Australia.

About Alan Grogan

Avanade, Inc.

Alan is passionate about using the latest modern technologies responsibly to reduce organisational wastage, ineffectiveness and overhead. His tenacity and foresight have helped major organisations including Airports, Banks, Media, Manufacturing and Government departments change their operations to better use data, and have seen him ranked in 2020 by DataIQ as the #1 Data and Analytics Leader in the UK for client enabling/advisory.