Accelerate Analytics On Databricks

May 26, 2021 04:25 PM (PT)

Simplify Hadoop migrations to the Azure cloud to accelerate analytics on Databricks.

Enterprises are investing in data modernization initiatives to reduce cost, improve performance, and enable faster time to insight and innovation. These initiatives are driving the need to move petabytes of data to the cloud without interruption to existing business operations. This session will share some “been there done that” stories of successful Hadoop migrations/replications using a LiveData strategy in partnership with Microsoft Azure and Databricks, including:   

  • How to manage large-scale business-critical data migrations to the cloud with zero downtime and zero data loss even as the data is actively changing.
  • How to move your critical data to Databricks in weeks vs. months or even years.
  • How to minimize the risks associated with large scale data migrations by using a LiveData strategy to automate the migrations and enable your engineers to focus on new AI and ML development that drives innovation.
In this session watch:
Jeff King, Director, Microsoft
Paul Scott-Murphy, Chief Technology Officer, WANdisco



Jeff King : Hi everyone. It’s Jeff King here from Microsoft. I’m here with Paul. Paul, introduce yourself.

Paul Scott-Murp…: Thanks, Jeff. This is Paul Scott-Murphy from WANdisco. I’m the CTO, and I’m very pleased to be able to present today to the summit audience. Our topic of discussion is accelerating analytics on Databricks. We’re going to be talking about a whole range of agenda items, but mostly covering details of the native Azure service that WANdisco and Microsoft have been working on for some time. We’ll also explain how that fits into a cloud migration strategy, particularly talking about how Databricks is the target environment for analytics platforms that have previously been locked up on premises. Jeff?

Jeff King : Awesome, Paul. So, yeah, let’s just dive into it, shall we? As you mentioned, we’re going to talk about modernization and migration of your on-premises Hadoop workloads and how we’re going to help solve a lot of the problems and challenges with that, particularly with the data migration. And then of course, the session wouldn’t be complete without some really cool demos. So we’ve got some really interesting ones prepared for you folks. So let’s dive into it. Yeah. So some of the goals that you have with data modernization are, I mean, all pretty straightforward, right? You want to reduce costs. You want to maintain competitive advantage. You’ve outgrown your on-premises infrastructure, or it’s just become too costly to manage.
And cloud-scale analytics is ready for prime time. It’s mainstream, it’s enterprise grade, and the public cloud platforms as a whole have really matured and do provide a swath of analytics capabilities to meet each and every one of your needs. The challenge of course is moving your analytics workloads while keeping the proverbial lights on, right? And a good metaphor is moving a city without its inhabitants ever knowing. How do you move a city without anyone ever knowing, right? It’s really, really, really hard. Or how do you move a building, right? How do you move a building while it’s still functioning? I found this cool little video. Paul found the video, actually, but it’s very appropriate. Here is the Indiana Bell building that was moved in the 1920s, 1930s.
It was in the late twenties, and essentially it was so they could make room for the new building, right? But they couldn’t get rid of the old building because it was providing so much value. And so they literally picked up the building, which weighed around 11,000 tons, and people were still going to work in it. All the pipelines, all the infrastructure, everything was still connected while all that was happening. That’s essentially what a data migration is like. Yeah. It’s like moving a building, right? Or it should be more like moving the Indiana Bell building, right? Where people are able to go about their day, business processes are still working. They’re not interrupted. Dashboards are still being processed. And, yeah, business keeps going, right? Just like the Indiana Bell building.
The problem is that they’re not, right? Data lake migrations are not like the Indiana Bell building. They are challenging. They’re arduous. There is no good time to migrate, customers have high expectations, and so on and so on. So as platform providers, Microsoft, Databricks, and WANdisco are working together to provide a more seamless migration experience for each and every customer. Right. And so, yeah, let’s get into it. We’ll show you what we mean by that. So today we want to introduce the LiveData Platform for Azure, right? This is a turnkey solution for migrating your data. Here’s what I mean. Let’s go through a quick animation. So your data is sitting here in your on-premises cluster, but you need it here in ADLS Gen2, so you can take advantage of cloud-scale compute services and analytics.
So Azure Databricks, Synapse, HDI, and so on. How do you do that? Well, with LiveData Migrator for Azure. It’s pretty simple, actually. Your journey can begin either from the Azure portal or from the Azure CLI; the choice is yours, just like it is with every other Azure native service. You create a migration resource and provide the necessary configuration values. And then you deploy it. The Azure resource provider will then build a binary package, which essentially represents the Migrator agent, which you then download and install in your on-premises environment. And at that point, you’re off to the races. You’re migrating your data and your metadata changes seamlessly into Azure without any interruption or any downtime on your cluster whatsoever. It’s always running. It’s always on. Right. And the great thing about it is that it’s a native Azure experience.
So what do we mean by that? It means it’s in the Azure portal, as you can see live, right? It has an Azure resource provider. It supports ARM, right? You can deploy it using ARM templates, like you can any other Azure resource today. It is a metered service, meaning it follows the pay-as-you-go billing model. Like with every other Azure service, you pay for only what you use. And it’s the first of its kind, essentially, right? You’ve got resources that are being deployed locally, outside of the Azure environment, in the customer’s environment, but they’re still managed by the same Azure control plane as your other Azure services that are running in Azure. Right. Paul, do you want to add anything to this?

Paul Scott-Murp…: Absolutely, Jeff. Thanks very much for the introduction to the LiveData Platform for Azure. One of the unique elements about it, of course, is not just that it is a turnkey solution. You can introduce it without a change to your on-premises environment and without regard to how you need to configure and use the services on the Azure side; you can retain the use of those as you normally would. Just like the structural relocation of the Indiana Bell building, it can be introduced without change or disruption to that on-premises infrastructure. The applications that are operating there continue to work throughout that process and, besides allowing them to continue to function, the migration of data takes into account the changes that they make to those on-premises data sets while migration is underway. One of the other aspects that we are pleased to be able to talk about today is the capabilities that have been built into the LiveData Platform for Azure to specifically target Azure Databricks as the destination for your analytics environment.
Part of what we’ll show you shortly through the demonstration is exactly what you need to do to configure the platform in a way that allows you to immediately use data that are locked up on premises in Hive content, but access those directly from your Databricks workspaces. So the benefits that come out of the LiveData Platform for Azure are, obviously, business continuity, the ability to introduce the technology without any downtime or disruption to your on-premises infrastructure, but also that mechanism by which you can immediately access the data that become available in the cloud from platforms like Databricks. It does all of this in environments that are operating at scale. So we can talk to examples of customers that are using the technology at significant scale with large on-premises infrastructure. And it does it in an efficient way as well. It performs a complete and continuous migration of data while imposing only a single scan against the source environment.
It combines that scan with information about the ongoing changes to the data and metadata, to ensure that your target environment is brought up to speed as quickly as possible. One of the benefits of that, of course, is minimizing the risk associated with migration, because you’re not disrupting or changing your existing platforms in any way. Those office workers can continue to sit in their office, and the gas lines and electricity remain in place while the migration is in effect. So it de-risks the migration by allowing your operations to continue, and it automates it as well.
So the fact that it’s available as a native Azure service means that you can provision, manage, configure, and monitor this type of deployment in exactly the same way you would for any other type of Azure resource. You don’t need custom means of monitoring or managing the environment; you use exactly the same tool sets and infrastructure that you have in place for your existing Azure investment. All of those things combined mean reducing the time required to complete a migration, lowering the risk, and lowering the cost around it. Jeff, I’ll hand it back to you.
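
The idea Paul describes here, a single scan of the source combined with a feed of ongoing changes, can be illustrated with a toy sketch. This is not how LiveData Migrator is implemented; the function name `migrate`, the in-memory dictionaries standing in for file systems, and the `("put" | "delete", path, data)` event format are all assumptions made purely for illustration.

```python
from collections import deque


def migrate(source, change_events, target):
    """Toy model: one scan of the source, then replay of captured changes.

    source        -- dict of path -> bytes, standing in for the source listing
    change_events -- deque of (op, path, data) tuples captured during the scan
    target        -- dict of path -> bytes, standing in for cloud storage
    """
    # One pass over the source listing (the single scan).
    for path in sorted(source):
        target[path] = source[path]

    # Replay the changes that were recorded while the scan was running,
    # in the order they happened, so the target catches up with the source.
    while change_events:
        op, path, data = change_events.popleft()
        if op == "put":
            target[path] = data
        elif op == "delete":
            target.pop(path, None)
    return target


# Hypothetical data: two files exist at scan time, three changes arrive later.
source = {"/retail/a.csv": b"v1", "/retail/b.csv": b"v1"}
events = deque([
    ("put", "/retail/c.csv", b"new"),
    ("put", "/retail/a.csv", b"v2"),
    ("delete", "/retail/b.csv", None),
])
result = migrate(source, events, {})
```

The point of the sketch is only that the source is read once, and later changes arrive as events rather than through repeated rescans.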

Jeff King : Thanks. Thanks, Paul. So as we’re about to show you in the demos that we’ve prepared, I just want to highlight the fact that this is a turnkey solution, right? In a matter of minutes, you can deploy a migration solution that migrates your data seamlessly and captures every single change without any downtime or any interruptions in your on-premises environment. And, frankly, the cluster doesn’t even know the agent is running, right? It doesn’t need to know. The agent just needs to be able to capture each and every change and replicate those changes into ADLS Gen2. And it’s going to do this seamlessly, without you having any worry or concern about scheduling a batch or having to do this in offline hours or anything like that.
Some of the things that we’ll show you in the demo: obviously we’re going to show you the portal experience. We’re going to show you that real soon. You can go into the portal. It’s really simple. It’s just a few clicks, right? You create a migration, you create a target, and you create a source. And then you specify a few things, such as how you want to access your storage account, which is the target. For authentication or authorization, you choose whether you want to use the sort of old access key or the more enterprise-grade OAuth capabilities powered by Azure Active Directory. LiveData Migrator for Azure supports that, right? So you can create a service principal, give it various permissions on the storage account, and then tell LiveData Migrator to use that service principal, right?
That way, you still maintain control, always. You’ll be able to monitor all of the migrations and the status of all those migrations in the Azure portal. Or you could also use the CLI if you like as well. Most people choose the portal, so that’s why I’m always talking about the portal. Some of the other things we can talk about are logging and metrics. We’ll have that in the coming days. So it will have various migration-related metrics that are visible in the portal as well. And finally, if you have any support issues, if you need some help troubleshooting or you run into an issue, you can call Azure support, right?
You go through the same support experience as you would with any other native Azure service. You click that little question mark in the Azure portal and file a support ticket under the LiveData Migrator for Azure topic. So we mean very much that this is a native Azure service, even though it is built and operated by WANdisco. It is still a native Azure service by all measures, by all definitions. So it’s really exciting. I’m really excited. You just probably can’t tell because I’m a little under the weather right now, but this is actually a fantastic thing. So let’s just go ahead and dive into the demo. Paul, take it away.

Paul Scott-Murp…: The migration of data to the cloud begins, of course, with defining the storage account in which your data will be located. In my case, this is an ADLS Gen2 storage account in which I have a single data AI summit 2021 container, currently devoid of any data. It’s empty; there’s nothing within it. My demo environment of course includes a Hadoop source. In my case, this is an on-premises cluster. A quick review here of its content shows the CA demo data directory, in which I’ve got a collection of retail customer data. To actually perform our migration, we deploy an instance of LiveData Migrator for Azure. The process begins in the Azure portal, where we can search the marketplace for LiveData Migrator and then create an instance of it. And the experience that you work through here is exactly equivalent to using any other native Azure resource.
We need to provide details of the resource group into which it will be deployed and the region, and we give this Migrator instance a name. I’m going to call mine data summit. We can choose whether or not we want to use a test Hadoop cluster; in our case, we’ll run against my actual on-premises cluster. With that information provided, the Azure portal validates that I’ve provided the correct information. And we click create to begin the process of deploying that LiveData Migrator for Azure instance. This just takes a few moments in Azure. Once that’s done, we’ll then go on to configure and deploy an on-premises agent to allow the technology to scan and collect ongoing changes from my source Hadoop cluster. So that has been successful now, and we see the details updated there momentarily, including an installation command, which we’re going to use right now to deploy that on-premises agent.
So we copy the install command and paste that into an edge node of our cluster. It takes just a few seconds to deploy the agent. It doesn’t require any service restarts or any reconfiguration of that on-premises Hadoop environment. Absolutely no change there at all. The cluster and applications continue to operate as they did before. With that agent deployed, we refresh to see the status being up, and we can now configure LiveData Migrator. The first step is to define a target. This is going to reference our Gen2 storage account. We give it a name; I’m going to call it Databricks target. We choose that WANdisco Databricks storage account and the container within it. In my case, I’m choosing to authenticate using an OAuth 2 service principal that I’ve previously given the permissions to communicate with that storage account and create content within it.
I need to provide its secret. And then we review and create that target definition. Once the target’s created, which again just takes a moment, the final step in actually initiating migration is to create an instance of a migration. To do that, we return to the LiveData Migrator instance. Adding a migration is just a matter of me selecting the location within my source file system that contains the data I want to migrate. We give the migration a name and reference that target that I created a moment ago. And then we specify the path in my source storage.
I need to give the target a correct name first. So in this case, we’re going to choose that CA demo data directory. There are migration settings to decide what we do with existing data at the target, and whether we want to automatically start the migration and enable live migration, which I’ll come back to in a moment. Creating the migration with that auto start will immediately begin the process of migrating the data from my CA demo data directory into my ADLS Gen2 storage account and the container within it. The deployment of the migration, again, just takes a few moments and communicates with the on-premises agent, which is where the actual work takes place.
That’s done now. We can go to the migration resource and just view its status, showing that it’s now running, having successfully provisioned. With the migration in place, we can then monitor its progress. To do that, we can look at any of the metrics available from Azure. In my case, I’m going to go to the storage account itself, here looking at that data AI summit 2021 Gen2 storage account, or the WANdisco Databricks storage account in that resource group, I should say. And we can look at the ingress of data within it. So going here to the details of our ingress, we see the information present from just a few minutes ago. Let’s restrict that to the last 30 minutes and then refresh the view to get an up-to-date representation of the data ingest. So here we can see the non-refreshed view. I’ll click refresh. Where’s refresh? Up there on the top left. Click refresh. And we see there the data ingested in the last few moments as my migration has begun. I can, of course, review the content in the Gen2 storage account itself.
And we can see that data there. Equivalently, once I’ve got that data available in Gen2 storage, I can now access it from my Databricks environment. I do that as I would normally: I mount the Gen2 storage account, my data AI summit 2021 container in the WANdisco Databricks storage account, using the same service principal I referenced before. Once that mount has completed (there we go, it’s about to complete), I can issue any command in my Databricks notebook that would use the Databricks file system. And here I should see in just a moment, once the mount point’s complete (there we go), that listing the content in that CA demo data directory shows me that I have access to the same files that were previously held only in my Hadoop cluster. So all of the content is there. There are some other interesting things I want to show you as part of the demonstration. Creating data is one thing, but what if we want to access structured data? To do that, here I’m creating a Hive database in my Hadoop environment that we’ll use for test purposes, a retail database.
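
The mount step described above can be sketched roughly as follows. This is a minimal sketch rather than the notebook used in the demo: the helper names, the container and storage account names, and the mount point are hypothetical, and `dbutils` is only available inside a Databricks notebook, so it is passed in as a parameter here. The configuration keys are the standard ones for OAuth access to ADLS Gen2 with a service principal.

```python
def adls_gen2_oauth_configs(client_id, client_secret, tenant_id):
    """Build the Spark/Hadoop settings for service principal (OAuth) access."""
    return {
        "fs.azure.account.auth.type": "OAuth",
        "fs.azure.account.oauth.provider.type":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        "fs.azure.account.oauth2.client.id": client_id,
        "fs.azure.account.oauth2.client.secret": client_secret,
        "fs.azure.account.oauth2.client.endpoint":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


def abfss_uri(container, account, path=""):
    """Return the abfss:// URI for a container in an ADLS Gen2 account."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def mount_container(dbutils, container, account, mount_point, configs):
    """Mount the container and list it (run inside a Databricks notebook)."""
    dbutils.fs.mount(
        source=abfss_uri(container, account),
        mount_point=mount_point,
        extra_configs=configs,
    )
    return dbutils.fs.ls(mount_point)


# Hypothetical values; in practice the secret would come from a secret scope.
configs = adls_gen2_oauth_configs("<application-id>", "<client-secret>", "<tenant-id>")
source_uri = abfss_uri("<container-name>", "<storage-account-name>")
```

In a notebook, calling `mount_container(dbutils, "<container-name>", "<storage-account-name>", "/mnt/<mount-name>", configs)` would then return the listing of the migrated directory.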
Within that retail database, I’m creating a Hive table called retail customers with some structure around it. I’m referencing a location, that data AI summit 2021 location in my file system, that I am yet to migrate into my Gen2 storage account. So we’ll now create a migration for that Hive-specific data. Again, returning to the LiveData Migrator instance and the migrations resource type within it, we’ll add a second migration alongside the existing one. We would typically do this many times, to have a range of migrations in effect and to selectively choose which data to migrate. In my case, this is going to be my Hive data. So I’ll call it the retail migration and choose the same target storage, but in this case the location in which my Hive data are held. Once again, I’m going to start that migration automatically, but currently that location contains no data.
It’s an empty Hive table. We’re going to demonstrate through this process the live migration capabilities of LiveData Migrator by modifying the content in that Hive table after the fact. So the migration itself is deploying now. Once that’s complete, we’ll have a live migration in effect that will reference the Hive data and make it possible for me to access changing data from Hive directly from my Databricks environment. So just reviewing that now, we can see the migration has succeeded and the migration status is live. So now, before I return to my Hadoop environment, let’s just have a quick look, first of all, at the Migrator instance. I’m going to navigate my way through to the storage account, just to show you that we have two locations now present in Gen2 storage: my original CA demo data, and the location for my Hive table as well, which should currently be empty.
So just return to the container. Here we are, the data AI summit 2021. We can see my CA demo data directory here, once that refreshes, and the data AI summit 2021 location, which is currently empty. It has the retail location, but no data within it. There is no Hive content there. So now I’ll return to my Hadoop environment and actually ingest some data into my Hive table. In this case, I’ll do it with an insert command through Beeline, updating my retail customers table with a selection of data from other Hive content, just for the state of California, some subset of data, which will be inserted into my Hive table. We haven’t made any change to the migration, though. It remains in effect; it’s live. It will automatically detect that changing data in my Hive content and migrate it into the Gen2 storage account so that it’s accessible and usable immediately from my Databricks workspace. To support this, I’m going to show you a feature that is about to be made available in the LiveData Platform for Azure.
This is the ability not only to migrate data, as we see here, where that Hive data is now available in that Gen2 storage account, but also to migrate the metadata associated with that Hive table. So without that switched on, we can see the data representation not including the retail database. But if I enable the migration of Hive metadata, which I’ve done in the background, and just return to the data view in my Databricks environment, we see now the retail database present and the retail customers table within it. Again, this is a live migration. Any updates or new Hive metadata changes will be reflected immediately in my Databricks environment. With that in place, I can return to my notebook, of course, and query that structured data in Databricks notebooks or jobs, however I like. So in my case, issuing a structured query against the retail customers table in the retail database, I can see just that state of California present within it.
But live migration is in effect, so I can return to my Hadoop environment and make an update to my Hive content. In this case, I’ll do that by adding some Hive data. Let’s do that by inserting data not just for the state of California, but perhaps for any state that starts with a letter after L, right? So in my case, I’ve got retail customer data for all 50 states. We’re inserting a subset of it. There’s a range of ways LiveData Migrator for Azure allows you to be selective around data migration. We can do that using the selection by location in the file system, or perhaps by including an exclusion, which allows us to restrict the set of data based on additional knowledge I’ve got about what will come across. But because live migration is in effect, I can immediately query that data from my Databricks environment.
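
Selecting data by location and then narrowing it with exclusions, as just described, can be pictured with a small sketch. This does not reflect LiveData Migrator’s actual exclusion syntax or behavior; the function name `select_paths`, the glob-style patterns, and the sample paths are assumptions for illustration only.

```python
from fnmatch import fnmatch


def select_paths(paths, root, exclusions=()):
    """Keep paths under `root` whose file name matches no exclusion pattern."""
    root = root.rstrip("/") + "/"
    selected = []
    for path in paths:
        # Selection by location: only paths below the migration root qualify.
        if not path.startswith(root):
            continue
        # Exclusions: skip files whose name matches any glob pattern.
        name = path.rsplit("/", 1)[-1]
        if any(fnmatch(name, pattern) for pattern in exclusions):
            continue
        selected.append(path)
    return selected


# Hypothetical listing: only the first file is under the root and not excluded.
paths = [
    "/data/retail/part-0000.csv",
    "/data/retail/_tmp.part-0001",
    "/data/retail/.staging",
    "/data/other/x.csv",
]
to_migrate = select_paths(paths, "/data/retail", exclusions=["_tmp*", ".*"])
```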
We then see the outcomes, with live data and live metadata being made available in my Azure environment. So with that, I hope you get a good understanding of the simplicity and ease with which you can take advantage of this LiveData Platform for Azure. Now, it’s available for public use immediately. Anybody can access LiveData Migrator for Azure from the Azure marketplace. You can try it with up to 25 terabytes of data migrated for free, at no cost. As for other resources that you might want to call on, we have a link here to follow with additional information about the platform, or if you want to reach out to either of us, Jeff and I are both available. We can respond to questions for those that are following the presentation today; just reach [email protected] email address. Jeff and I are also presenting a short presentation at the summit, a lightning talk about the build versus buy options when it comes to cloud migration. So thanks very much for your time. And I hope it was very worthwhile.

Jeff King

Jeff has over 12 years working in the Azure storage team, currently focused on driving Big Data Performance testing, competitive analysis, and building a big data and analytics partner ecosystem aroun...

Paul Scott-Murphy

Paul has overall responsibility for WANdisco’s product strategy, including the delivery of product to market and its success. This includes directing the product management team, product strategy, r...