Building a Cross Cloud Data Protection Engine


Data Protection is still at the forefront of many companies' minds, with potential GDPR fines of up to 4% of global annual turnover (creating a current theoretical maximum fine of around $20bn). GDPR affects companies across the world, not just those in Europe, leaving many still playing catch-up. Additional legislation such as the CCPA is coming into force, making Data Protection a constantly evolving landscape, with fines that can cripple some businesses. In this session we will go through how we have worked with our customers to create an Azure and AWS implementation of a Data Protection Engine covering Protection, Detection, Re-Identification and Erasure of PII data.

The solution is built with security and auditability at the centre of the architecture, with special consideration for managing a single application across two public clouds, leading us to use Databricks, Delta Lake, Kubernetes and Power BI. We will deep dive into using Spark to implement multiple Data Protection techniques, and into how AI can become a game changer in detecting PII that has been missed in data. We will also explore how Delta Lake empowers us to share PII tokens between cloud providers with ACID transactions, auditing and versioning of data.
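The engine's actual tokenization code isn't shown in the abstract; as a hedged sketch of the general idea, here is a minimal deterministic HMAC pseudonymization with a token vault that supports the Protection, Re-Identification and Erasure steps named above. All names here (`protect`, `vault`, the key handling) are illustrative assumptions, not the session's implementation; in the architecture described, the vault would be a Delta table shared across clouds and the key would live in a cloud key vault.

```python
import hmac
import hashlib

# Hypothetical key for illustration only; in a real deployment this
# would come from a cloud key vault, never from source code.
SECRET_KEY = b"example-key-not-for-production"

def tokenize(value: str) -> str:
    """Deterministically pseudonymize a PII value with an HMAC.

    The same input always yields the same token, so tokens can be
    joined on and shared (e.g. via a Delta table) across clouds
    without exposing the raw PII.
    """
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).hexdigest()

# Toy in-memory vault mapping token -> original value. Keeping the
# mapping enables re-identification; deleting a row erases the link
# so the token can no longer be reversed.
vault: dict = {}

def protect(value: str) -> str:
    """Protection: replace a PII value with its token."""
    token = tokenize(value)
    vault[token] = value
    return token

def re_identify(token: str):
    """Re-Identification: recover the original value, if still held."""
    return vault.get(token)

def erase(token: str) -> None:
    """Erasure: drop the vault entry, irreversibly orphaning the token."""
    vault.pop(token, None)
```

Because the token is deterministic, downstream analytics can still group and join on the tokenized column while the raw value stays in the vault.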

With a final look at how Deep Neural Networks can be used to detect PII within data, this will be a demo-packed session. We hope this session shows you that Data Protection doesn't have to be an off-the-shelf black box: you can own the risk and the solution within your own platform, whilst remaining secure and compliant.
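The session's DNN-based detector is not reproduced here; as a hedged baseline for the detection step, a rule-based scan like the following is commonly run before (or alongside) a learned model to catch well-structured PII. The pattern set and function names are illustrative assumptions, not the session's code.

```python
import re

# A few illustrative patterns; a real detector (rule-based or DNN)
# would cover many more PII types and use surrounding context.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "uk_phone": re.compile(r"\b(?:\+44\s?\d{4}|\(?0\d{4}\)?)\s?\d{3}\s?\d{3}\b"),
}

def detect_pii(text: str):
    """Return (pii_type, matched_text) pairs found in free text."""
    hits = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((pii_type, match.group()))
    return hits
```

A learned model earns its keep on the PII this kind of scan misses: free-text names, addresses, and values with no fixed format.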

Speakers: Sandy May and Richard Conway


– Hello, everyone. Thanks for coming to my presentation. My topic is a cloud-native semantic layer on the Data Lake. Firstly, let me introduce myself briefly. My name is Dong Li. I'm the Head of Product at Kyligence, and also an Apache Kylin Committer and PMC member, and I worked for eBay and Microsoft before I joined Kyligence. Here's today's agenda. I will start with a real challenge from a customer, then talk about how to build a semantic layer to solve it with Apache Kylin on the Data Lake. Finally, we will have several minutes for Q&A. To better understand the background, let me briefly introduce a real customer we have served at Kyligence. This is a fast-growing SaaS provider in the US, which has more than 1,800 customers in 40 countries, and a third of the Fortune 500 companies use their products. As a result, they gather 8 billion transactions per year and serve dashboards over these datasets to end users for data analysis and decision-making. But they have challenges. They are using AWS RDS to store all the data and build materialized views to serve the dashboard queries. As the data volume grows, slow queries appear: most query latencies have become more than five minutes. More importantly, they need to spend at least four hours every day refreshing the materialized views. There are further issues too: lack of concurrency and lack of flexibility in the dashboards are also a problem, and with more users onboarding they need to develop and maintain more views in the system, which requires a lot of human effort. As a result, they wanted to upgrade the data platform to overcome these challenges: to provide flexible dashboards for the end users, to allow users to drag and drop predefined dimensions and measures easily, and to get high performance, with query latency of no more than two seconds.
They also needed high concurrency (more than 100 concurrent users), easy scaling on demand so that resources are used efficiently, low data-preparation latency (less than one hour), and minimal effort to adapt the system to new requirements, together with enterprise-grade security, AWS deployment and lower TCO. They also wanted to build an open platform for more innovative scenarios such as machine learning. In a word, they were expecting a unified data-as-a-service platform to help business analysts retrieve the data by themselves, with high performance and high concurrency. You may have started to think about solutions to these challenges; now let's talk about a solution with Apache Kylin. Apache Kylin is an open-source distributed analytics data warehouse, and like Hadoop, Spark and Flink it is an important component of the big data ecosystem. In this year's Data and AI Landscape, Kylin is listed in the open-source framework section together with Hadoop, Spark and Tez. So what does Kylin do in the big data world? Let's start from this architecture. Companies build their big data platform or Data Lake on Hadoop, Spark and even on the cloud, but there is always a gap between the data and the business, and Kylin is a good choice to fill that gap: it provides a data mart layer to connect the data and the business. In Kylin, users can sync table structures from Hive or other databases and then build an OLAP cube to serve BI analysis scenarios and data applications. The most important concept in Kylin is the cube, or, in OLAP terms, a data model, which defines the dimensions and measures and is presented as a semantic layer. Kylin pre-calculates the aggregation results for each scenario to speed up analysis performance.
As a result, query performance is usually less than one second, even on petabyte-scale datasets. As Kylin supports a standard SQL interface, it makes a very accessible data-as-a-service API for developers and analysts. Another key advantage of Kylin is high concurrency: thousands of concurrent users can be supported, thanks to the pre-calculation technology. The latest version of Kylin can also analyze batch data and streaming data together, which means that if data is ingested as a real-time stream, you can also query it with SQL in Kylin. To date, we have seen more than 1,000 users around the world, such as eBay, Cisco, Yahoo Japan, Badoo and so on; most of them are in China, the US and Europe. We talked about performance and concurrency just now; here are some benchmark results. We used the Star Schema Benchmark, a standard benchmark that includes a series of tables, a data generator that can populate a dataset of any size, and thirteen SQL queries against these tables. We compared query performance between Kylin and another SQL-on-Hadoop engine. In the left chart, each bar shows the latency for one SQL query: the green one is Kylin and the grey one is the SQL-on-Hadoop engine. We can see that Kylin answers all the queries in less than one second, while the other engine requires many seconds. The right chart shows how latency changes as the data scale increases. As we know, most SQL-on-Hadoop engines are based on MPP technology: to process more data they require more resources, and if resources are limited, latency grows. Kylin's latency, in contrast, stays nearly constant as the data scale increases. The most common use case of Apache Kylin is to serve interactive BI tools, such as Tableau, Power BI, Superset and so on.
With the standard SQL interface, Kylin provides ODBC and JDBC drivers for integration. No matter how many rows the underlying dataset has, the interaction is smooth. So how does Kylin accelerate these queries? Let's go deeper into the basic idea of pre-calculation. Firstly, Kylin uses Apache Calcite to parse the SQL and build a logical plan, a tree structure consisting of nodes such as table scan, join, filter, aggregation and sort. The time complexity of executing this plan directly is O(N), because every row in the original table must be scanned. But Kylin rewrites the plan to leverage the pre-calculated result, which we call the cube. In fact, Kylin computes the aggregation results for the different combinations of dimensions and saves them in the cube. At query time, the logical plan is optimized to scan only the rows already stored in the cube, and the number of rows in the cube is determined by the cardinality of the dimensions rather than the size of the input. So we can say the time complexity of this algorithm is O(1) with respect to the input size. Now let's look at the technical architecture of Kylin, starting from the bottom left. Kylin extracts the source data from Hive and Kafka, since Hive is the most common data warehouse on a standard Hadoop platform and Kafka is very popular for streaming. After the data is extracted, Kylin executes a series of MapReduce or Spark jobs to run the pre-calculation, and finally stores the results in HBase. At run time, the application sends a SQL query to Kylin, and Kylin translates the query logic into a series of scans and filters against HBase, with the Apache Calcite engine assembling the pre-calculated results. And finally the query is answered.
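The pre-calculation idea described above can be sketched in a few lines: enumerate every combination of dimensions (Kylin's cuboids), pre-aggregate the measure for each, and answer group-by queries by dictionary lookup. This is an illustrative toy under assumed table and function names, not Kylin's actual code; the point it demonstrates is that the O(N) scan cost is paid once at build time, after which a query costs a constant-time lookup regardless of the number of source rows.

```python
from itertools import combinations
from collections import defaultdict

# Toy fact table: (region, product, year, sales)
ROWS = [
    ("EU", "bike", 2019, 10),
    ("EU", "car",  2019, 30),
    ("US", "bike", 2020, 20),
    ("US", "car",  2020, 40),
]
DIMENSIONS = ("region", "product", "year")

def build_cube(rows):
    """Pre-aggregate SUM(sales) for every combination of dimensions.

    The build cost is O(N * 2^d) for N rows and d dimensions, paid
    once offline; each cuboid's size depends on dimension cardinality,
    not on N -- the idea behind Kylin's cuboids.
    """
    cube = defaultdict(lambda: defaultdict(int))
    for r in range(len(DIMENSIONS) + 1):
        for dims in combinations(range(len(DIMENSIONS)), r):
            cuboid = tuple(DIMENSIONS[i] for i in dims)
            for row in rows:
                key = tuple(row[i] for i in dims)
                cube[cuboid][key] += row[-1]  # the sales measure
    return cube

def query(cube, group_by, key):
    """Answer SELECT SUM(sales) ... GROUP BY <group_by> for one key."""
    return cube[tuple(group_by)][tuple(key)]

cube = build_cube(ROWS)
```

For example, `query(cube, ("region",), ("EU",))` answers `SELECT SUM(sales) ... GROUP BY region` for the EU without touching the source rows at all.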
Besides this, the latest version of Kylin has cloud-native features to run all of this on a cloud platform: for example, the cube data can be stored on S3, Azure Blob Storage and other object storage engines. Here is a use case from Yahoo Japan, the most visited website in Japan. They use Kylin for their reporting workloads, and performance improved from minutes to less than one second. They also have a rather good cross-region architecture; you can visit Kylin's website to read the blog post and learn more, so we won't spend too much time on it here. Nowadays, cloud-native has become the trend, even in the big data domain, and Kylin is also on that journey. Firstly, Kylin can shed legacy dependencies, for example by using Kubernetes instead of Hadoop, to become more lightweight. More importantly, Kylin can scale automatically, tuning the total resources to reduce the total cost. Together with compute-storage separation, users can scale out quickly to finish the pre-calculation workload, then scale in and keep a small cluster to answer the query workload, with all the data kept on object storage, so even when no compute resources are running, all the data is preserved. Beyond the technical side, we are now in the era of big data. More and more enterprises are going through digital transformation with a data-driven methodology. Industry reports show that data volume will keep growing in the coming years, and more and more data is landing on the cloud, because this is the era of the cloud. And what happens inside the enterprise? Chaos: data is stored across multiple cloud platforms and multiple data sources in the Data Lake.
There are so many options among engines and databases for storing and managing data, but the data consumers, the data analysts and business users, cannot find the right way to reach the data they really want. So, between the applications and the Data Lake, what is missing? Maybe a lot of human effort is spent writing code to do the ETL, or maybe you want to put yet another database engine in between. This is where Kylin comes in. There should be a unified layer here: a one-stop governed platform that packages the data as a service and provides a single source of truth to the business side, to be leveraged by business analysts and business users. They can define the datasets and the KPIs in the platform, translating the data from the technical view to the business view. In the eyes of the business analysts, they just need to work with the prepared models, and don't have to care about technical details such as tables and columns. On the technical side, the data models are pre-calculated as cubes with Kylin to improve interactive performance. As a result, with the standard SQL and MDX interfaces provided, applications and BI tools get a consolidated data view over the Data Lake, and really obtain a single source of truth. And as Kylin is good at performance, you also get high performance and high concurrency. This is the so-called Unified Semantic Layer. With a Unified Semantic Layer, it's very easy to use the data-as-a-service platform no matter which tool you use, maybe Tableau or Excel. You can see the demo here: we have built the dashboard in Tableau in the left chart, and in Excel in the right chart. No matter whether it is Tableau, Excel or another BI tool, they work with the same data model and the same business logic definitions from the Unified Semantic Layer.
Even when the underlying dataset is very large, say 60 billion rows in total, it can be analyzed interactively, even in Excel. You know that Excel will struggle as the data size grows; maybe tens of thousands of rows will kill Excel, but now 60 billion rows can be handled from Excel: with drag and drop in the Excel pivot table, it works very smoothly. So now let's come back to the expectations of the customer we mentioned at the beginning. What they want is a unified, flexible, high-performance data access service platform, and now they can just use Kylin, this open-source technology, on the cloud. One more case here: we have seen a user replace IBM Cognos cubes with Kylin. More than 1,200 IBM Cognos cubes were replaced with two Kylin cubes, finally achieving more than 1,000 times better maintenance efficiency and 10 times faster, and even more stable, performance. Finally, let's talk about the architecture. Once you have landed your data in the Data Lake on the cloud, you have centralized your data across different storage systems and engines, and you want better analytics performance and a consolidated data view over all the data in the Data Lake. To provide a unified BI experience, Kylin can provide the Unified Semantic Service Layer above it all, serving the whole range of application scenarios. That brings us to the end of today's presentation. Now we have several more minutes for the Q&A session. Thank you.

About Sandy May


Sandy is a Lead Data Engineer and CTO at Elastacloud, where he has worked for 4 years on myriad projects ranging from SME to FTSE 100 customers. He is a strong advocate of Databricks on Azure and of using Spark to solve Big Data problems, and he has recently become a Databricks Champion. Having worked on one of the original Databricks on Azure projects, he continues to expand his Big Data knowledge using new open-source technologies such as Delta Lake and MLflow. Sandy co-organises the Data Science London meet-up and continues to push what he picks up back to the community so they can learn from his mistakes. Having spoken at Spark Summit, Future Decoded, Red Shirt tours and more, his knowledge covers most of the Azure Data Stack, with a keen interest in Big Data, Machine Learning and Data Visualisation.

About Richard Conway


Richard Conway is a founder of cloud data consultancy Elastacloud specializing in big data on Microsoft Azure. He is a programmer and author of 22 years, having traversed languages such as C++, C#, Java, and more recently Scala, Python, and R. He is a regular contributor to open source projects including the Apache stack. In his spare time he runs the UK Azure Users Group covering all aspects of the Microsoft cloud and the popular kids programming event AzureCraft helping kids to build AIs in Minecraft.