Privacy Preserving Machine Learning and Big Data Analytics Using Apache Spark
- Data Security and Governance
- 40 min
In recent years, latest privacy laws & regulations bring a fundamental shift in the protection of data and privacy, placing new challenges to data applications. To resolve these privacy & security challenges in big data ecosystem without impacting existing applications, several hardware TEE (Trusted Execution Environment) solutions have been proposed for Apache Spark, e.g., PySpark with Scone and Opaque etc. However, to the best of our knowledge, none of them provide full protection to data pipelines in Spark applications. An adversary may still get sensitive information from unprotected components and stages. Furthermore, some of them greatly narrowed supported applications, e.g., only support SparkSQL.
In this presentation, we will present a new PPMLA (privacy preserving machine learning and analytics) solution built on top of Apache Spark, BigDL, Occlum and Intel SGX. It ensures all spark components and pipelines are fully protected by Intel SGX, and existing Spark applications written in Scala, Java or Python can be migrated into our platform without any code change. We will demonstrate how to build distributed end-to-end SparkML/SparkSQL workloads with our solution on untrusted cloud environment and share real-world use cases for PPMLA.