Data + AI Summit 2023
JUNE 26-29, 2023

Privacy Preserving Machine Learning and Big Data Analytics Using Apache Spark

On Demand

  • Session
  • Virtual
  • Data Security and Governance
  • Intermediate
  • 40 min


In recent years, new privacy laws and regulations have fundamentally changed how data and privacy must be protected, posing new challenges for data applications. To address these privacy and security challenges in the big data ecosystem without impacting existing applications, several hardware TEE (Trusted Execution Environment) solutions have been proposed for Apache Spark, e.g., PySpark with Scone and Opaque. However, to the best of our knowledge, none of them fully protects the data pipelines in Spark applications: an adversary may still obtain sensitive information from unprotected components and stages. Furthermore, some of them greatly narrow the range of supported applications, e.g., by supporting only SparkSQL.
In this presentation, we present a new PPMLA (privacy-preserving machine learning and analytics) solution built on top of Apache Spark, BigDL, Occlum, and Intel SGX. It ensures that all Spark components and pipelines are fully protected by Intel SGX, and existing Spark applications written in Scala, Java, or Python can be migrated to our platform without any code changes. We will demonstrate how to build distributed end-to-end SparkML/SparkSQL workloads with our solution in an untrusted cloud environment and share real-world use cases for PPMLA.

Session Speakers

Qiyuan Gong

Software Arch


Chunyang Hui

Software Engineer, Occlum Developer

Ant Group
