Ilai is a Big Data Developer at Nielsen, responsible for building massive data pipelines that stream huge amounts of data (~250 billion events/day). His team's projects run on AWS, using Spark on EMR, serverless Lambda functions and Kubernetes. He has a B.Sc. in Computer Science and started his programming career 13 years ago. He then moved into the Big Data area, which he loves. He is especially passionate about tackling complex problems, building huge pipelines and sharing his knowledge with others.
November 18, 2020 04:00 PM PT
Spark is a beast of a technology and can do amazing things, especially with large datasets. But some big data pipelines require processing the data in small chunks, and running those chunks through a large Spark cluster can be inefficient and expensive.
In this talk we’ll describe a system we’ve built using many independent Spark clusters running in parallel, side by side, in a serverless style. We run them on a Kubernetes cluster, but don’t confuse this with Spark on Kubernetes, which runs one large Spark cluster on Kubernetes. Our system scales up and down on the fly by spinning independent Spark clusters up and down, and is capable of processing huge amounts of data at an affordable cost.
We’ll walk you through the reasoning behind this unique Spark serverless architecture, its benefits and how we went about building it. You’ll learn how to evaluate your own Spark cluster architecture and figure out whether you too should consider such an approach to save costs and reduce processing time.
Speakers: Opher Dubrovsky and Ilai Malka