Itai Yaffe is a Principal Solutions Architect at Imply. Prior to Imply, Itai was a big data tech lead at Nielsen Identity, where he dealt with big data challenges using tools like Spark, Druid, Kafka, and others. He is also a part of the Israeli chapter’s core team of Women in Big Data. Itai is keen about sharing his knowledge and has presented his real-life experience in various forums in the past.
May 26, 2021 12:05 PM PT
Every day, millions of advertising campaigns are happening around the world.
As campaign owners, measuring the ongoing campaign effectiveness (e.g “how many distinct users saw my online ad VS how many distinct users saw my online ad, clicked it and purchased my product?”) is super important.
However, this task (often referred to as “funnel analysis”) is not an easy task, especially if the chronological order of events matters.
One way to mitigate this challenge is combining Apache Druid and Apache DataSketches, to provide fast analytics on large volumes of data.
However, while that combination can answer some of these questions, it still can’t answer the question "how many distinct users viewed the brand’s homepage FIRST and THEN viewed product X page?"
In this talk, we will discuss how we combine Spark, Druid and DataSketches to answer such questions at scale.
November 18, 2020 04:00 PM PT
At Nielsen Identity, we use Apache Spark to process 10's of TBs of data, running on AWS EMR. We started at a point where Spark was not even supported out-of-the-box by EMR, and today we're spinning-up clusters with 1000's of nodes on a daily basis, orchestrated by Airflow. A few months ago, we embarked on a journey to evaluate the option of using Kubernetes as our Spark infrastructure, mainly to reduce operational costs and improve stability (as we heavily rely on Spot Instances for our clusters). To allow us to achieve those goals, we combined the open-sourced GCP Spark-on-K8s operator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator) with a native Airflow integration we developed and recently contributed back to the Airflow project (https://issues.apache.org/jira/browse/AIRFLOW-6542). Finally, we were able to migrate our existing Airflow DAGs, with minimal changes, from AWS EMR to K8s.
In this talk, we'll guide you through migrating Spark workloads to K8s, including:
* Challenges with existing Spark infrastructure and the motivation to migrate to K8s
* Aspects of running Spark natively on K8s (e.g monitoring, logging, etc.)
* Best practices for using Airflow as the orchestrator
Speakers: Itai Yaffe and Roi Teveth
October 16, 2019 05:00 PM PT
At NMC (Nielsen Marketing Cloud) we provide our customers (marketers and publishers) real-time analytics tools to profile their target audiences. To achieve that, we need to ingest billions of events per day into our big data stores, and we need to do it in a scalable yet cost-efficient manner.
In this session, we will discuss how we continuously transform our data infrastructure to support these goals. Specifically, we will review how we went from CSV files and standalone Java applications all the way to multiple Kafka and Spark clusters, performing a mixture of Streaming and Batch ETLs, and supporting 10x data growth We will share our experience as early-adopters of Spark Streaming and Spark Structured Streaming, and how we overcame technical barriers (and there were plenty). We will present a rather unique solution of using Kafka to imitate streaming over our Data Lake, while significantly reducing our cloud services' costs. Topics include: