DB Tsai

Open Source and Big Data, Apple

DB Tsai is an Apache Spark PMC / Committer and an open source and big data engineer at Apple. He implemented several algorithms including linear models with Elastic-Net (L1/L2) regularization using LBFGS/OWL-QN optimizers in Apache Spark. Prior to joining Apple, DB worked on Personalized Recommendation ML Algorithms at Netflix. DB was a Ph.D. candidate in Applied Physics at Stanford University. He holds a Master’s degree in Electrical Engineering from Stanford.

Past sessions

Summit 2020 Native Support of Prometheus Monitoring in Apache Spark 3.0

June 23, 2020 05:00 PM PT

Every production environment requires monitoring and alerting. Apache Spark has a configurable metrics system that lets users report Spark metrics to a variety of sinks. Prometheus is a popular open-source monitoring and alerting toolkit that is often used together with Apache Spark. Previously, users could

  1. combine the Prometheus JMX exporter with the Apache Spark JMXSink
  2. use 3rd party libraries
  3. implement a custom Sink for more complex metrics like GPU resource usage

Apache Spark 3.0.0 will add another easy way to support Prometheus for general use cases. In this talk, we will cover the following topics and show a demo.

  1. How to enable new Prometheus features.
  2. What kind of metrics are available.
  3. General tips for monitoring and alerting on structured streaming jobs. (Spark side / Prometheus side)

Currently, Apache Spark exposes metrics at the Master, Worker, Driver, and Executor levels so that it can integrate with an existing Prometheus server with less effort. This is already available in Apache Spark 3.0.0-preview and preview2, so you can try it right now.
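As a rough illustration of what enabling these features can look like, the sketch below renders the sink entries and the UI flag as they appear in the Spark 3.0 metrics template and monitoring documentation. The helper names (`render_properties`, `spark_submit_flags`) are hypothetical and exist only for this example; check the keys and paths against the documentation for your Spark version before using them.

```python
# Sink entries as listed in Spark 3.0's conf/metrics.properties.template.
# Treat them as assumptions to verify for the Spark version you run.
PROMETHEUS_SINK = {
    "*.sink.prometheusServlet.class": "org.apache.spark.metrics.sink.PrometheusServlet",
    "*.sink.prometheusServlet.path": "/metrics/prometheus",
    "master.sink.prometheusServlet.path": "/metrics/master/prometheus",
    "applications.sink.prometheusServlet.path": "/metrics/applications/prometheus",
}

# Spark configuration flag that exposes executor metrics through the driver UI.
SPARK_CONF = {"spark.ui.prometheus.enabled": "true"}


def render_properties(settings):
    """Render a dict as the text of a Java-style .properties file."""
    return "\n".join(f"{key}={value}" for key, value in settings.items()) + "\n"


def spark_submit_flags(conf):
    """Turn a dict of Spark settings into a list of spark-submit arguments."""
    flags = []
    for key, value in conf.items():
        flags.extend(["--conf", f"{key}={value}"])
    return flags


if __name__ == "__main__":
    print(render_properties(PROMETHEUS_SINK), end="")
    print(" ".join(spark_submit_flags(SPARK_CONF)))
```

The rendered text would go into `conf/metrics.properties`, and the flags would be passed to `spark-submit`; a Prometheus server can then scrape the configured paths on the Spark UI ports.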

Summit 2019 Making Nested Columns as First Citizen in Apache Spark SQL

April 23, 2019 05:00 PM PT

Apple Siri is the world's largest virtual assistant service powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod. We use large amounts of data to provide our users the best possible personalized experience. Our raw event data is cleaned and pre-joined into a unified dataset for our data consumers to use. To keep the rich hierarchical structure of the data, our data schemas are very deeply nested structures. In this talk, we will discuss how Spark handles nested structures in Spark 2.4, and we'll show the fundamental design issues in reading nested fields, which were not well considered when Spark SQL was designed. This results in Spark SQL reading unnecessary data in many operations. Given that Siri's data is deeply nested and humongous, this soon becomes a bottleneck in our pipelines. Then we will talk about the various approaches we have taken to tackle this problem. By making nested columns first-class citizens in Spark SQL, we can achieve dramatic performance gains. In some of our production queries, the speed-up can be 20x in wall clock time with 8x less data being read. All of our work will be open source, and some has already been merged upstream.

Summit 2019 Bridging the Gap Between Datasets and DataFrames

April 24, 2019 05:00 PM PT

Apple leverages Apache Spark for processing large datasets to power key components of Apple's production services. The majority of users rely on Spark SQL to benefit from state-of-the-art optimizations in Catalyst and Tungsten. As there are multiple APIs to interact with Spark SQL, users have to decide carefully which one to pick. While DataFrames and SQL are widely used, they lack type safety, so analysis errors such as invalid column names or types are not detected at compile time. Also, the ability to apply the same functional constructs as on RDDs is missing in DataFrames. Datasets expose a type-safe API and support user-defined closures at the cost of performance.

This talk will explain cases when Spark SQL cannot optimize typed Datasets as much as it can optimize DataFrames. We will also present an effort to use bytecode analysis to convert user-defined closures into native Catalyst expressions. This helps Spark to avoid the expensive conversion between the internal format and JVM objects as well as to leverage more Catalyst optimizations. As a consequence, we can bridge the gap in performance between Datasets and DataFrames, so that users do not have to sacrifice the benefits of Datasets for performance reasons.

Summit Europe 2018 Pitfalls of Apache Spark at Scale

October 7, 2022 03:05 AM PT

Apple Siri is the world’s largest virtual assistant service powering every iPhone, iPad, Mac, Apple TV, Apple Watch, and HomePod. A.I. and machine learning are used to personalize your experience throughout your day. The more you use it, the more helpful it can be. At this scale, the amount of information we process is humongous. We use Apache Spark to get the job done.

In this talk, we will discuss the architecture of Siri data pipelines, and in particular, how Apache Spark is used to aggregate the data coming from different data centers globally into one source of truth for analytical use cases, ML model building, and productizing. We will talk about the specific techniques we use at Siri to scale, and various pitfalls we have found along the way. As part of the OSS community, we contributed back many features and bug fixes during the process; as a result, all Spark users can benefit from significant run-time improvements and resource savings.

Session hashtag: #SAISML1

Notebooks are a widely used tool for data scientists to analyze data, find insights, and build learning models. However, there is a gap in bringing a notebook into a production pipeline. How do we streamline the process of deploying the notebook into production? If we find something that needs to be improved in production, how do we shorten the cycles? How do we make sure there is no discrepancy between the online feature generation used for the online service and the features generated offline for model training?

Nonlinear methods are widely used to produce higher performance compared with linear methods; however, nonlinear methods are generally more expensive in model size, training time, and scoring phase. With proper feature engineering techniques like polynomial expansion, the linear methods can be as competitive as those nonlinear methods. In the process of mapping the data to higher dimensional space, the linear methods will be subject to overfitting and instability of coefficients which can be addressed by penalization methods including Lasso and Elastic-Net. Finally, we’ll show how to train linear models with Elastic-Net regularization using MLlib.
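To make the two ingredients named above concrete, here is a minimal pure-Python sketch of polynomial expansion and of the Elastic-Net penalty, written in the form MLlib documents for its linear models: regParam * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2). The function names are illustrative and are not part of MLlib, and the ordering of the expanded terms may differ from what Spark's own polynomial expansion produces.

```python
from itertools import combinations_with_replacement


def polynomial_expansion(features, degree=2):
    """Return all monomials of the input features with total degree 1..degree."""
    expanded = []
    for d in range(1, degree + 1):
        for indices in combinations_with_replacement(range(len(features)), d):
            term = 1.0
            for i in indices:
                term *= features[i]
            expanded.append(term)
    return expanded


def elastic_net_penalty(weights, reg_param, alpha):
    """Elastic-Net penalty: alpha=1.0 gives Lasso (L1), alpha=0.0 gives ridge (L2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)


if __name__ == "__main__":
    # Two input features become five after a degree-2 expansion.
    print(polynomial_expansion([2.0, 3.0], degree=2))
    # The same weights penalized as pure L1, pure L2, and an even mix.
    for alpha in (1.0, 0.0, 0.5):
        print(alpha, elastic_net_penalty([1.0, -2.0], reg_param=0.1, alpha=alpha))
```

In PySpark, the corresponding knobs are the `regParam` and `elasticNetParam` parameters of estimators such as `pyspark.ml.regression.LinearRegression`.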

Summit East 2016 Distributed Time Travel for Feature Generation

February 16, 2016 04:00 PM PT

Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create features is critical for machine learning projects to be successful. At Netflix, we spend significant time and effort experimenting with new features and new ways of building models. This involves generating features for our members from different regions over multiple days. To enable this, we built a time machine using Apache Spark that computes features for any arbitrary time in the recent past. The first step of building this time machine is to snapshot the data from various micro services on a regular basis. We built a general purpose workflow orchestration and scheduling framework optimized for machine learning pipelines and used it to run the snapshot and model training workflows. Snapshot data is then consumed by feature encoders to compute various features for offline experimentation and model training. Crucially, the same feature encoders are used in both offline model building and online scoring for production or A/B tests. Building this time machine helped us try new ideas quickly without placing stress on production services and without having to wait for data accumulation of the newly-implemented features. Moreover, building it with Apache Spark empowered us to both scale up the data size by an order of magnitude and train and validate the models in less time. Finally, using Apache Zeppelin notebooks, we are able to interactively prototype features and run experiments.

Summit East 2017 Netflix’s Recommendation ML Pipeline Using Apache Spark

February 7, 2017 04:00 PM PT

Netflix is the world’s largest streaming service, with 80 million members in over 250 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the boxart you see, to the decisions made about which TV shows and movies are created. Given this scale, we utilized Apache Spark to be the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API – for ETL, feature generation, model training, and validation. With the pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators – enabling modularity, composability and testability. Thus, Netflix engineers can build our own feature engineering logic as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production.

In this talk, we will discuss how Apache Spark is used as a distributed framework we build our own algorithms on top of to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we’ve found along the way.
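The Transformer / Estimator / Evaluator split described above can be sketched in plain Python without Spark. Every class below is a hypothetical stand-in written for illustration; it mirrors the shape of the Spark ML abstractions (an Estimator's `fit` returns a Transformer, and a pipeline chains stages) but is not the Spark API and not Netflix's implementation.

```python
class Scale:
    """Transformer: a stateless step that multiplies every value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def transform(self, rows):
        return [r * self.factor for r in rows]


class MeanCenterModel:
    """Transformer produced by fitting MeanCenterer; subtracts the learned mean."""

    def __init__(self, mean):
        self.mean = mean

    def transform(self, rows):
        return [r - self.mean for r in rows]


class MeanCenterer:
    """Estimator: learns a statistic from data and returns a fitted Transformer."""

    def fit(self, rows):
        return MeanCenterModel(sum(rows) / len(rows))


class PipelineModel:
    """A fitted pipeline: a chain of Transformers applied in order."""

    def __init__(self, transformers):
        self.transformers = transformers

    def transform(self, rows):
        for transformer in self.transformers:
            rows = transformer.transform(rows)
        return rows


class Pipeline:
    """Fits each Estimator on the output of the stages that precede it."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            model = stage.fit(rows) if hasattr(stage, "fit") else stage
            rows = model.transform(rows)
            fitted.append(model)
        return PipelineModel(fitted)


class MaxAbsEvaluator:
    """Evaluator: reduces transformed output to a single metric."""

    def evaluate(self, rows):
        return max(abs(r) for r in rows)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0]
    model = Pipeline([Scale(2.0), MeanCenterer()]).fit(data)
    print(model.transform(data))
    print(MaxAbsEvaluator().evaluate(model.transform(data)))
```

Because each stage only exposes `fit` or `transform`, stages can be unit-tested in isolation and swapped without touching the rest of the pipeline, which is the modularity the abstract refers to.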

Learn more:

  • ML Pipelines: A New High-Level API for MLlib
  • Audience Modeling With Apache Spark ML Pipelines

Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes. At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, and validation, are based on the Spark ML PipelineStage framework. While this gives developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can potentially provide significant performance gains.

In this talk, we discuss how our machine learning pipeline based on Spark has been improved over the years. Techniques such as predicate pushdown and wide transformation minimization have led to significant run time improvement and resource savings.

Session hashtag: #SFexp9

Summit Europe 2017 VEGAS: The Missing Matplotlib for Scala/Apache Spark

October 24, 2017 05:00 PM PT

In this talk, we'll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix's famous recommender systems that are used to personalize the Netflix experience for its 99 million members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the "missing MatPlotLib" for Spark/Scala. We'll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models.

Session hashtag: #EUds0