Felix Cheung

VP of Engineering, SafeGraph

Felix is the VP of Engineering at SafeGraph, bringing over 20 years of engineering and 7 years of data experience. He led teams in Uber’s Data Platform and was pivotal in rebuilding their open-source program. Previously he spent time at Microsoft and startups. Felix is a strong proponent of open source; as a Member of the Apache Software Foundation, he works on Apache Spark (data) and Apache Zeppelin (notebook). He also helps mentor 6 projects in the Apache Incubator, including the geospatial project Apache Sedona, and is leading Apache Superset (visualization) to graduation.

Past sessions

SafeGraph is a data company, just a data company, that aims to be the source of truth for data on physical places. We are focused on creating high-precision geospatial data sets specifically about places where people spend time and money. We have business listings, building footprint data, and foot traffic insights for over 7 million places across multiple countries and regions.

In this talk, we will examine the challenges of geospatial processing at large scale. We will look at open-source frameworks like Apache Sedona (incubating) and its key improvements over conventional technology, including spatial indexing and partitioning. We will explore spatial data structures, data formats, and open-source indexing systems like H3. We will illustrate how all of these fit together in a cloud-first architecture running on Databricks, Delta, MLflow, and AWS. We will explore examples of geospatial analysis with complex geometries and practical use cases of spatial queries. Lastly, we will discuss how this is augmented by machine learning modeling, human-in-the-loop (HITL) annotation, and quality validation.

In this session watch:
Felix Cheung, VP of Engineering, SafeGraph


Summit 2018 Birds of a Feather Session: Apache Spark on Kubernetes

February 8, 2023 10:56 AM PT

Come learn about Apache Spark's Kubernetes scheduler backend, new in Spark 2.3! Meet project contributors and network with community members interested in running Spark on Kubernetes. Learn about upcoming Spark features for Kubernetes support, and find out how to contribute to the project. Discover new tools in the Spark on Kubernetes ecosystem, and trade tips on how to run Spark jobs on your Kubernetes cluster.

In this talk, we will explore how Uber enables rapid experimentation with machine learning models and optimization algorithms through Uber's Data Science Workbench (DSW). DSW covers a series of stages in data scientists' workflow, including data exploration, feature engineering, machine learning model training, testing, and production deployment. DSW provides interactive notebooks for multiple languages with on-demand resource allocation, and lets users share their work through community features.

It also supports notebooks and intelligent applications backed by Spark job servers. Deep learning applications based on TensorFlow and Torch can be brought into DSW smoothly, with resource management taken care of by the system. The DSW environment is customizable: users can bring their own libraries and frameworks. Moreover, DSW provides support for Shiny and Python dashboards as well as many other in-house visualization and mapping tools.

In the second part of this talk, we will explore the use cases where custom machine learning models developed in DSW are productionized within the platform. Uber applies machine learning extensively to solve some hard problems. Some use cases include calculating the right prices for rides in over 600 cities and applying NLP technologies to customer feedback to offer safe rides and reduce support costs. We will look at various options evaluated for productionizing custom models (server-based and serverless). We will also look at how DSW integrates into Uber's larger ML ecosystem, e.g. model/feature stores and other ML tools, to realize the vision of a complete ML platform for Uber.

Session hashtag: #MLSAIS11

Summit East 2017 Scalable Data Science with SparkR

February 8, 2017 04:00 PM PT

R is a very popular platform for Data Science. Apache Spark is a highly scalable data platform. How could we have the best of both worlds? How could a Data Scientist leverage the rich set of 9000+ packages on CRAN and integrate Spark into their existing Data Science toolset? In this talk we will walk through many examples of how several new features in Apache Spark 2.x enable this. We will also look at exciting changes in current and upcoming Apache Spark 2.x releases.

Summit 2017 SSR: Structured Streaming on R for Machine Learning

June 5, 2017 05:00 PM PT

Stepping beyond ETL in batches, large enterprises are looking at ways to generate more up-to-date insights. As we step into the age of Continuous Applications, this session will explore the ever more popular Structured Streaming API in Apache Spark, its application to R, and examples of machine learning use cases. Starting with an introduction to the high-level concepts, the session will dive into the core of the execution plan internals and examine how SparkR extends the existing system to add the streaming capability. Learn how to build various data science applications on data streams, integrating with R packages to leverage the rich R ecosystem of 10k+ packages. Session hashtag: #SFdev2