Garren Staubli

Solutions Architect, Databricks

Garren is a Solutions Architect at Databricks. He has specialized in big data for 7 years and Apache Spark for the past 4 years. Garren created Structured Streaming and Spark ML production applications to do real-time decision making, built a robust real-time big data science and reporting solution (30B+ records aggregated in < 1 second), and architected the core IP data assets for a B2B marketing company. His interests include enabling data scientists and engineers to use big data at scale to solve vexing problems. Garren has a BBA in Management Information Systems from Washington State University.

Past sessions

Summit 2019 Databricks + Snowflake: Catalyzing Data and AI Initiatives

April 24, 2019 05:00 PM PT

Combining Databricks, the unified analytics platform, with Snowflake, the data warehouse built for the cloud, is a powerful combo.

Databricks can reliably process large amounts of data, supports the development of scalable AI projects, and provides robust data warehousing capabilities. Snowflake offers the elasticity of a cloud-based data warehouse that centralizes access to data. Databricks is built on a mature, AI-enabled distributed big data processing tool and can integrate with nearly every technology, from message queues (e.g. Kafka) to databases (e.g. Snowflake) to object stores (e.g. S3) and AI tools (e.g. TensorFlow).

Key Takeaways:
- How Databricks & Snowflake work
- Why they're so powerful
- How Databricks + Snowflake symbiotically catalyze analytics and AI initiatives

Python is the de facto language of data science and engineering, which affords it an outsized community of users. However, when data scientists and engineers come to Spark with a Python background, unexpected performance potholes can stand in the way of progress. These "performance potholes" include leaning on PySpark's ease of integration with existing packages (e.g. pandas, SciPy, scikit-learn), using Python UDFs, and using the RDD APIs instead of Spark SQL DataFrames, all without understanding the implications. Additionally, Spark 2.3 changes the game even further with vectorized UDFs. In this talk, we will discuss:

- How PySpark works broadly (& why it matters)
- Integrating popular Python packages with Spark
- Python UDFs (how to [not] use them)
- RDDs vs Spark SQL DataFrames
- Spark 2.3 Vectorized UDFs

Session hashtag: #Py9SAIS