Tejas Patil is a Spark Committer and Tech Lead in the Spark team at Facebook. For past 7 years, he has worked on several projects related to building large scale distributed data processing systems responsible for handling Facebook’s batch workloads. He is currently a PMC member and committer of Apache Nutch and has contributed to several open source projects. Tejas obtained a Master’s Degree in Computer Science from University Of California, Irvine.
May 27, 2021 11:35 AM PT
While most query engines come with a rich set of functions, it does not cover all the needs of users. In such cases, user defined functions (UDFs) allow users to express their business logic and use it in their queries. It is common for users to use more than one compute engine for solving their data problems. At Facebook, we provide multiple systems for users to solve their data problems : adhoc, batch, streaming / real-time. Users end up picking a system based off of their needs and problems at hand. Every system typically has its own way of allowing users to create UDFs. If a UDF was defined in one system, sooner or later there would be a need to have similar UDF in the other ones as well. This leads to users having to re-write the same UDFs multiple times to target for each system they want to use it in.
In this talk, we’ll take a deep dive in the Portable UDF. Portable UDF is our way of allowing users to write a function once in an engine agnostic way and use it across several compute engines. We’ll present the motivation, design and current state of Portable UDF project.
June 5, 2017 05:00 PM PT
Bucketing is a partitioning technique that can improve performance in certain data transformations by avoiding data shuffling and sorting. The general idea of bucketing is to partition, and optionally sort, the data based on a subset of columns while it is written out (a one-time cost), while making successive reads of the data more performant for downstream jobs if the SQL operators can make use of this property. Bucketing can enable faster joins (i.e. single stage sort merge join), the ability to short circuit in FILTER operation if the file is pre-sorted over the column in a filter predicate, and it supports quick data sampling.
In this session, you'll learn how bucketing is implemented in both Hive and Spark. In particular, Patil will describe the changes in the Catalyst optimizer that enable these optimizations in Spark for various bucketing scenarios. Facebook's performance tests have shown bucketing to improve Spark performance from 3-5x faster when the optimization is enabled. Many tables at Facebook are sorted and bucketed, and migrating these workloads to Spark have resulted in a 2-3x savings when compared to Hive. You'll also hear about real-world applications of bucketing, like loading of cumulative tables with daily delta, and the characteristics that can help identify suitable candidate jobs that can benefit from bucketing.
Session hashtag: #SFdev10