Jianneng Li is a Software Development Engineer at Workday (and previously with Platfora, acquired by Workday in late 2016). He works on Prism Analytics, an end-to-end data analytics product, part of the Workday ecosystem, which helps businesses better understand their financials and HR data. Being a part of Spark team in Analytics org, Jianneng specializes in distributed systems and data processing. He enjoys diving into Spark internals and published several blog posts about Apache Spark and analytics. Jianneng holds a Master’s degree in EECS from UC Berkeley and a Bachelor’s degree in CS from Cornell University.
May 28, 2021 11:40 AM PT
For more than 6 years, Workday has been building various analytics products powered by Apache Spark. At the core of each product offering, customers use our UI to create data prep pipelines, which are then compiled to DataFrames and executed by Spark under the hood. As we built out our products, however, we started to notice places where vanilla Spark is not suitable for our workloads. For example, because our Spark plans are programmatically generated, they tend to be very complex, and often result in tens of thousands of operators. Another common issue is having case statements with thousands of branches, or worse, nested expressions containing such case statements.
With the right combination of these traits, the final DataFrame can easily take Catalyst hours to compile and optimize - that is, if it doesn’t first cause the driver JVM to run out of memory.
In this talk, we discuss how we addressed some of our pain points regarding complex pipelines. Topics covered include memory-efficient plan logging, using common subexpression elimination to remove redundant subplans, rewriting Spark’s constraint propagation mechanism to avoid exponential growth of filter constraints, as well as other performance enhancements made to Catalyst rules.
We then apply these changes to several production pipelines, showcasing the reduction of time spent in Catalyst, and list out ideas for further improvements. Finally, we share tips on how you too can better handle complex Spark plans.
June 24, 2020 05:00 PM PT
Broadcast join is an important part of Spark SQL's execution engine. When used, it performs a join on two relations by first broadcasting the smaller one to all Spark executors, then evaluating the join criteria with each executor's partitions of the other relation. When the broadcasted relation is small enough, broadcast joins are fast, as they require minimal data shuffling. Above a certain threshold however, broadcast joins tend to be less reliable or performant than shuffle-based join algorithms, due to bottlenecks in network and memory usage. This talk shares the improvements Workday has made to increase the threshold of relation size under which broadcast joins in Spark are practical. We cover topics such as executor-side broadcast, where data is broadcasted directly between executors instead of being first collected to the driver, as well as the changes we made in Spark's whole-stage code generator to accommodate the increased threshold for broadcasting. Specifically, we highlight how we limited the memory usage of broadcast joins in executors, so that we can increase the threshold without increasing executor memory. Furthermore, we present our findings from running these improvements in production, on large scale jobs compiled from complex ETL pipelines. From this session, the audience will be able to peek into internals of Spark's join infrastructure, and take advantage of our learnings to better tune their own workloads.
April 23, 2019 05:00 PM PT
In this talk, we will share how we benefited from using Apache Spark to build Workday's new analytics product, as well as some of the challenges we faced along the way. Workday Prism Analytics was launched in September 2017, and went from zero to one hundred enterprise customers in under 15 months. Leveraging innovative technologies from Platfora acquisition gave us a jump-start, but it still required a considerable engineering effort to integrate with Workday ecosystem. We enhanced workflows, added new functionalities and transformed Hadoop-based on-premises engines to run on Workday cloud. All of this would not have been possible without Spark, to which we migrated most of earlier MapReduce code. This enabled us to shorten time to market while adding advanced functionality with high performance and rock-solid reliability. One of the key components of our product is Self-Service Data Prep. Powerful and intuitive UI empower users to create ETL-like pipelines, blending Workday and external data, while providing immediate feedback by re-executing the pipelines on sampled data. Behind the scenes, we compile these pipelines into plans to be executed by Spark SQL, taking advantage of the years of work done by the open source community to improve the engine's query optimizer and physical execution.
We will outline the high-level implementation of product features, mapping logical models and sub-systems, adding new data types on top of Spark, and using caches effectively and securely, in multiple Spark clusters running under YARN, while sharing HDFS resources. We will also describe several real-life war stories, caused by customers stretching the product boundaries in complexity and performance. We conclude with the unique Spark tuning guidelines distilled from our experience of running it in production, in order to ensure that the system is able to execute complex, nested pipelines with multiple self-joins and self-unions.