Improving Interactive Querying Experience on Spark SQL
- Data Lakes, Data Warehouses and Data Lakehouses
- Moscone South | Level 2 | 202
- 35 min
Pinterest is a data-driven company, and interactive querying on hundreds of petabytes of data is a common and important function there. Interactive querying has different requirements and challenges from batch querying. In this talk, we will cover the architectural alternatives one can choose from to perform interactive querying with Spark SQL. By discussing the trade-offs of each architecture against the requirements of interactive querying, we will explain the reasoning behind our design choice. We will also share the enhancements we made to open source projects, including Apache Spark, Apache Livy and Dr. Elephant, along with the in-house technologies we built to improve the interactive querying experience at Pinterest. These enhancements include DDL query speedups, Spark session caching, Spark session sharing, improved Apache YARN diagnostic messages, query failure handling and query tuning recommendations. Finally, we will discuss some challenges we faced along the way and the future improvements we are working on.
After attending this session, you will have a better grasp of the different architectures that support interactive querying with Spark SQL on multi-tenant clusters, and you will be able to use the discussion from this talk to choose among them. You will also be able to apply the enhancements we share to improve the interactive querying experience for your users.