Spark Structured Streaming in Apache Spark 2.2 comes with quite a few unique Catalyst operators, most notably the stateful streaming operators, as well as three different output modes. Understanding how Spark Structured Streaming manages intermediate state between triggers, and how that state affects performance, is paramount. After all, you use Apache Spark to process huge amounts of data, which alone can be tricky to get right, and Spark Structured Streaming adds a streaming factor on top: for a given structured query, state management can make the data to handle even bigger.
This deep-dive talk is going to show you what is included in the execution diagrams, the logical and physical plans, and the metrics on the Details for Query page of the SQL tab in the web UI. The talk will also explain the other parts of the SQL tab and the subpages with details for streaming queries.
The talk is going to answer the following questions:
– What do the blue boxes represent on the Details for Query page in the SQL tab?
– What does the black popup window tell me when I hover over a blue box on the Details for Query page in the SQL tab?
– What is under the Details section at the bottom of the Details for Query page in the SQL tab?
– Why does a single streaming query execute many queries, as shown in the SQL tab?
– What are the Spark jobs on the Spark Jobs page in the Jobs tab?
– Why would a single query execution lead to zero or more Spark jobs? How does the translation happen?
– Why are there shuffles/exchanges in the execution plan of a streaming aggregation query?
...and more!
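As a rough intuition for two of the questions above (why a single streaming query shows up as many query executions, and why state carried between triggers can grow), here is a toy model in plain Python. It is not Spark code and calls no Spark API: the function `run_micro_batches` and its `output_mode` parameter are hypothetical names made up for this sketch, and it models only a running count per key under the complete and update output modes, with no watermarks, shuffles, or state store.

```python
from collections import Counter


def run_micro_batches(batches, output_mode="update"):
    """Toy model of a streaming group-by count (NOT a Spark API).

    Each element of `batches` stands for the rows that arrive in one trigger.
    Every trigger is handled as its own small batch computation that reads and
    updates the state left behind by the previous trigger.
    """
    state = Counter()  # intermediate state kept between triggers
    emitted = []       # what each trigger would hand to the sink

    for rows in batches:
        partial = Counter(rows)  # counts for this micro-batch only
        state.update(partial)    # merge them into the long-lived state

        if output_mode == "complete":
            # complete mode: the whole result table on every trigger
            emitted.append(dict(state))
        elif output_mode == "update":
            # update mode: only the keys touched in this trigger
            emitted.append({key: state[key] for key in partial})
        else:
            raise ValueError("this sketch models only 'complete' and 'update'")

    return emitted, dict(state)


batches = [["a", "b", "a"], ["b", "c"]]
update_out, final_state = run_micro_batches(batches, "update")
complete_out, _ = run_micro_batches(batches, "complete")
print(update_out)
print(complete_out)
print(final_state)
```

In this sketch the state holds one entry per distinct key ever seen, so it keeps growing as new keys arrive even when each individual micro-batch is small, which is the kind of effect the talk sets out to make visible in the web UI.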
Jacek is an independent consultant who offers development and training services for Apache Spark (and Scala and sbt, with a bit of Hadoop YARN, Apache Kafka, Apache Hive, Apache Mesos, Akka Actors/Stream/HTTP, and Docker). He leads the Warsaw Scala Enthusiasts and Warsaw Spark meetups. His latest project is to gain an in-depth understanding of Apache Spark at https://jaceklaskowski.gitbooks.io/mastering-apache-spark/.