Tathagata Das is an Apache Spark committer and a member of the PMC. He’s the lead developer behind Spark Streaming and currently develops Structured Streaming. Previously, he was a grad student in the AMPLab at UC Berkeley, where he conducted research on data-center frameworks and networks with Scott Shenker and Ion Stoica.
October 15, 2019 05:00 PM PT
Most data practitioners grapple with data reliability issues—it's the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with the Spark APIs. It provides high data reliability and query performance to support big data use cases, from batch and streaming ingest and fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance, and how Delta Lake can help. Through presentations, code examples, and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
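To make the storage-layer idea concrete, here is a minimal Scala sketch of writing and reading a Delta table through the ordinary Spark DataFrame API. It assumes a Spark session with the Delta Lake package on the classpath; the path and column name are illustrative, not part of the tutorial materials.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("delta-sketch").getOrCreate()

    // Append a batch of rows as a Delta table; the commit is atomic (ACID),
    // so concurrent readers never observe partially written files.
    spark.range(0, 1000).toDF("event_id")
      .write.format("delta").mode("append").save("/tmp/events_delta")

    // Read it back with the same DataFrame API used for Parquet or JSON.
    val events = spark.read.format("delta").load("/tmp/events_delta")
    println(events.count())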
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
What you’ll learn:
Prerequisites
October 16, 2019 05:00 PM PT
Most data practitioners grapple with data reliability issues—it's the bane of their existence. Data engineers, in particular, strive to design, deploy, and serve reliable data in a performant manner so that their organizations can make the most of their valuable corporate data assets.
Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark™ and big data workloads. Built on open standards, Delta Lake employs co-designed compute and storage and is compatible with the Spark APIs. It provides high data reliability and query performance to support big data use cases, from batch and streaming ingest and fast interactive queries to machine learning. In this tutorial we will discuss the requirements of modern data engineering, the challenges data engineers face when it comes to data reliability and performance, and how Delta Lake can help. Through presentations, code examples, and notebooks, we will explain these challenges and the use of Delta Lake to address them. You will walk away with an understanding of how you can apply this innovation to your data architecture and the benefits you can gain.
This tutorial will be both an instructor-led and a hands-on interactive session. Instructions on how to get the tutorial materials will be covered in class.
What you’ll learn:
Prerequisites
October 15, 2019 05:00 PM PT
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. Delta Lake, on the other hand, is the best way to store structured data because it is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. Together, these can make it very easy to build pipelines in many common scenarios. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved. Apache Spark, being a unified analytics engine doing both batch and stream processing, often provides multiple ways to solve the same problem. So understanding the requirements carefully helps you architect a pipeline that solves your business needs in the most resource-efficient manner.
In this talk, I am going to examine a number of common streaming design patterns in the context of the following questions.
Clarity in understanding the 'what and why' of any problem automatically brings much clarity on 'how' to architect it using Structured Streaming and, in many cases, Delta Lake.
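As one illustration of the pipelines discussed above, the following Scala sketch streams JSON files into a Delta table. The paths, schema, and trigger interval are illustrative assumptions, not the notebook used in the talk.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("stream-to-delta").getOrCreate()

    // File source: an explicit schema is required for streaming file reads.
    val input = spark.readStream
      .schema("device STRING, ts TIMESTAMP, value DOUBLE")
      .json("/tmp/landing/")

    // Continuously append into a Delta table with exactly-once guarantees.
    input.writeStream
      .format("delta")
      .option("checkpointLocation", "/tmp/checkpoints/stream-to-delta")
      .trigger(Trigger.ProcessingTime("1 minute"))   // or Trigger.Once() for a batch-like run
      .start("/tmp/bronze/events")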
April 24, 2019 05:00 PM PT
Structured Streaming has proven to be the best platform for building distributed stream processing applications. Its unified SQL/Dataset/DataFrame APIs and Spark's built-in functions make it easy for developers to express complex computations. However, expressing the business logic is only part of the larger problem of building end-to-end streaming pipelines that interact with a complex ecosystem of storage systems and workloads. It is important for the developer to truly understand the business problem that needs to be solved.
These are the questions that we ask every customer in order to help them design their pipeline. In this talk, I am going to go through the decision tree of designing the right architecture for solving your problem.
October 2, 2018 05:00 PM PT
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which make all the magic possible. In this talk, I will dive deep into different stateful operations (streaming aggregations, deduplication and joins) and how they work under the hood in the Structured Streaming engine.
Session hashtag: #SAISDD3
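For reference, here is a minimal Scala sketch of the implicitly stateful operations mentioned above (watermarked deduplication and a windowed streaming aggregation), using Spark's built-in rate source; the key derivation and window sizes are illustrative assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("stateful-sketch").getOrCreate()
    import spark.implicits._

    // The rate source emits (timestamp, value) rows, handy for experiments.
    val events = spark.readStream.format("rate").load()
      .withColumn("key", $"value" % 10)

    val counts = events
      .withWatermark("timestamp", "10 minutes")     // bounds how long state is retained
      .dropDuplicates("key", "timestamp")           // stateful deduplication
      .groupBy(window($"timestamp", "5 minutes"), $"key")
      .count()                                      // stateful streaming aggregation

    counts.writeStream.outputMode("update").format("console").start()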
June 4, 2018 05:00 PM PT
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming.
In particular, I'm going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
Session hashtag: #DD1SAIS
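A minimal Scala sketch of the explicit state management referenced in the bullets above, using mapGroupsWithState to keep a running count per key; the Event and RunningCount case classes and the rate-source input are illustrative assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.GroupState

    case class Event(user: String, action: String)
    case class RunningCount(user: String, count: Long)

    val spark = SparkSession.builder.appName("explicit-state").getOrCreate()
    import spark.implicits._

    // Update function: the Long state is persisted per key in the State Store.
    def updateCount(user: String, events: Iterator[Event],
                    state: GroupState[Long]): RunningCount = {
      val newCount = state.getOption.getOrElse(0L) + events.size
      state.update(newCount)
      RunningCount(user, newCount)
    }

    val events = spark.readStream.format("rate").load()
      .selectExpr("CAST(value % 5 AS STRING) AS user", "'click' AS action")
      .as[Event]

    events.groupByKey(_.user)
      .mapGroupsWithState(updateCount _)
      .writeStream.outputMode("update").format("console").start()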
March 17, 2015 05:00 PM PT
Spark Streaming extends the core Apache Spark API to perform large-scale stream processing, which is revolutionizing the way Big “Streaming” Data applications are being written. It is being rapidly adopted by companies spread across various business verticals – ad and social network monitoring, real-time analysis of machine data, fraud and anomaly detections, etc. These companies are mainly adopting Spark Streaming because:
• Its simple, declarative batch-like API makes large-scale stream processing accessible to non-scientists.
• Its unified API and a single processing engine (i.e. the Spark core engine) allow a single cluster and a single set of operational processes to cover the full spectrum of use cases – batch, interactive and stream processing.
• Its stronger, exactly-once semantics make it easier to express and debug complex business logic.
In this talk, I am going to elaborate on such adoption stories, highlighting interesting use cases of Spark Streaming in the wild. In addition, I am also going to talk about (and perhaps also demonstrate) the exciting new developments in Spark Streaming and the wish list of features that we may target in the future.
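As a reminder of what that batch-like API looks like, here is the classic Spark Streaming word count in Scala; the socket host and port are illustrative assumptions.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("dstream-wordcount")
    val ssc = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

    // Count words arriving on a TCP socket, batch by batch.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()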
June 15, 2015 05:00 PM PT
Spark Streaming extends the core Apache Spark to perform large-scale stream processing. It is being rapidly adopted by companies spread across various business verticals – ad monitoring, real-time analysis of machine data, anomaly detections, etc. This interest is due to its simple, high-level programming model, and its seamless integration with SQL querying (Spark SQL), machine learning algorithms (MLlib), etc. However, for building a real-time streaming analytics pipeline, its not sufficient to be able to easily express your business logic. Running the platform with high uptimes and continuously monitoring it has a lot of operational challenges. Fortunately, Spark Streaming makes all that easy as well. In this talk, I am going to elaborate about various operational aspects of a Spark Streaming application at different stages of deployment – prototyping, testing, monitoring continuous operation, upgrading. In short, all the recipes that takes you from “hello-world” to large scale production in no time.
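One such operational recipe, sketched here under illustrative assumptions (checkpoint directory, source, and batch logic), is driver fault tolerance via checkpointing and StreamingContext.getOrCreate:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("resilient-streaming-app")
      val ssc = new StreamingContext(conf, Seconds(10))
      ssc.checkpoint(checkpointDir)                  // enable metadata checkpointing
      ssc.socketTextStream("localhost", 9999).count().print()
      ssc
    }

    // After a driver failure, the context is rebuilt from the checkpoint
    // instead of being recreated from scratch.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()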
Learn more:
February 17, 2016 04:00 PM PT
As the adoption of Spark Streaming increases rapidly, the community has been asking for greater robustness and scalability from Spark Streaming applications in a wider range of operating environments. To fulfill these demands, we have steadily added a number of features in Spark Streaming. We have added backpressure mechanisms, which allow Spark Streaming to dynamically adapt to changes in incoming data rates and maintain the stability of the application. In addition, we are extending Spark's Dynamic Allocation to Spark Streaming, so that streaming applications can elastically scale based on processing requirements. In my talk, I am going to explore these mechanisms and explain how developers can write robust, scalable and adaptive streaming applications using them.
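A minimal sketch of turning on the backpressure mechanism described above; the rate-limit value is an illustrative assumption.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("adaptive-streaming-app")
      // Let the receiving rate adapt dynamically to the processing rate.
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard ceiling on records per second per receiver.
      .set("spark.streaming.receiver.maxRate", "10000")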
In Spark 2.0, we have extended DataFrames and Datasets in Spark to handle streaming data. Streaming Datasets not only provide a single programming abstraction for batch and streaming data, they also bring support for event-time based processing, out-of-order/delayed data, sessionization and tight integration with non-streaming data sources and sinks. In this talk, I will take a deep dive into the concepts and the API and show how this simplifies building complex “continuous applications”.
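A minimal Scala sketch of that single abstraction: the same DataFrame transformation applied first to a static read and then to a streaming read with an event-time watermark. Paths and schema are illustrative assumptions.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("unified-sketch").getOrCreate()

    // One piece of business logic, written once.
    def hourlyCounts(df: DataFrame): DataFrame =
      df.groupBy(window(col("ts"), "1 hour"), col("action")).count()

    // Batch: run it over historical data.
    val history = spark.read.schema("ts TIMESTAMP, action STRING").json("/data/history/")
    hourlyCounts(history).show()

    // Streaming: the identical function, with a watermark so late and
    // out-of-order events are handled and old state is eventually dropped.
    val live = spark.readStream.schema("ts TIMESTAMP, action STRING").json("/data/incoming/")
    hourlyCounts(live.withWatermark("ts", "2 hours"))
      .writeStream.outputMode("update").format("console").start()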
Learn more:
February 7, 2017 04:00 PM PT
In mid-2016, we introduced Structured Streaming, a new stream processing engine built on Spark SQL that revolutionized how developers can write stream processing applications without having to reason about streaming. It allows users to express their streaming computations the same way they would express a batch computation on static data. The Spark SQL engine takes care of running it incrementally and continuously updating the final result as streaming data continues to arrive. It truly unifies batch, streaming and interactive processing in the same Datasets/DataFrames API and the same optimized Spark SQL processing engine.
The initial alpha release of Structured Streaming in Apache Spark 2.0 introduced the basic aggregation APIs and files as a streaming source and sink. Since then, we have put in a lot of work to make it ready for production use. In this talk, I am going to talk in more detail about the major features we have added, the recipes for using them in production, and the exciting new features we have planned for future releases. Some of these features are as follows; a brief sketch of the Kafka source appears after the list.
- Design and use of the Kafka Source
- Support for watermarks and event-time processing
- Support for more operations and output modes
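The promised sketch of the Kafka source, combined with an event-time watermark and the update output mode; the broker address, topic, and window sizes are illustrative assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._

    val spark = SparkSession.builder.appName("kafka-source-sketch").getOrCreate()

    // Each Kafka record arrives with key, value, topic, partition, offset, timestamp, ...
    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "events")
      .load()

    val counts = kafka
      .selectExpr("CAST(value AS STRING) AS body", "timestamp")
      .withWatermark("timestamp", "10 minutes")        // tolerate late, out-of-order data
      .groupBy(window(col("timestamp"), "5 minutes"))
      .count()

    // Update mode emits only the windows that changed in each trigger.
    counts.writeStream.outputMode("update").format("console").start()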
June 5, 2017 05:00 PM PT
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Dataset and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where - in less than 10 lines - you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He'll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
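A hedged Scala sketch of the kind of pipeline described, not the exact notebook from the session; the topic, payload schema, and paths are illustrative assumptions.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    val spark = SparkSession.builder.appName("kafka-json-enrich").getOrCreate()

    val payloadSchema = new StructType()
      .add("device", StringType).add("ts", TimestampType).add("reading", DoubleType)

    // Static dimension data used to enrich the stream.
    val devices = spark.read.json("/data/device_metadata/")

    spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribe", "sensors")
      .load()
      .select(from_json(col("value").cast("string"), payloadSchema).alias("e"))  // JSON payload -> columns
      .select("e.*")
      .join(devices, "device")                         // stream-static join for enrichment
      .writeStream
      .format("parquet")
      .option("checkpointLocation", "/tmp/chk/sensors")
      .start("/data/tables/sensor_readings")           // files ready for batch and ad-hoc queries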
October 24, 2017 05:00 PM PT
Stateful processing is one of the most challenging aspects of distributed, fault-tolerant stream processing. The DataFrame APIs in Structured Streaming make it very easy for the developer to express their stateful logic, either implicitly (streaming aggregations) or explicitly (mapGroupsWithState). However, there are a number of moving parts under the hood which make all the magic possible. In this talk, I am going to dive deeper into how stateful processing works in Structured Streaming. In particular, I am going to discuss the following.
• Different stateful operations in Structured Streaming
• How state data is stored in a distributed, fault-tolerant manner using State Stores
• How you can write custom State Stores for saving state to external storage systems.
Session hashtag: #EUstr7
October 24, 2017 05:00 PM PT
Last year, in Apache Spark 2.0, Databricks introduced Structured Streaming, a new stream processing engine built on Spark SQL, which revolutionized how developers could write stream processing applications. Structured Streaming enables users to express their computations the same way they would express a batch query on static data. Developers can express queries using powerful high-level APIs including DataFrames, Dataset and SQL. Then, the Spark SQL engine is capable of converting these batch-like transformations into an incremental execution plan that can process streaming data, while automatically handling late, out-of-order data and ensuring end-to-end exactly-once fault-tolerance guarantees.
Since Spark 2.0, Databricks has been hard at work building first-class integration with Kafka. With this new connectivity, performing complex, low-latency analytics is now as easy as writing a standard SQL query. This functionality, in addition to the existing connectivity of Spark SQL, makes it easy to analyze data using one unified framework. Users can now seamlessly extract insights from data, independent of whether it is coming from messy / unstructured files, a structured / columnar historical data warehouse, or arriving in real-time from Kafka/Kinesis.
In this session, Das will walk through a concrete example where – in less than 10 lines – you read Kafka, parse JSON payload data into separate columns, transform it, enrich it by joining with static data and write it out as a table ready for batch and ad-hoc queries on up-to-the-last-minute data. He’ll use techniques including event-time based aggregations, arbitrary stateful operations, and automatic state management using event-time watermarks.
Session hashtag: #EUdd1