Using Apache Spark to Solve Sessionization Problem in Batch and Streaming


Analyzing user sessions provides valuable feedback about what works and what does not. But computing them is not easy because of the data issues and operational costs you will run into sooner or later. In this talk I will present two approaches to computing sessions with Apache Spark and AWS services. The first is batch-based and uses Spark SQL, whereas the second is streaming-based and uses the Structured Streaming module.
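To make "computing sessions" concrete, here is a minimal sketch of gap-based sessionization in plain Python. It is not the talk's Spark code, which is not shown in this abstract; it only illustrates the kind of logic a batch job would have to express. The `sessionize` function, the `(user_id, timestamp)` event shape, and the 30-minute timeout are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical inactivity timeout; the abstract does not state a value.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(
    events: Iterable[Tuple[str, int]],
    gap: int = SESSION_GAP_SECONDS,
) -> Dict[str, List[List[int]]]:
    """Group (user_id, epoch_seconds) events into per-user sessions.

    A new session starts whenever the time since the user's previous
    event is strictly greater than `gap` seconds.
    """
    # Collect timestamps per user (the equivalent of a group-by key).
    by_user: Dict[str, List[int]] = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions: Dict[str, List[List[int]]] = {}
    for user_id, timestamps in by_user.items():
        timestamps.sort()  # input order is not guaranteed
        user_sessions = [[timestamps[0]]]
        for ts in timestamps[1:]:
            if ts - user_sessions[-1][-1] > gap:
                user_sessions.append([ts])  # gap exceeded: open a new session
            else:
                user_sessions[-1].append(ts)  # still within the current session
        sessions[user_id] = user_sessions
    return sessions
```

In a batch pipeline the same idea is usually expressed with a per-user ordering and a comparison against the previous event's timestamp; the sketch above only shows the rule itself.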

During the talk I will cover the problems you may encounter when creating sessions, such as late data, incomplete datasets, duplicated data, reprocessing, and fault tolerance. I will try to solve them and show how Apache Spark features and AWS services (EMR, S3) can help. After the talk you should be aware of the problems that arise in session pipelines, understand how to address them with Apache Spark features such as watermarks, the state store, and checkpoints, and know how to integrate your code with a cloud provider.
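The late-data and duplicate problems mentioned above can be illustrated with a simplified model of a watermark, again in plain Python rather than Spark. This is a sketch under stated assumptions: the `WatermarkFilter` class, its method names, and the event shape are hypothetical, and the model only approximates the general idea of a watermark (the maximum event time seen so far minus an allowed lateness), not Structured Streaming's actual implementation.

```python
from typing import Iterable, List, Optional, Set, Tuple

Event = Tuple[str, int]  # (event_id, event_time_seconds), a hypothetical shape


class WatermarkFilter:
    """Simplified, illustrative model of watermark-based filtering."""

    def __init__(self, allowed_lateness: int) -> None:
        self.allowed_lateness = allowed_lateness
        self.max_event_time: Optional[int] = None
        # Kept in memory forever here; a real pipeline would expire old
        # entries so that state does not grow without bound.
        self.seen_ids: Set[str] = set()

    @property
    def watermark(self) -> Optional[int]:
        """Max event time seen so far minus the allowed lateness."""
        if self.max_event_time is None:
            return None
        return self.max_event_time - self.allowed_lateness

    def process_batch(self, batch: Iterable[Event]) -> List[Event]:
        """Return the events of one batch that are neither late nor duplicated."""
        batch = list(batch)
        watermark = self.watermark  # fixed for the duration of the batch
        accepted: List[Event] = []
        for event_id, ts in batch:
            if watermark is not None and ts < watermark:
                continue  # too late: older than the watermark
            if event_id in self.seen_ids:
                continue  # duplicate delivery of an already seen event
            self.seen_ids.add(event_id)
            accepted.append((event_id, ts))
        # Advance the watermark only after the whole batch was handled.
        for _, ts in batch:
            if self.max_event_time is None or ts > self.max_event_time:
                self.max_event_time = ts
        return accepted
```

A short usage example: with an allowed lateness of 10 seconds, a first batch whose newest event is at time 105 moves the watermark to 95, so an event at time 90 arriving in the next batch is dropped while one at time 96 is kept.

```python
f = WatermarkFilter(allowed_lateness=10)
first = f.process_batch([("e1", 100), ("e2", 105), ("e1", 100)])
second = f.process_batch([("e3", 90), ("e4", 96), ("e2", 105)])
```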

About Bartosz Konieczny

Octo Technology

Bartosz is a data engineer who enjoys working with Apache Spark and cloud data services. By day he works as a data engineering consultant at OCTO Technology. By night, he shares his data engineering findings on and