A Case Study in Rearchitecting an On-Premises Pipeline in the Cloud
On Demand
Type
- Session
Format
- In-Person
Track
- Data Engineering
Industry
- Public Sector
Difficulty
- Intermediate
Room
- Moscone South | Level 3 | 314
Duration
- 35 min
Overview
This talk is a case study in migrating on-premises data pipelines to a cloud environment. In our case, we needed to rebuild a pipeline that included both streaming and batch components. The streaming component performed field extractions and wrote the resulting raw data to HDFS, where a Spark job picked it up and applied aggregations for later use by analysts.
We replicated the streaming portion of the pipeline in Azure with a combination of Logstash deployed via Kubernetes, Azure Event Hubs, Azure Functions, and Blob Storage. We then used batch jobs, written in Python with Pandas and orchestrated by Prefect, to replicate the aggregations. Finally, we made the data available to our analysts in the cloud via Azure Databricks.
In this talk, I will discuss the practical design decisions behind these choices, as well as the technical challenges we encountered while recreating the pipeline. I will also share the lessons we learned during the migration and how we applied them to similar projects that came up later.
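As a rough illustration of the batch stage described above, the following is a minimal sketch of a Pandas aggregation over extracted events. It is not the code from the talk: the column names (`timestamp`, `source`, `bytes`), the sample data, and the function names are all hypothetical, and the in-memory CSV stands in for files that would be read from Blob Storage. Orchestration is omitted; in a Prefect deployment, each function would presumably be wrapped as a task within a flow.

```python
import io

import pandas as pd

# Hypothetical sample of extracted events, standing in for raw files in Blob Storage.
RAW_CSV = """timestamp,source,bytes
2023-01-01T00:05:00,app-a,100
2023-01-01T00:45:00,app-a,250
2023-01-01T01:10:00,app-a,50
2023-01-01T00:30:00,app-b,75
"""


def load_events(buffer):
    """Read raw extracted events into a DataFrame with parsed timestamps."""
    return pd.read_csv(buffer, parse_dates=["timestamp"])


def aggregate_hourly(events):
    """Count events and sum bytes per source and hour (an assumed aggregation)."""
    return (
        events
        .assign(hour=events["timestamp"].dt.floor("60min"))
        .groupby(["source", "hour"], as_index=False)
        .agg(event_count=("bytes", "size"), total_bytes=("bytes", "sum"))
    )


events = load_events(io.StringIO(RAW_CSV))
hourly = aggregate_hourly(events)
print(hourly)
```

Keeping the load and aggregate steps as separate plain functions makes each one easy to test on its own and to retry independently once an orchestrator is added.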