A Case Study in Rearchitecting an On-Premises Pipeline in the Cloud
- Data Engineering
- Public Sector
- Moscone South | Level 3 | 314
- 35 min
This talk is a case study in migrating on-premises data pipelines to a cloud environment. In our case, we wanted to rebuild a pipeline that included both streaming and batch components. The streaming component performed field extractions and wrote the resulting raw data to HDFS, where a Spark job picked it up and applied aggregations for later use by analysts.
We replicated the streaming portion of the pipeline in Azure with a combination of Logstash deployed on Kubernetes, Azure Event Hubs, Azure Functions, and Blob Storage. We then used batch jobs, written in Python with Pandas and orchestrated by Prefect, to replicate the aggregations. Finally, we made the data available to our analysts in the cloud via Azure Databricks.
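To give a rough sense of what a Pandas batch aggregation of this kind can look like, here is a minimal sketch. It is not the code from the talk: the column names (`timestamp`, `source`, `bytes`), the sample records, and the `aggregate_hourly` function are all hypothetical, and the hourly count-and-sum aggregation is only an assumed example. In a setup like the one described, a function of this shape would presumably be wrapped as a Prefect task and read its input from Blob Storage; both are left out here to keep the example self-contained.

```python
import pandas as pd

# Hypothetical extracted events, standing in for records that the
# streaming stage would have written to storage.
events = pd.DataFrame(
    {
        "timestamp": pd.to_datetime(
            [
                "2023-01-01 00:05",
                "2023-01-01 00:40",
                "2023-01-01 01:10",
                "2023-01-01 01:20",
            ]
        ),
        "source": ["web", "web", "api", "web"],
        "bytes": [120, 80, 300, 50],
    }
)


def aggregate_hourly(df: pd.DataFrame) -> pd.DataFrame:
    """Count events and sum bytes per source for each hour (assumed aggregation)."""
    return (
        # Truncate each timestamp to the start of its hour.
        df.assign(hour=df["timestamp"].dt.floor("h"))
        # Group by hour and source; keep the keys as regular columns.
        .groupby(["hour", "source"], as_index=False)
        .agg(event_count=("bytes", "count"), total_bytes=("bytes", "sum"))
    )


hourly = aggregate_hourly(events)
print(hourly)
```

The result has one row per hour and source, which is the kind of compact table an analyst could then query from a notebook environment.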
In this talk, I will discuss the practical design decisions behind these choices and the technical challenges we encountered while recreating the pipeline. I will also share the lessons we learned during the migration and how we applied them to similar projects that came up later.