Data + AI Summit 2022

A Case Study in Rearchitecting an On-Premises Pipeline in the Cloud

Type: Session
Format: In-Person
Track: Data Engineering
Industry: Public Sector
Difficulty: Intermediate
Room: Moscone South | Level 3 | 314
Duration: 35 min

Overview

This talk is a case study in migrating on-premises data pipelines to a cloud environment. In our case, we wanted to rebuild a pipeline that included both streaming and batch components. The streaming component performed field extractions and stored the resulting raw data in HDFS, where a Spark job picked it up and applied aggregations for later use by analysts.
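
To make the "field extraction" step concrete, here is a minimal sketch in Python. It is not the pipeline from the talk: the line format, the field names (`timestamp`, `host`, `status`, `bytes`), and the function names are all hypothetical, chosen only to show raw lines being turned into structured records before they are written to storage.

```python
import json
import re

# Hypothetical raw line format (an assumption; the abstract does not specify one):
#   <timestamp> <host> <status> <bytes>
LINE_PATTERN = re.compile(
    r"(?P<timestamp>\S+) (?P<host>\S+) (?P<status>\d{3}) (?P<bytes>\d+)"
)


def extract_fields(line):
    """Return the named fields of one raw line as a dict, or None if it does not match."""
    match = LINE_PATTERN.fullmatch(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert numeric fields so later aggregations can sum and compare them.
    record["status"] = int(record["status"])
    record["bytes"] = int(record["bytes"])
    return record


def to_json_lines(lines):
    """Extract fields from each line, skip unparseable lines, and serialize as JSON Lines."""
    records = (extract_fields(line) for line in lines)
    return "\n".join(json.dumps(r, sort_keys=True) for r in records if r is not None)
```

Writing one JSON object per line keeps the stored data easy for a downstream batch job to read, whatever engine that job uses.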

We replicated the streaming portion of the pipeline in Azure with a combination of Logstash deployed via Kubernetes, Azure Event Hubs, Azure Functions, and Blob Storage. We then used batch jobs, written in Python with Pandas and orchestrated by Prefect, to replicate the aggregations. Finally, we made the data available to our analysts in the cloud via Azure Databricks.
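
As an illustration of what a Pandas batch aggregation of this kind might look like, the sketch below groups extracted events by host and day. The column names (`timestamp`, `host`, `status`, `bytes`) and the function `aggregate_events` are assumptions made for the example, not details from the talk. In a Prefect deployment, a function like this would typically be registered as a task inside a flow; that wiring is omitted here.

```python
import pandas as pd


def aggregate_events(events: pd.DataFrame) -> pd.DataFrame:
    """Count events and sum bytes per host and calendar day (UTC)."""
    events = events.copy()  # avoid mutating the caller's frame
    # Derive the UTC calendar day from the (assumed) ISO 8601 timestamp column.
    events["day"] = pd.to_datetime(events["timestamp"], utc=True).dt.date
    return events.groupby(["host", "day"], as_index=False).agg(
        event_count=("status", "size"),
        total_bytes=("bytes", "sum"),
    )
```

Keeping the aggregation in a plain function that takes and returns a DataFrame makes it easy to test locally before handing it to a scheduler.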

In this talk, I will discuss the practical design decisions behind these choices and the technical challenges we encountered while recreating the pipeline. I will also share the lessons we learned during the migration and how we applied them to similar projects that came up later.

Session Speakers

Mary Clair Thompson

Lead Data Engineer

Duke University
