September 6, 2016

Writing Data Engineering Pipelines in Apache Spark on Databricks

This is part 3 of a 3-part series providing a gentle introduction to writing Apache Spark applications on Databricks.

For data engineers, the big advantage of running Apache Spark on Databricks is how easily it plugs into an entire ecosystem of databases, tools, and environments. Building robust pipelines is simpler because you can prototype on a small amount of data and then scale to nearly any volume. Beyond sheer data volume, there are many Spark packages for connecting to a variety of data sources, from SQL to NoSQL and everything in between.

Databricks takes this to the next level by providing a holistic set of managed services such as notebooks, a sophisticated API, production jobs and workflows, and support for continuous integration use cases. Take a look at demonstrations from companies building production systems on Apache Spark and Databricks, such as Swoop, Riot Games, and Conviva.

The first guide in this series provided an introduction to writing Apache Spark applications on Databricks. Be sure to check it out if you have not already! The second guide was geared toward the workflow of a data scientist, showing the process of iterative data analysis leading to a predictive model built on datasets from the US Department of Agriculture and the Internal Revenue Service.

This third guide is meant for the data engineer who needs a simple, production-ready walkthrough of how to take raw data, transform it with Python, SQL, or Scala, create and register Apache Spark UDFs, and finally write that data to multiple locations using Apache Spark’s data source APIs. While Spark already makes these tasks simpler, this example will show how Databricks makes it even easier for a data engineer to take a prototype to production.

What’s in this guide

The guide illustrates how to import data and build a robust Apache Spark data pipeline on Databricks. We’ll walk through building a simple log pipeline, from the raw logs all the way to placing this data into permanent storage. In the process, we will demonstrate common tasks data engineers have to perform in an ETL pipeline, such as getting raw data from a variety of sources like Amazon S3, converting the data to Parquet format, and putting it into a data warehouse.

Diagram of the guide

This end-to-end tutorial will also shed light on common challenges in data engineering and provide solutions to them. These include working with a variety of different languages and handling non-standard datetimes in UDFs written in Scala and Spark SQL.
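To make the datetime challenge concrete, here is a minimal sketch in plain Python (no Spark required) of normalizing an Apache-style log timestamp. The function name `parse_apache_timestamp` and the choice of ISO 8601 UTC output are assumptions made for illustration, not the guide's own implementation. A function like this could be wrapped as a UDF once it behaves as expected.

```python
from datetime import datetime, timezone

# Apache access logs write timestamps like "10/Oct/2000:13:55:36 -0700",
# which most SQL engines will not parse without help.
# Note: %b matches English month abbreviations under the default C locale.
APACHE_TS_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def parse_apache_timestamp(raw):
    """Convert an Apache log timestamp to an ISO 8601 string in UTC.

    Returns None for missing or malformed input so that one bad record
    does not fail the whole pipeline. (Hypothetical helper.)
    """
    if raw is None:
        return None
    try:
        # Strip the square brackets that surround the field in raw logs.
        parsed = datetime.strptime(raw.strip("[]"), APACHE_TS_FORMAT)
    except ValueError:
        return None
    return parsed.astimezone(timezone.utc).isoformat()


print(parse_apache_timestamp("[10/Oct/2000:13:55:36 -0700]"))
```

Returning `None` instead of raising is a design choice: in an ETL job it is usually easier to filter or count null timestamps afterwards than to recover from an exception thrown in the middle of a distributed task.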

The guide gives you an example of a stable ETL pipeline that we’ll be able to put right into production with Databricks’ Job Scheduler. It goes through the following steps:

We’ll create a function in Python that will convert raw Apache logs sitting in an S3 bucket to a DataFrame.
Next, we’ll enumerate the ways to create a UDF in Scala. This will allow us to use it both as a SQL function and on Scala DataFrames.
After joining that DataFrame with another one, we’ll write the data out to permanent storage in Redshift and keep a backup copy in Amazon S3.
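As a rough illustration of the first step, the sketch below parses raw Apache access log lines into structured records using only the Python standard library. It assumes the logs follow the Common Log Format. The names `LOG_PATTERN`, `parse_log_line`, and `parse_logs` are hypothetical and are not taken from the guide's notebook.

```python
import re

# Common Log Format: host identity user [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_log_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # Apache writes "-" when no response body was sent; treat that as 0 bytes.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


def parse_logs(lines):
    """Parse many lines, silently dropping those that fail to match."""
    parsed = (parse_log_line(line) for line in lines)
    return [record for record in parsed if record is not None]


sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
for row in parse_logs([sample, "this line is not a log entry"]):
    print(row["host"], row["status"], row["size"])
```

In a Spark job, a function like `parse_log_line` could be mapped over the lines read from storage before building a DataFrame from the resulting records. Keeping the parsing logic in an ordinary function also makes it easy to test on a handful of sample lines first.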

What’s next

You can work through the examples in this guide with the Databricks platform (sign up to try it for free). For more resources to help you get started with Apache Spark, check out our introduction guide.
