Sqoop on Spark for Data Ingestion

Download Slides

Apache Sqoop has been used primarily for transfer of data between relational databases and HDFS, leveraging the Hadoop Mapreduce engine. Recently the Sqoop community has made changes to allow data transfer across any two data sources represented in code by Sqoop connectors. For instance, it’s possible to use the latest Apache Sqoop to transfer data from MySQL to kafka or vice versa via the jdbc connector and kafka connector, respectively.

This talk will focus on running Sqoop jobs on Apache Spark engine and proposed extensions to the APIs to use the Spark functionality. We’ll discuss the design options explored and implemented to submit jobs to the Spark engine. We’ll do a demo of one of the Sqoop job flows on Apache spark and how to use the Sqoop job APIs to monitor the Sqoop jobs. The talk will conclude use cases for Sqoop and Spark at Uber.

Learn more:

  • Apache Spark and Hadoop: Working Together

    « back
  • About Vinoth Chandar

    Vinoth is the founding engineer/architect of the data team at Uber, as well as author of many data processing & querying systems at Uber, including "Hoodie". He has keen interest in unified architectures for data analytics and processing. Previously, Vinoth was the lead on Linkedin’s Voldemort key value store and has also worked on Oracle Database replication engine, HPC, and stream processing.