Dr. Andrew Ray is a Principal Data Engineer at Silicon Valley Data Science. He enjoys working at the intersection of engineering and data science. Andrew is an active contributor to the Apache Spark project. Previously, Andrew was a Data Scientist at Walmart, where he built an analytics platform on Hadoop that integrated data from multiple retail channels using fuzzy matching and graph algorithms. Andrew also led the adoption of Spark at Walmart from proof of concept to production. Andrew earned his Ph.D. in Mathematics from the University of Nebraska, where he worked on extremal graph theory.
April 23, 2019 05:00 PM PT
At Sam's Club we have a long history of using Apache Spark and Hadoop. Projects from all parts of the company use Apache Spark, from fraud detection to product recommendations. Because of the scale of our business, with billions of transactions and trillions of events, it is often essential to use big data technologies. Until recently, all of this work ran on several large on-premises Hadoop clusters.
As part of our transition to the public cloud, we needed to build out an enterprise-scale data platform. Azure Databricks is a key component of this platform, giving our data scientists, engineers, and business users the ability to easily work with the company's data. We will discuss the architecture considerations that led to using multiple Databricks workspaces and external Azure Blob Storage.
We will also discuss how we move massive amounts of data to Azure on a daily basis with Airflow. Further, we will discuss the self-service tools that we created to help users get their data to Azure and to help us manage the platform. Finally, we will discuss our security considerations and how they shaped our architecture.