When providing recommendations to shoppers on what to purchase, you are often looking for items that are frequently purchased together (e.g. peanut butter and jelly). A key technique to uncover associations between different items is known as market basket analysis. In your recommendation engine toolbox, the association rules generated by market basket analysis (e.g. if one purchases peanut butter, then they are likely to purchase jelly) is an important and useful technique. With the rapid growth e-commerce data, it is necessary to execute models like market basket analysis on increasing larger sizes of data. That is, it will be important to have the algorithms and infrastructure necessary to generate your association rules on a distributed platform. In this blog post, we will discuss how you can quickly run your market basket analysis using Apache Spark MLlib
FP-growth algorithm on Databricks.
To showcase this, we will use the publicly available Instacart Online Grocery Shopping Dataset 2017. In the process, we will explore the dataset as well as perform our market basket analysis to recommend shoppers to buy it again or recommend to buy new items.
- Ingest your data: Bringing in the data from your source systems; often involving ETL processes (though we will bypass this step in this demo for brevity)
- Explore your data using Spark SQL: Now that you have cleansed data, explore it so you can get some business insight
- Train your ML model using FP-growth: Execute FP-growth to execute your frequent pattern mining algorithm
- Review the association rules generated by the ML model for your recommendations
The dataset we will be working with is 3 Million Instacart Orders, Open Sourced dataset:
The "Instacart Online Grocery Shopping Dataset 2017”, Accessed from https://www.instacart.com/datasets/grocery-shopping-2017 on 01/17/2018. This anonymized dataset contains a sample of over 3 million grocery orders from more than 200,000 Instacart users. For each user, we provide between 4 and 100 of their orders, with the sequence of products purchased in each order. We also provide the week and hour of day the order was placed, and a relative measure of time between orders.
You will need to download the file, extract the files from the gzipped TAR archive, and upload them into Databricks DBFS using the Import Data utilities. You should see the following files within
dbfs once the files are uploaded:
Orders: 3.4M rows, 206K users
Products: 50K rows
Aisles: 134 rows
Departments: 21 rows
order_products__SET: 30M+ rows where
SETis defined as:
prior: 3.2M previous orders
train: 131K orders for your training dataset
Refer to the Instacart Online Grocery Shopping Dataset 2017 Data Descriptions for more information including the schema.
Now that you have uploaded your data to dbfs, you can quickly and easily create your DataFrames using
# Import Data aisles = spark.read.csv("/mnt/bhavin/mba/instacart/csv/aisles.csv", header=True, inferSchema=True) departments = spark.read.csv("/mnt/bhavin/mba/instacart/csv/departments.csv", header=True, inferSchema=True) order_products_prior = spark.read.csv("/mnt/bhavin/mba/instacart/csv/order_products__prior.csv", header=True, inferSchema=True) order_products_train = spark.read.csv("/mnt/bhavin/mba/instacart/csv/order_products__train.csv", header=True, inferSchema=True) orders = spark.read.csv("/mnt/bhavin/mba/instacart/csv/orders.csv", header=True, inferSchema=True) products = spark.read.csv("/mnt/bhavin/mba/instacart/csv/products.csv", header=True, inferSchema=True) # Create Temporary Tables aisles.createOrReplaceTempView("aisles") departments.createOrReplaceTempView("departments") order_products_prior.createOrReplaceTempView("order_products_prior") order_products_train.createOrReplaceTempView("order_products_train") orders.createOrReplaceTempView("orders") products.createOrReplaceTempView("products")
Exploratory Data Analysis
Now that you have created DataFrames, you can perform exploratory data analysis using Spark SQL. The following queries showcase some of the quick insight you can gain from the Instacart dataset.
Orders by Day of Week
The following query allows you to quickly visualize that Sunday is the most popular day for the total number of orders while Thursday has the least number of orders.
%sql select count(order_id) as total_orders, (case when order_dow = '0' then 'Sunday' when order_dow = '1' then 'Monday' when order_dow = '2' then 'Tuesday' when order_dow = '3' then 'Wednesday' when order_dow = '4' then 'Thursday' when order_dow = '5' then 'Friday' when order_dow = '6' then 'Saturday' end) as day_of_week from orders group by order_dow order by total_orders desc
Orders by Hour
When breaking down the hours typically people are ordering their groceries from Instacart during business working hours with highest number orders at 10:00am.
%sql select count(order_id) as total_orders, order_hour_of_day as hour from orders group by order_hour_of_day order by order_hour_of_day
Understand shelf space by department
As we dive deeper into our market basket analysis, we can gain insight on the number of products by department to understand how much shelf space is being used.
%sql select d.department, count(distinct p.product_id) as products from products p inner join departments d on d.department_id = p.department_id group by d.department order by products desc limit 10
As can see from the preceding image, typically the number of unique items (i.e. products) involve personal care and snacks.
Organize Shopping Basket
To prepare our data for downstream processing, we will organize our data by shopping basket. That is, each row of our DataFrame represents an
order_id with each
items column containing an array of items.
# Organize the data by shopping basket from pyspark.sql.functions import collect_set, col, count rawData = spark.sql("select p.product_name, o.order_id from products p inner join order_products_train o where o.product_id = p.product_id") baskets = rawData.groupBy('order_id').agg(collect_set('product_name').alias('items')) baskets.createOrReplaceTempView('baskets')
Just like the preceding graphs, we can visualize the nested items using the
display command in our Databricks notebooks.
Train ML Model
To understand the frequency of items are associated with each other (e.g. how many times does peanut butter and jelly get purchased together), we will use association rule mining for market basket analysis. Spark MLlib implements two algorithms related to frequency pattern mining (FPM):
PrefixSpan. The distinction is that
FP-growth does not use order information in the itemsets, if any, while
PrefixSpan is designed for sequential pattern mining where the itemsets are ordered. We will use
FP-growth as the order information is not important for this use case.
Note, we will be using the Scala API so we can configure setMinConfidence
%scala import org.apache.spark.ml.fpm.FPGrowth // Extract out the items val baskets_ds = spark.sql("select items from baskets").as[Array[String]].toDF("items") // Use FPGrowth val fpgrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.001).setMinConfidence(0) val model = fpgrowth.fit(baskets_ds) // Calculate frequent itemsets val mostPopularItemInABasket = model.freqItemsets mostPopularItemInABasket.createOrReplaceTempView("mostPopularItemInABasket")
With Databricks notebooks, you can use the %scala to execute Scala code within a new cell in the same Python notebook.
mostPopularItemInABasket DataFrame created, we can use Spark SQL to query for the most popular items in a basket where there are more than 2 items with the following query.
%sql select items, freq from mostPopularItemInABasket where size(items) > 2 order by freq desc limit 20
As can be seen in the preceding table, the most frequent purchases of more than two items involve organic avocados, organic strawberries, and organic bananas. Interesting, the top five frequently purchased together items involve various permutations of organic avocados, organic strawberries, organic bananas, organic raspberries, and organic baby spinach. From the perspective of recommendations, the
freqItemsets can be the basis for the buy-it-again recommendation in that if a shopper has purchased the items previously, it makes sense to recommend that they purchase it again.
Review Association Rules
In addition to
FP-growth model also generates
associationRules. For example, if a shopper purchases peanut butter, what is the probability (or confidence) that they will also purchase jelly. For more information, a good reference is Susan Li's A Gentle Introduction on Market Basket Analysis — Association Rules.
%scala // Display generated association rules. val ifThen = model.associationRules ifThen.createOrReplaceTempView("ifThen")
A good way to think about association rules is that model determines that if you purchased something (i.e. the antecedent), then you will purchase this other thing (i.e. the consequent) with the following confidence.
%sql select antecedent as `antecedent (if)`, consequent as `consequent (then)`, confidence from ifThen order by confidence desc limit 20
As can be seen in the preceding graph, there is relatively strong confidence that if a shopper has organic raspberries, organic avocados, and organic strawberries in their basket, then it may make sense to recommend organic bananas as well. Interestingly, the top 10 (based on descending confidence) association rules - i.e. purchase recommendations - are associated with organic bananas or bananas.
In summary, we demonstrated how to explore our shopping cart data and execute market basket analysis to identify items frequently purchased together as well as generating association rules. By using Databricks, in the same notebook we can visualize our data; execute Python, Scala, and SQL; and run our
FP-growth algorithm on an auto-scaling distributed Spark cluster - all managed by Databricks. Putting these components together simplifies the data flow and management of your infrastructure for you and your data practitioners. Try out the Market Basket Analysis using Instacart Online Grocery Dataset with Databricks today.