Fine-Grained Demand Forecasting - Spark 3 (Python)

The objective of this notebook is to illustrate how we might generate a large number of fine-grained forecasts at the store-item level in an efficient manner leveraging the distributed computational power of Databricks. This is a Spark 3.x update to a previously published notebook which had been developed for Spark 2.x. UPDATE marks in this notebook indicate changes in the code intended to reflect new functionality in either Spark 3.x or the Databricks platform.

For this exercise, we will make use of an increasingly popular library for demand forecasting, FBProphet, which we will load into the notebook session associated with a cluster running Databricks Runtime 7.1 or higher:

UPDATE With Databricks Runtime 7.1, we can now install notebook-scoped libraries using the %pip magic command.

%pip install pystan==2.19.1.1  # per https://github.com/facebook/prophet/commit/82f3399409b7646c49280688f59e5a3d2c936d39#comments
%pip install fbprophet==0.6
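Note that pystan is pinned and installed before fbprophet because fbprophet 0.6 compiles its Stan models against pystan at install time. If you want to confirm both packages landed in the notebook scope, a quick check (an optional cell, not part of the original notebook) is:

%pip show pystan fbprophet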

Step 1: Examine the Data

For our training dataset, we will make use of 5 years of store-item unit sales data for 50 items across 10 different stores. This dataset is publicly available as part of a past Kaggle competition and can be downloaded here.

Once downloaded, we can unzip the train.csv.zip file and upload the decompressed CSV to /FileStore/demand_forecast/train/ using the file import steps documented here. With the dataset accessible within Databricks, we can now explore it in preparation for modeling:

from pyspark.sql.types import *
 
# structure of the training data set
train_schema = StructType([
  StructField('date', DateType()),
  StructField('store', IntegerType()),
  StructField('item', IntegerType()),
  StructField('sales', IntegerType())
  ])
 
# read the training file into a dataframe
train = spark.read.csv(
  'dbfs:/FileStore/demand_forecast/train/train.csv', 
  header=True, 
  schema=train_schema
  )
 
# make the dataframe queriable as a temporary view
train.createOrReplaceTempView('train')
 
# show data
display(train)
 
date        store  item  sales
2013-01-01      1     1     13
2013-01-02      1     1     11
2013-01-03      1     1     14
2013-01-04      1     1     13
2013-01-05      1     1     10
2013-01-06      1     1     12
2013-01-07      1     1     10
2013-01-08      1     1      9
2013-01-09      1     1     12
2013-01-10      1     1      9
2013-01-11      1     1      9
2013-01-12      1     1      7
2013-01-13      1     1     10
2013-01-14      1     1     12
2013-01-15      1     1      5
2013-01-16      1     1      7
2013-01-17      1     1     16

Showing the first 1000 rows.

When performing demand forecasting, we are often interested in general trends and seasonality. Let's start our exploration by examining the annual trend in unit sales:

%sql
 
SELECT
  year(date) as year, 
  sum(sales) as sales
FROM train
GROUP BY year(date)
ORDER BY year;
[Chart: total unit sales by year, 2013–2017]

The data show a clear upward trend in total unit sales across the stores. With better knowledge of the markets these stores serve, we might want to determine whether there is a maximum growth capacity that sales should be expected to approach over the life of our forecast. Without that knowledge, and from a quick visual inspection of this dataset, it seems safe to assume that for a forecast a few days, months or even a year out, we can expect continued linear growth over that span.
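This linear-growth assumption maps directly onto Prophet's growth parameter. As a preview of the model configuration we will work toward in Step 2 (a sketch, not a cell from the original notebook):

from fbprophet import Prophet
 
# the observed linear trend translates into Prophet's default linear growth mode;
# a saturating forecast would instead use growth='logistic' plus a capacity ('cap') column
model = Prophet(growth='linear')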

Now let's examine seasonality. If we aggregate the data around the individual months in each year, a distinct yearly seasonal pattern is observed which seems to grow in scale with overall growth in sales:

%sql
 
SELECT 
  TRUNC(date, 'MM') as month,
  SUM(sales) as sales
FROM train
GROUP BY TRUNC(date, 'MM')
ORDER BY month;
[Chart: total unit sales by month, 2013–2017]

Aggregating the data at a weekday level, a pronounced weekly seasonal pattern is observed with a peak on Sunday (weekday 0), a hard drop on Monday (weekday 1) and then a steady pickup over the week heading back to the Sunday high. This pattern seems to be pretty stable across the five years of observations:

UPDATE As part of the Spark 3 move to the Proleptic Gregorian calendar, the 'u' option previously used in DATE_FORMAT(date, 'u') was removed. We now use 'E', which provides similar output.

%sql
 
SELECT
  YEAR(date) as year,
  (
    CASE
      WHEN DATE_FORMAT(date, 'E') = 'Sun' THEN 0
      WHEN DATE_FORMAT(date, 'E') = 'Mon' THEN 1
      WHEN DATE_FORMAT(date, 'E') = 'Tue' THEN 2
      WHEN DATE_FORMAT(date, 'E') = 'Wed' THEN 3
      WHEN DATE_FORMAT(date, 'E') = 'Thu' THEN 4
      WHEN DATE_FORMAT(date, 'E') = 'Fri' THEN 5
      WHEN DATE_FORMAT(date, 'E') = 'Sat' THEN 6
    END
  ) % 7 as weekday,
  AVG(sales) as sales
FROM (
  SELECT 
    date,
    SUM(sales) as sales
  FROM train
  GROUP BY date
 ) x
GROUP BY year, weekday
ORDER BY year, weekday;
[Chart: average daily unit sales by weekday (0 = Sunday through 6 = Saturday) for each year, 2013–2017]
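As an aside, the same 0 (Sunday) through 6 (Saturday) index can be produced more compactly with Spark SQL's built-in DAYOFWEEK function, which returns 1 for Sunday through 7 for Saturday. An equivalent alternative to the query above (not the approach used in this notebook) would be:

%sql
 
SELECT
  YEAR(date) as year,
  DAYOFWEEK(date) - 1 as weekday,  -- DAYOFWEEK: 1 = Sunday .. 7 = Saturday
  AVG(sales) as sales
FROM (
  SELECT 
    date,
    SUM(sales) as sales
  FROM train
  GROUP BY date
 ) x
GROUP BY year, weekday
ORDER BY year, weekday;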

Now that we are oriented to the basic patterns within our data, let's explore how we might build a forecast.

Step 2: Build a Single Forecast

Before attempting to generate forecasts for individual combinations of stores and items, it might be helpful to build a single forecast for no other reason than to orient ourselves to the use of FBProphet.

Our first step is to assemble the historical dataset on which we will train the model:

# query to retrieve the store 1, item 1 sales history at the date (ds) level
sql_statement = '''
  SELECT
    CAST(date as date) as ds,
    sales as y
  FROM train
  WHERE store=1 AND item=1
  ORDER BY ds
  '''
 
# assemble dataset in Pandas dataframe
history_pd = spark.sql(sql_statement).toPandas()
 
# drop any missing records
history_pd = history_pd.dropna()
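Before moving on, a quick look at the assembled dataframe (an optional check, not in the original notebook) confirms it has the two columns Prophet expects, ds and y:

# inspect the assembled history for store 1, item 1
print(history_pd.shape)
history_pd.head()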

Now, we will import the fbprophet library, but because it can be a bit verbose when in use, we will need to fine-tune the logging settings in our environment:
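A minimal sketch of that import and logging setup might look like the following (the specific loggers quieted here are an assumption; py4j and fbprophet are the usual sources of verbose output in this setup):

import logging
 
# silence verbose output from the JVM bridge and from Prophet itself
# (logger names are an assumption; adjust if other libraries prove noisy)
logging.getLogger('py4j').setLevel(logging.ERROR)
logging.getLogger('fbprophet').setLevel(logging.WARNING)
 
from fbprophet import Prophet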