Fine-Grained Demand Forecasting - Spark 3 (Python)

The objective of this notebook is to illustrate how we might generate a large number of fine-grained forecasts at the store-item level in an efficient manner leveraging the distributed computational power of Databricks. This is a Spark 3.x update to a previously published notebook which had been developed for Spark 2.x. UPDATE marks in this notebook indicate changes in the code intended to reflect new functionality in either Spark 3.x or the Databricks platform.

For this exercise, we will make use of an increasingly popular library for demand forecasting, FBProphet, which we will load into the notebook session associated with a cluster running Databricks Runtime 7.1 or higher:

UPDATE With Databricks Runtime 7.1, we can now install notebook-scoped libraries using the %pip magic command.

%pip install pystan==2.19.1.1  # per https://github.com/facebook/prophet/commit/82f3399409b7646c49280688f59e5a3d2c936d39#comments
%pip install fbprophet==0.6
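Note that pystan is pinned and installed before fbprophet because fbprophet 0.6 compiles its Stan models against pystan at install time. If you want to confirm both packages landed in the notebook scope, a quick check (an optional cell, not part of the original notebook) is:

%pip show pystan fbprophet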

Step 1: Examine the Data

For our training dataset, we will make use of 5 years of store-item unit sales data for 50 items across 10 different stores. This dataset is publicly available as part of a past Kaggle competition and can be downloaded here.

Once downloaded, we can unzip the train.csv.zip file and upload the decompressed CSV to /FileStore/demand_forecast/train/ using the file import steps documented here. With the dataset accessible within Databricks, we can now explore it in preparation for modeling:

from pyspark.sql.types import *
 
# structure of the training data set
train_schema = StructType([
  StructField('date', DateType()),
  StructField('store', IntegerType()),
  StructField('item', IntegerType()),
  StructField('sales', IntegerType())
  ])
 
# read the training file into a dataframe
train = spark.read.csv(
  'dbfs:/FileStore/demand_forecast/train/train.csv', 
  header=True, 
  schema=train_schema
  )
 
# make the dataframe queriable as a temporary view
train.createOrReplaceTempView('train')
 
# show data
display(train)
 
date        store  item  sales
2013-01-01      1     1     13
2013-01-02      1     1     11
2013-01-03      1     1     14
2013-01-04      1     1     13
2013-01-05      1     1     10
2013-01-06      1     1     12
2013-01-07      1     1     10
2013-01-08      1     1      9
2013-01-09      1     1     12
2013-01-10      1     1      9
2013-01-11      1     1      9
2013-01-12      1     1      7
2013-01-13      1     1     10
2013-01-14      1     1     12
2013-01-15      1     1      5
2013-01-16      1     1      7
2013-01-17      1     1     16

Showing the first 1000 rows.

When performing demand forecasting, we are often interested in general trends and seasonality. Let's start our exploration by examining the annual trend in unit sales:

%sql
 
SELECT
  year(date) as year, 
  sum(sales) as sales
FROM train
GROUP BY year(date)
ORDER BY year;
[Chart: total unit sales by year, 2013–2017]

The data show a clear upward trend in total unit sales across the stores. With better knowledge of the markets these stores serve, we might want to determine whether there is a maximum growth capacity that sales should be expected to approach over the life of our forecast. Without that knowledge, and from a quick visual inspection of this dataset, it seems safe to assume that for a forecast a few days, months or even a year out, we can expect continued linear growth over that span.
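This linear-growth assumption maps directly onto Prophet's growth parameter. As a preview of the model configuration we will work toward in Step 2 (a sketch, not a cell from the original notebook):

from fbprophet import Prophet
 
# the observed linear trend translates into Prophet's default linear growth mode;
# a saturating forecast would instead use growth='logistic' plus a capacity ('cap') column
model = Prophet(growth='linear')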

Now let's examine seasonality. If we aggregate the data around the individual months in each year, a distinct yearly seasonal pattern is observed which seems to grow in scale with overall growth in sales:

%sql
 
SELECT 
  TRUNC(date, 'MM') as month,
  SUM(sales) as sales
FROM train
GROUP BY TRUNC(date, 'MM')
ORDER BY month;
[Chart: total unit sales by month, 2013–2017]

Aggregating the data at a weekday level, a pronounced weekly seasonal pattern is observed with a peak on Sunday (weekday 0), a hard drop on Monday (weekday 1) and then a steady pickup over the week heading back to the Sunday high. This pattern seems to be pretty stable across the five years of observations:

UPDATE As part of the Spark 3 move to the Proleptic Gregorian calendar, the 'u' option previously used in DATE_FORMAT(date, 'u') was removed. We now use 'E', which provides similar output.

%sql
 
SELECT
  YEAR(date) as year,
  (
    CASE
      WHEN DATE_FORMAT(date, 'E') = 'Sun' THEN 0
      WHEN DATE_FORMAT(date, 'E') = 'Mon' THEN 1
      WHEN DATE_FORMAT(date, 'E') = 'Tue' THEN 2
      WHEN DATE_FORMAT(date, 'E') = 'Wed' THEN 3
      WHEN DATE_FORMAT(date, 'E') = 'Thu' THEN 4
      WHEN DATE_FORMAT(date, 'E') = 'Fri' THEN 5
      WHEN DATE_FORMAT(date, 'E') = 'Sat' THEN 6
    END
  ) % 7 as weekday,
  AVG(sales) as sales
FROM (
  SELECT 
    date,
    SUM(sales) as sales
  FROM train
  GROUP BY date
 ) x
GROUP BY year, weekday
ORDER BY year, weekday;
[Chart: average daily unit sales by weekday (0 = Sunday through 6 = Saturday) for each year, 2013–2017]
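As an aside, the same 0 (Sunday) through 6 (Saturday) index can be produced more compactly with Spark SQL's built-in DAYOFWEEK function, which returns 1 for Sunday through 7 for Saturday. An equivalent alternative to the query above (not the approach used in this notebook) would be:

%sql
 
SELECT
  YEAR(date) as year,
  DAYOFWEEK(date) - 1 as weekday,  -- DAYOFWEEK: 1 = Sunday .. 7 = Saturday
  AVG(sales) as sales
FROM (
  SELECT 
    date,
    SUM(sales) as sales
  FROM train
  GROUP BY date
 ) x
GROUP BY year, weekday
ORDER BY year, weekday;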

Now that we are oriented to the basic patterns within our data, let's explore how we might build a forecast.

Step 2: Build a Single Forecast

Before attempting to generate forecasts for individual combinations of stores and items, it might be helpful to build a single forecast for no other reason than to orient ourselves to the use of FBProphet.

Our first step is to assemble the historical dataset on which we will train the model:

# query to retrieve the store 1, item 1 sales history at the date (ds) level
sql_statement = '''
  SELECT
    CAST(date as date) as ds,
    sales as y
  FROM train
  WHERE store=1 AND item=1
  ORDER BY ds
  '''
 
# assemble dataset in Pandas dataframe
history_pd = spark.sql(sql_statement).toPandas()
 
# drop any missing records
history_pd = history_pd.dropna()
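Before moving on, a quick look at the assembled dataframe (an optional check, not in the original notebook) confirms it has the two columns Prophet expects, ds and y:

# inspect the assembled history for store 1, item 1
print(history_pd.shape)
history_pd.head()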

Now, we will import the fbprophet library, but because it can be a bit verbose when in use, we will need to fine-tune the logging settings in our environment:
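A minimal sketch of that import and logging setup might look like the following (the specific loggers quieted here are an assumption; py4j and fbprophet are the usual sources of verbose output in this setup):

import logging
 
# silence verbose output from the JVM bridge and from Prophet itself
# (logger names are an assumption; adjust if other libraries prove noisy)
logging.getLogger('py4j').setLevel(logging.ERROR)
logging.getLogger('fbprophet').setLevel(logging.WARNING)
 
from fbprophet import Prophet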