In this solution accelerator, we demonstrate how to use Apache Spark™ and Facebook Prophet™ to build hundreds of time series forecasting models in parallel on the Databricks Lakehouse Platform.
Demand forecasting is an essential practice in most organizations. Our goal is to predict how much of each product we need in specific locations, and at what times, so that we can maximize sales and minimize costs. The most accurate forecasts take into consideration the location-specific patterns associated with each product. But this requires us to produce a large number of location- and product-specific forecasts. Legacy approaches struggle with this because they typically produce these forecasts one at a time, and it’s very difficult to get through all the forecasts that are needed in time to affect our operations. So instead, a lot of organizations aggregate their stores and their products, forecast at the aggregate level, and then allocate the results back down. This simplified process turns around quickly, but it substantially sacrifices accuracy.
With Databricks, we can take a different approach. We can take advantage of the resources available to us in the cloud and distribute this work, so we can get it done very fast. The resources we need are quickly provisioned, and they’re just as quickly released when they are no longer needed. This makes it a very cost-effective way for organizations to tackle their forecasting needs as accurately as possible. The trick is understanding the pattern for implementing forecasts this way. So we’ve put together this Solution Accelerator to help you forecast accurately.
This is our Solution Accelerator for fine-grained forecasting at scale. In this demo, we use a publicly available data set to generate forecasts for 500 store and item combinations. We take time to review the data and understand the basic temporal patterns within it, as we would in any good forecasting exercise. We then get to work on building a forecast. Here you’re seeing how we might build a forecast for one store and one product combination. We’re using Facebook Prophet, but in no way are we saying that’s the only way to do this. The key point to understand here is the basic approach you would use for building a forecast. So if you’re a data scientist taking a look at this, you should quickly understand how we’re approaching this problem using standard open-source libraries and capabilities such as pandas DataFrames. Once you’re comfortable with this and you’ve generated your first forecast, we can examine how to scale it, and it’s very simple.
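As a rough illustration of the single-series step, the sketch below fits Prophet to the history of one store and item combination held in a pandas DataFrame. It is not the accelerator's notebook code: the synthetic data, the helper names `make_history` and `forecast_single_series`, the 90-day horizon, and the seasonality settings are all assumptions made for the example, and it assumes the `prophet` package is installed.

```python
import numpy as np
import pandas as pd


def make_history(days=730, seed=0):
    """Build synthetic daily sales for one hypothetical store-item pair.

    Prophet expects a frame with a date column `ds` and a value column `y`.
    """
    rng = np.random.default_rng(seed)
    ds = pd.date_range("2016-01-01", periods=days, freq="D")
    day_of_week = ds.dayofweek.to_numpy()
    weekly = 5 * np.sin(2 * np.pi * day_of_week / 7)  # weekly cycle
    trend = np.linspace(50, 70, days)                 # slow upward trend
    noise = rng.normal(0, 2, days)
    return pd.DataFrame({"ds": ds, "y": trend + weekly + noise})


def forecast_single_series(history_pd, periods=90):
    """Fit one Prophet model and forecast `periods` days past the history."""
    # Imported inside the function so the data helper above can be used
    # even where Prophet is not installed.
    from prophet import Prophet

    # Assumed settings for daily retail-style data; tune for your own series.
    model = Prophet(
        interval_width=0.95,
        weekly_seasonality=True,
        yearly_seasonality=True,
        daily_seasonality=False,
    )
    model.fit(history_pd)

    # The future frame covers the historical dates plus the forecast horizon.
    future = model.make_future_dataframe(periods=periods, freq="D")
    forecast = model.predict(future)
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]
```

Calling `forecast_single_series(make_history())` returns one row per date with the point forecast (`yhat`) and its uncertainty bounds. Everything here is plain pandas in and plain pandas out, which is what makes the logic easy to reuse when the work is scaled out.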
This is what we do: we take that same logic we saw before and wrap it inside a function (you’re seeing that function definition here). Once it’s encapsulated in a function, we can take advantage of the Databricks platform to read all of our historical data and group that data by each store and item combination. In effect, each store and item combination becomes its own individual partition of data that’s distributed across our cluster. If we want to tackle these 500 store and item combinations using four workers, then the 500 combinations are distributed across the four worker nodes in our cluster. If we want to be more aggressive and tackle it with 10, 20, or 100 workers, this same code automatically distributes the work across those workers, allowing them to do it in parallel.
The actual forecast generation takes place right here. Using the “applyInPandas” method, we simply apply the function we defined before to build a forecast for each store and item combination. The results of all those forecasts are returned in a single result set that we can persist and allow our analysts to scrutinize. The details behind this are captured in a notebook that’s accessible down here at the bottom. Inside it, you will see the detailed code required to implement this work. You will also find links to the data sets so that you can recreate this in your environment and see how it works and performs for you. You will find code samples and explanations of what we’re doing so that you can understand the process better and then translate it to your own needs.
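The wrap-and-distribute pattern described here can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the column names (`ds`, `store`, `item`, `y`), the names `forecast_store_item` and `forecast_all`, the 90-day horizon, and the Prophet settings are all hypothetical. It assumes the input is a Spark DataFrame with one row per date, store, and item, and that `prophet` is installed on the cluster's workers.

```python
import pandas as pd

HORIZON_DAYS = 90  # assumed forecast horizon

# Column order and Spark DDL schema of the frame returned for each group.
RESULT_COLUMNS = ["ds", "store", "item", "yhat", "yhat_lower", "yhat_upper"]
RESULT_SCHEMA = (
    "ds date, store int, item int, "
    "yhat double, yhat_lower double, yhat_upper double"
)


def forecast_store_item(history_pd: pd.DataFrame) -> pd.DataFrame:
    """Fit one model on the rows of a single store-item group.

    Spark passes in a pandas DataFrame holding only that group's rows and
    expects a pandas DataFrame matching RESULT_SCHEMA in return.
    """
    # Imported inside the function so the import happens on the worker.
    from prophet import Prophet

    history_pd = history_pd.assign(ds=pd.to_datetime(history_pd["ds"]))
    history_pd = history_pd.dropna(subset=["y"]).sort_values("ds")

    model = Prophet(weekly_seasonality=True, daily_seasonality=False)
    model.fit(history_pd[["ds", "y"]])

    future = model.make_future_dataframe(periods=HORIZON_DAYS, freq="D")
    forecast = model.predict(future)

    # Re-attach the grouping keys so every output row is identifiable.
    result = forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].assign(
        store=history_pd["store"].iloc[0],
        item=history_pd["item"].iloc[0],
    )
    result["ds"] = result["ds"].dt.date  # plain dates for the `date` column
    return result[RESULT_COLUMNS]


def forecast_all(history_sdf):
    """Group a Spark DataFrame by store and item and forecast each group."""
    return (
        history_sdf
        .groupBy("store", "item")
        .applyInPandas(forecast_store_item, schema=RESULT_SCHEMA)
    )
```

In a notebook where a `spark` session already exists, something like `forecast_all(spark.table("sales_history"))` would return one combined Spark DataFrame of forecasts (the table name is hypothetical). Nothing in the code refers to the number of workers, so the same call runs unchanged on a larger cluster.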
For a lot of organizations that have had to compromise on their forecasting, this will be a huge time-saver and an asset to their business. We have organizations today that are using Databricks to scale out to hundreds of thousands, or even millions, of product- and location-specific forecasts on a daily basis. So we encourage you to give it a try and see how it can impact your business.