As Atlassian continues to scale to more and more customers, the demand for our legendary support continues to grow. Atlassian needs to maintain a balance between the staffing levels needed to service this increasing support ticket volume and the budgetary constraints needed to keep the business healthy. Automated ticket volume forecasting is at the centre of this delicate balance. In this talk, Perry will:
Hi, my name’s Perry and I’m a Senior Software Engineer at Atlassian. Before I jump into the content, I thought I’d give you a little bit of background about why I’m presenting on this topic today. The main reason I’m speaking today is because I’m an empowered end user and I want to help empower others. I built this pipeline using the Databricks platform, which has everything I needed to build all of this without any help from engineering. There are no sneaky backend hacks here. It’s all Databricks notebooks, Delta Lake and MLflow.
The second reason I’m speaking is because I think this combination of tools offers some really interesting opportunities to make a big impact on reproducibility. I don’t have all of the answers, but I’d like to present one opinion on how notebooks, Delta Lake, and MLflow can help improve the reproducibility of your pipelines.
And finally, I just love working with Databricks, and I want to share some of the reasons why I think it’s such a great platform for data science. Everything I’m going to show you can be done without Databricks. You can do it using open-source tools, but it’s always nice to use great tools.
So a little bit of a caveat before we start. This is a real project, and that means it’s a bit messy, but hopefully you get some useful ideas and inspiration.
And typically we would pause here for a moment to talk about how this project fits into our business and what we’re trying to achieve, but I only have 30 minutes, so that’s not happening. All you need to know is that we’re trying to forecast support ticket volumes across a bunch of different groupings.
What you do need to know is that forecasting is a special case for machine learning. Forecasting models are short lived, they’re ephemeral, and it’s the pipeline that matters because you need to retrain every time you make a prediction. If you haven’t worked with a time series forecasting model before, that statement might not make sense. So let’s take a quick look at how time series forecasting works with Prophet.
When you use a time series forecasting model like Prophet, what you’re normally doing is taking historical values of a time-varying signal and fitting a model that explains that variation over history. And then you’re crossing your fingers and hoping that it tells you something interesting about the future. Here, you can see the historical data points plotted on the chart as dots, and the shaded range represents the fitted model, the model that Prophet has fitted. I’ve also plotted the center of the forecast range here, which is effectively the median forecast, the median of that forecasting range. We can use a plot like this to assess the model fit and make a determination about how well it generalizes into the future. You can see that I’ve used a holdout set to evaluate the model fit on the future data. But how useful is this model? We’ve got a high quality model that can forecast data that we’ve already observed. It can forecast backwards really well, but how well can it predict the future? And by that, I mean, how well can it predict the real future, which is beyond the holdout set?
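The talk builds this in R and does not show the model code. As a rough Python sketch of the same idea, assuming the open-source `prophet` package and made-up monthly ticket counts, a holdout split followed by a Prophet fit could look like this. The function names and the data are hypothetical.

```python
from datetime import date


def split_holdout(history, holdout_periods):
    """Split (ds, y) rows into a training part and a trailing holdout part."""
    if not 0 < holdout_periods < len(history):
        raise ValueError("holdout_periods must be between 1 and len(history) - 1")
    return history[:-holdout_periods], history[-holdout_periods:]


def fit_and_forecast(train, periods):
    """Fit Prophet on the training rows and forecast `periods` months ahead."""
    # Imported lazily so split_holdout works without pandas/prophet installed.
    import pandas as pd
    from prophet import Prophet

    df = pd.DataFrame(train, columns=["ds", "y"])  # Prophet expects ds and y
    model = Prophet(interval_width=0.8)  # 80% uncertainty interval
    model.fit(df)
    future = model.make_future_dataframe(periods=periods, freq="MS")
    forecast = model.predict(future)
    # yhat is the central forecast; yhat_lower/yhat_upper bound the shaded range.
    return forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]]


# Hypothetical monthly ticket counts, for illustration only.
history = [(date(2019, m, 1), 1000 + 25 * m) for m in range(1, 13)]
train, holdout = split_holdout(history, holdout_periods=3)
print(len(train), len(holdout))
```

Calling `fit_and_forecast(train, periods=len(holdout))` would then give forecasts that can be compared against the held-out months.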
And it turns out that the answer is actually not very well. Time series forecasts perform best when they aren’t forecasting the distant future. The forecast variation over the next few periods is much tighter than the forecast variation in the distant future. If we were to treat forecasting like traditional machine learning, with a training set and a holdout set, and then promote the model to production, we’d have just caused two problems.
Firstly, we’re gonna be using the part of the forecast where the variation starts to go wide. We’re gonna have much less certainty in the forecast once we start looking six months ahead, but we’ve also set a trap for ourselves because we’d be evaluating the forecast accuracy on this holdout set, which has much less variance than the section of the forecast that we’re actually gonna be using to support our business decisions. So we’d be lying to ourselves about the forecast accuracy and overestimating our confidence about the future.
And it gets even worse when we present the forecast like this, because we know exactly what’s gonna happen. No matter how much we emphasize that the forecast includes the whole shaded area, we just know that the downstream consumers are gonna look at that median forecast line and treat it as the expected value. And whilst that’s always a bad thing to do, even when the forecast variation is small, it’s even worse when the forecast variation is large.
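To put rough numbers on how uncertainty widens with the horizon, here is a small sketch that assumes the series behaves like a simple random walk with normally distributed steps. That is a stand-in chosen for illustration, not how Prophet computes its intervals. Under that assumption the forecast error h periods ahead has standard deviation sigma times the square root of h.

```python
from math import sqrt
from statistics import NormalDist


def interval_half_width(sigma, horizon, coverage=0.8):
    """Half-width of the forecast interval `horizon` steps ahead of a random walk."""
    z = NormalDist().inv_cdf(0.5 + coverage / 2)  # two-sided normal quantile
    return z * sigma * sqrt(horizon)


# sigma=100 is a hypothetical month-to-month noise level in tickets.
for h in (1, 3, 6, 12):
    print(h, round(interval_half_width(100.0, h), 1))
```

In this toy model the interval twelve months out is about 3.5 times as wide as the interval one month out, so accuracy measured on a near-term holdout set says little about the months that matter for planning.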
So hopefully I’ve convinced you that forecasting is a little bit different to typical machine learning. And so our development process also needs to be a bit different. In order to build a forecasting system that’s actually useful for the business, we’re gonna need to do two things.
The first is that instead of thinking of everything in terms of models, we need to start thinking in terms of pipelines. The objective here is to build a pipeline that we can depend on. It needs to be end-to-end covering everything from data preparation, to generating and publishing the forecasts, but it also needs to be easy to run. We need to be able to iterate on this pipeline in the way that a data scientist would normally iterate on a model.
The second thing is that we need to be able to perform testing on the whole pipeline in the same way that we normally perform testing on individual models. And that means that we need to be able to run the pipeline and generate a forecast for any day in the past so that we can evaluate the performance of the whole pipeline. And by performance, I don’t mean minimizing some objective function, I mean how well the pipeline meets the needs of the business. And a lot of that evaluation has to be done manually. I’ll touch on this later in the presentation, but we need to make sure we’re exposing enough information for the users and maintainers to debug the model when things look strange.
So I need to make sure I support people with different learning styles as well. So for those who prefer learning via memes, this is a short summary of what I just said. The whole pipeline needs to be part of the production workflow, not just the model scoring step.
So from the top down, the process looks a bit like this. We do some one-off modeling data set creation tasks, but then we schedule the updates to the modeling data set, we schedule the model training and we schedule the scoring. The whole pipeline is scheduled. The model isn’t at the center of the process anymore. The pipeline is at the center of the process. And that finally brings me around to the agenda for this talk, where we’re gonna dig into the technologies that make this work.
We’ll start by talking about Delta Lake, which is the tool that we’re using for the modeling data set and also for the model outputs.
Then we’ll take a look at MLflow tracking.
This is where we’re managing the fleet of models that we need to generate the forecast.
Then we’ll look at notebook workflows where we’ve taken an imperfect process and made it robust and reproducible using notebooks. And finally, we’ll review the pipeline to see how well it’s achieved its objectives.
So let’s jump straight into Delta Lake.
Going back to that high level view, we’re using Delta Lake directly in step one and two.
For step one, we’re doing a one off creation of a modeling data set, which is all just boring SQL. So what I’m showing you here is the important part of the syntax using Delta. This is how you create a Delta table in Databricks, and it’s super easy.
But this is where it gets really cool.
Every month we run a scheduled job to update the modeling data set, but that update doesn’t just overwrite the table, it merges changes and retains history using Delta Lake.
So let’s take a bit of a detour and talk about this Delta Lake thing.
As always, the first step when you’re trying to figure out why you’re supposed to care about a new technology is to go to the website and see how it describes itself. Now, I did that for Delta Lake and I saw some things that look really important to data engineers, but I just didn’t care about them at all. From a data science point of view, this is all fake news.
From our point of view, Delta Lake gives you versioned tables for reproducible data science. I’m gonna repeat that again for emphasis, because it’s really important. For data science, you care about Delta Lake because it gives you versioned tables for reproducible data science.
Just a quick example of how versions work, check this out. I’m using the same count star query, the same table, but I’m selecting different versions of that. And we can see here that the two versions of the table have different row counts.
I think that’s pretty cool.
So how do you make changes to a table so that we can make use of these versions? It’s pretty easy and it looks a little bit like a join, except that you use this MERGE INTO syntax and the join conditions go at the top of your query before the update logic. The update logic is pretty simple in most cases. Here, I’m just updating any rows that match and inserting any rows that don’t. It’s really easy to remove non-matching rows as well, but my data set is never gonna need non-matching rows removed, so I haven’t shown that here. In this example, I’ve also created a temporary view with the latest data and I’ve used the query here to update the table on disk so that it matches that temporary view.
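The slide itself is not part of the transcript, so the following is only a sketch of the upsert pattern described above. It builds a Delta Lake `MERGE INTO` statement as a string. The table, view and key column names are hypothetical.

```python
def build_merge_sql(target, source, keys):
    """Build a Delta Lake upsert: update matching rows, insert the rest."""
    if not keys:
        raise ValueError("at least one key column is required")
    # The join condition goes at the top, before the update logic.
    on_clause = " AND ".join(f"t.{key} = s.{key}" for key in keys)
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON {on_clause}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


# Hypothetical names: a Delta table and a temporary view holding the latest data.
query = build_merge_sql(
    "support.ticket_volumes", "latest_ticket_volumes", ["ticket_date", "queue"]
)
print(query)
# In a Databricks notebook this would be executed with: spark.sql(query)
```

Each successful merge is recorded as a new version of the table, which is what makes the earlier versions recoverable afterwards.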
And of course you can query the history directly to find out what versions of the table exist for your table, and you can also use timestamps as well instead of version numbers. So if you want to just skip straight to answering, how did this table look on this date? You can do that as well, which I think is pretty cool.
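For reference, the three kinds of query mentioned so far (a row count at a given version, the same count at a given timestamp, and the table history) can be sketched as follows. The statements use Delta Lake's time travel syntax; the table name, version number and timestamp are made up for illustration.

```python
def count_at_version(table, version):
    """Row count of the table as it looked at a specific Delta version."""
    return f"SELECT COUNT(*) FROM {table} VERSION AS OF {int(version)}"


def count_at_timestamp(table, timestamp):
    """Row count of the table as it looked at a point in time."""
    return f"SELECT COUNT(*) FROM {table} TIMESTAMP AS OF '{timestamp}'"


def describe_history(table):
    """List the versions of the table and the operation that created each."""
    return f"DESCRIBE HISTORY {table}"


table = "support.ticket_volumes"  # hypothetical table name
queries = [
    count_at_version(table, 3),
    count_at_timestamp(table, "2020-05-31"),
    describe_history(table),
]
for q in queries:
    print(q)  # in a Databricks notebook: spark.sql(q).show()
```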
So now we come back to the forecasting pipeline. So with Delta, we create a new version of the table every time the pipeline runs. The latest version, which is also the default version if you query the table without a version number, is always the most up-to-date and the most accurate version of the table.
We can recover any previous version of the training data set. And I think that’s a massive win for reproducibility because I can easily recover the data used to train any of my models. And it only took a few magic words and a few SQL queries.
I think that’s pretty cool.
So it was pretty easy. We’re gonna smash through this agenda in no time at all.
Now, we’re gonna take a look at the model training step.
I don’t have time to go into the code where I’m training the model, but it’s pretty straightforward. Like most machine learning these days, it’s really just a function call. And we covered the basics at the start of this presentation. So I’m using Facebook’s Prophet forecasting package. I’m using R, and I’m using SparkR as well. You could just as easily do this in Python.
The pipeline trains a model for every ticket queue and stores the model binary plus the model metadata in MLflow.
The pipeline also takes an argument which represents the forecast date, which we’ll use for backfilling. And we also provide a default value, which automatically selects the last day of the previous month.
I’m never ashamed to reuse a meme. This time, we’ll take a short detour to have a good look at how MLflow works.
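A minimal Python sketch of that default, assuming the argument arrives either as the string `auto` (the value mentioned later in the talk) or as an ISO date string. The function name is hypothetical.

```python
from datetime import date, timedelta


def resolve_forecast_date(arg="auto", today=None):
    """Turn the pipeline argument into a concrete forecast date."""
    if arg != "auto":
        return date.fromisoformat(arg)  # explicit date, used for backfilling
    today = today or date.today()
    # Step back one day from the first of this month: last day of previous month.
    return today.replace(day=1) - timedelta(days=1)


print(resolve_forecast_date("auto", today=date(2020, 3, 15)))
print(resolve_forecast_date("2019-12-31"))
```

Passing `today` explicitly keeps the function deterministic, which makes it easy to check edge cases such as a run in January.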
And just like with Delta, let’s take a look at the website for MLflow and see how it describes itself.
In this case, it’s a bit more targeted at the data science audience. And it’s clear that there are four main things that are going on here. I don’t have time to dig into all of these except to say that I really liked the fact that MLflow lets you just use the modules and components that you need. In my case, I got a heap of value from the tracking component. So that’s the main part that I used. And that’s what I’m gonna dig into today.
Now, if you’ll allow me a little bit of narrative license, I’m gonna magically skip over a bunch of stuff because this is the pot of gold at the end of the rainbow and it should help make the rest of the section make a bit more sense. This is the objective. This is what I wanted to get to at the end: I’m using MLflow as a database for models.
So this is the primary key. Every forecast that I generate has a region, a date, which is just the latest day included in the training data, a support ticket grouping, and a platform. And this is obviously unique to our business and unique to the problem we’re trying to solve, but the concept should make sense.
I’ve also added this batch number here. Now it’s a bit of a hack, it’s sort of part model versioning, part state management. Effectively, what I’m doing is that the scoring notebook always uses the latest batch number, which lets me do backfill as you’ll see later. But it also means that if I’m running the training notebook followed by the scoring notebook, I don’t need to pass any data between those two notebooks. The scoring one will just always take the latest batch number.
And then this is really cool. For every model, I’m able to track the table and the Delta version, which means I can recover the exact training data set used for every single model that’s stored in MLflow. This is a massive win for reproducibility, but even more useful is the fact that it makes it possible to debug when something goes wrong and a new forecast looks odd.
We’re also keeping the model itself in MLflow. I did this just using native R. I just dumped it out as a binary file in RDS format and then added it manually as an artifact. MLflow has some really cool tooling around this, for Python especially, which would have made this a lot easier. I wouldn’t have even had to write it out to a file, but given that I was in R, it was just as easy to roll my own for this. Obviously there’s a heap more I could do here. In particular, I should probably generate a bunch of diagnostic plots and store them here as well, but I never got around to it. And just to prove that I did intend to, here is a genuine extract from my task list, where you can see I really did plan to generate and store model diagnostic plots, and maybe one day I’ll get around to doing it. But it’s super easy. You just dump the plots out as PNG files and add them as artifacts. It’s really easy to store objects in these records.
Now, if you haven’t seen how MLflow logging works, it’s pretty easy. All of the examples online are in Python. So I’m gonna stick with R here and I’ll show you something a little bit different. Essentially, within your training loop, you need to tell MLflow when you’re starting your run, and then again when you’ve finished your run, and then in between these two API calls, you can send parameters, you can send metrics, which I haven’t shown here, and artifacts to MLflow, and it’ll add those to the database like I showed you before. It’s pretty easy to do here in R, it’s even easier in Python because you can do it inside a context manager. One thing that might not be obvious is understanding where your experiment lives. You can see here, I’ve used a hard coded experiment ID, and I think it’s worth diving in to find out where that came from. So in this case, I’ve attached the experiment directly to my notebook. Every notebook in Databricks has an MLflow experiment attached. And you can just click Runs at the top right of your notebook. And then you can click on the link shown here to jump into the UI for that MLflow experiment.
And that brings you to the MLflow tracking page for your notebook. The experiment ID is right there at the top. And if you log to MLflow without specifying an experiment ID, then this is where all of your runs are gonna end up. So in my example, I specified the ID for flexibility so that I could move the experiment to a standalone MLflow experiment in future if I wanted to, but I ended up just using the default one for my notebook.
This is what my actual experiment looks like. So there are many hundreds of MLflow runs in here, probably well into the thousands by now. And you can inspect them interactively if you ever need to. And I really liked this feature ’cause instead of hiding things in an opaque database or behind an API, this user interface helps you dig in and figure out how everything works. It helps with debugging, it helps with learning the system.
And because it’s attached to my notebook, I can also see them here in the right rail. Although it’s not particularly useful at this scale with thousands. This view is much more useful in traditional machine learning workflows, when you’re iterating on a model rather than iterating on the whole pipeline and generating hundreds of models per month.
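Since the talk shows this in R, here is a hedged sketch of the Python equivalent using the context manager form mentioned above. The parameter names mirror the fields described earlier (region, date, ticket grouping, platform, batch, table and Delta version) but are assumptions, as are the example values and the file path.

```python
def build_run_params(region, forecast_date, ticket_group, platform,
                     batch, table, delta_version):
    """Collect the identifying fields for one model as MLflow parameters."""
    params = {
        "region": region,
        "forecast_date": forecast_date,
        "ticket_group": ticket_group,
        "platform": platform,
        "batch": batch,                  # lets scoring pick up the latest batch
        "table": table,                  # training table name
        "delta_version": delta_version,  # exact version of the training data
    }
    return {key: str(value) for key, value in params.items()}


def log_model_run(params, model_path, experiment_id=None):
    """Record one trained model as an MLflow run with its binary attached."""
    import mlflow  # imported lazily so build_run_params works without MLflow

    # The context manager starts the run and ends it when the block exits.
    with mlflow.start_run(experiment_id=experiment_id):
        mlflow.log_params(params)
        mlflow.log_artifact(model_path)


# Hypothetical values, for illustration only.
params = build_run_params("APAC", "2020-05-31", "billing", "cloud",
                          batch=7, table="support.ticket_volumes",
                          delta_version=12)
print(params)
```

A call such as `log_model_run(params, "model.rds")` would then create the run. Leaving `experiment_id` as `None` lets MLflow fall back to its default experiment.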
So now we can take another look at this overview, but with a few more layers added.
You can see how I’m using Delta to transfer state between the first three steps and using MLflow to transfer state from the model training to the model scoring step, using that batch number that I showed you before. I’m also using Delta to publish all of the results back into our Delta Lake. Now, this isn’t bulletproof. Obviously I’m opening myself up to certain types of bugs, mainly due to potential race conditions. But I plugged most of those gaps by locking down permissions and preventing other people from writing to my tables or writing records to my MLflow experiment. It’s not perfect, but it’s been pretty solid since I put it together six months ago. It’s good enough for the purpose for which it was built.
So now we’re gonna look at that final scoring step in more detail, where the pipeline automatically uses the models to generate a forecast and publish it as a table to our Delta Lake.
So firstly, the pipeline reads from MLflow to recover all of the models in the latest batch. We need to do two passes here: one to recover all of the runs in the experiment, then we filter for the latest batch, and we do another pass to retrieve each of the parameters for that model. Then we use these models and the parameters to prepare an execution plan, which is really just a table and a for loop. Then we loop over all of the models, generate a forward-looking forecast and do a few aggregations to prepare them for consumption.
Next, we append the results to the existing results table in our Delta Lake, or we overwrite the existing results if we’re rerunning the pipeline. So if we’re rerunning it for the same period of time, we’ll overwrite those results instead of inserting them. We use Delta for this step as well, not because we need the history, but because it gives us a really easy ability to overwrite the existing data using that MERGE syntax. We do have the history as well, if that’s useful for debugging.
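The filtering step can be sketched in plain Python, assuming each run has already been reduced to a dictionary of its logged parameters and that the batch number was logged under a parameter called `batch` (a hypothetical name). In the Python client, `mlflow.search_runs` returns run parameters as columns of a pandas DataFrame, so the real code would differ in shape.

```python
def latest_batch_runs(runs):
    """Keep only the runs that belong to the highest batch number."""
    if not runs:
        return []
    # MLflow stores parameters as strings, so compare batch numbers as ints.
    latest = max(int(run["batch"]) for run in runs)
    return [run for run in runs if int(run["batch"]) == latest]


# Hypothetical run parameters, for illustration only.
runs = [
    {"run_id": "a1", "batch": "9", "ticket_group": "billing"},
    {"run_id": "b2", "batch": "10", "ticket_group": "billing"},
    {"run_id": "c3", "batch": "10", "ticket_group": "technical"},
]
plan = latest_batch_runs(runs)
for run in plan:  # the "execution plan" is just a table and a for loop
    print(run["run_id"], run["ticket_group"])
```

Comparing as integers matters here: as strings, batch "9" would sort after batch "10".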
And because each of these models has an MLflow URL, we can stick that in the table as well, and I’ll come back to why we do this shortly.
Now, reading from MLflow is really easy. It’s just as you’d expect, it’s quite similar to how we wrote into MLflow in the first place. I’m not gonna dwell on this, it’s all in the documentation, but it’s very similar to writing into MLflow. What does matter though, is the output. I use Delta Lake again here by uploading my results from R into a temporary table in Spark. And then I update the output table so that it reflects the forecast from the latest model run. I’ve broken this out into two steps here just to be absolutely certain that if the scoring job gets rerun at any point, it always completely wipes out the scores for the latest forecasting date, and then it writes them again. In fact, the whole pipeline is set up like this. Each individual step can be run or rerun as many times as I like, and it won’t ever break anything. Again, this is just to make debugging easier.
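The wipe-then-write pattern is what makes the step safe to rerun. The sketch below shows the idea with Python's built-in `sqlite3` standing in for the Delta table, purely so that it runs anywhere. The table and column names are hypothetical, and wrapping both statements in one transaction is a choice made in this sketch rather than something described in the talk.

```python
import sqlite3


def publish_forecast(conn, forecast_date, rows):
    """Replace all scores for one forecast date, so reruns never duplicate."""
    with conn:  # one transaction: both statements commit together or not at all
        conn.execute(
            "DELETE FROM forecast WHERE forecast_date = ?", (forecast_date,)
        )
        conn.executemany(
            "INSERT INTO forecast (forecast_date, queue, yhat) VALUES (?, ?, ?)",
            [(forecast_date, queue, yhat) for queue, yhat in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE forecast (forecast_date TEXT, queue TEXT, yhat REAL)")

rows = [("billing", 120.0), ("technical", 340.0)]  # made-up forecast values
publish_forecast(conn, "2020-05-31", rows)
publish_forecast(conn, "2020-05-31", rows)  # rerunning the same step

count = conn.execute("SELECT COUNT(*) FROM forecast").fetchone()[0]
print(count)
```

After two identical runs the table still holds one row per queue for that forecast date.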
So let’s talk about how all of these runs are scheduled.
I’m gonna make sure I get my value for money from this meme, but this time, it’s not really a detour. We’re gonna take a look at how notebook workflows are specifically being used in this pipeline.
This is where it all starts. This is a single scheduled job which runs once per month.
That notebook uses the API to run each of the other notebooks in order. So I’m basically turning four notebooks into a single-threaded process. And this helps to avoid race conditions. Although speaking of race conditions, there was actually a bug in the MLflow R package, which was causing some of my jobs to fail occasionally. So you can see here, I’ve wrapped all of my jobs in this run_with_retry function, which I’ll cover in a minute, but it automatically retries the job in the event of a failure. Now it’s still single-threaded, which means that it retries the job until it succeeds, and then it moves on to the next job. And where it says Success at the bottom, that’s because you can actually return a success string from a notebook.
So you can see here that all of my notebooks are returning this string Success if they make it all the way to the end. If they don’t, they actually return a failure code, which is pretty useful. And you can see here how that run_with_retry function is defined in Python. It catches those errors as they get returned from the failing notebook and then has retry logic built in. And if it passes success, it doesn’t retry that notebook. If you wanna copy it down, you can just pause this video and copy the function down. I think I modified it from something I found on Google anyway, so I can’t claim any credit for this, but it’s definitely made huge improvements to the overall reliability of the pipeline. We haven’t seen a single pipeline failure, like an overall pipeline failure, since we started using this automatic retry. And even though individual tasks have failed occasionally due to that MLflow bug, the whole pipeline always succeeds.
And you can see here how it works in practice. That failed run has preserved the internal errors for debugging after the fact. I can even click into the notebook and see exactly where it failed and see the error messages in context. But the second job was automatically spawned and it completed successfully without intervention.
And regardless of whether it succeeds or fails, all of the output from all of the runs is preserved, which means I can use print debugging all over the place in my notebooks and then leave it there in production because Databricks persists all of those logs. They’re all kept alongside their context. I can see things like printed tables to make sure that the table had the right data in it when that ran.
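The function on the slide is not reproduced in the transcript, so the following is a reconstruction of the general pattern rather than the author's exact code. To keep it runnable outside Databricks, the notebook runner is passed in as a parameter; in a Databricks notebook that would be `dbutils.notebook.run`, and each child notebook would end with `dbutils.notebook.exit("Success")`. The notebook names, timeout and retry count are hypothetical.

```python
def run_with_retry(run, notebook, timeout_seconds, args=None, max_retries=3):
    """Run one notebook, retrying on failure up to max_retries times."""
    attempts = 0
    while True:
        try:
            return run(notebook, timeout_seconds, args or {})
        except Exception as error:
            if attempts >= max_retries:
                raise  # give up and surface the last error
            attempts += 1
            print(f"Retrying {notebook} after error: {error}")


def run_pipeline(run, notebooks, args):
    """Run the notebooks strictly one after another (single-threaded)."""
    return [run_with_retry(run, nb, 3600, args) for nb in notebooks]


# A fake runner that fails once, standing in for dbutils.notebook.run.
calls = {"count": 0}

def flaky_run(notebook, timeout_seconds, args):
    calls["count"] += 1
    if calls["count"] == 1:
        raise RuntimeError("transient failure")
    return "Success"

results = run_pipeline(
    flaky_run,
    ["1_update_dataset", "2_train_models", "3_score_models"],
    {"forecast_date": "auto"},
)
print(results, calls["count"])
```

Because each notebook only starts after the previous one has returned, a retry never overlaps with a later step.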
The other thing you might’ve noticed is that I’m using widgets as notebook arguments. This lets me pass in either an auto string, which my notebook resolves to be the last day of the month or a specific forecasting date, which allows for easy backfilling.
This is how it works when it’s scheduled to run every month, with the argument auto passed in, but we can also run really big backfill jobs just by passing in specific days in the past where we want to run a forecast for the data up until those days. In essence, we can produce the forecasts that the pipeline would have produced based on what it knew at the time, which helps us to do offline analysis to determine whether the pipeline is generating useful forecasts.
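A backfill then amounts to generating the list of past forecast dates and running the pipeline once per date. Here is a small sketch that assumes month-end forecast dates and a notebook argument named `forecast_date` (a hypothetical name).

```python
from datetime import date, timedelta


def month_ends(start, end):
    """Yield the last day of every month that ends within [start, end]."""
    current = date(start.year, start.month, 1)
    while True:
        # First day of the following month, rolling over the year in December.
        following = date(
            current.year + (current.month == 12), current.month % 12 + 1, 1
        )
        last_day = following - timedelta(days=1)
        if last_day > end:
            break
        if last_day >= start:
            yield last_day
        current = following


# One set of notebook arguments per backfill run.
backfill_args = [
    {"forecast_date": d.isoformat()}
    for d in month_ends(date(2019, 11, 1), date(2020, 2, 29))
]
for args in backfill_args:
    print(args)
```

Each dictionary would be passed to the pipeline notebook in place of the default `auto` value, which a notebook can read through a widget, for example with `dbutils.widgets.get("forecast_date")`.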
So let’s take a quick review and make sure that we’ve achieved all of the objectives from this pipeline.
We’re gonna take a look at this from the point of view of someone who’s just starting in the team and has been tasked with taking ownership of the pipeline. And I say that because it’s not really a hypothetical. I have moved into a new role since I built this and someone else did have to take over my pipeline. So this is a great test for reproducibility. And it’s an excellent lens, I think. When you’re looking at reproducibility, think of it through the lens of someone who’s just starting in your team and has to take it over. So this new team member joins the team and they’re asked to figure out what Perry did with that forecasting task. The new team member can see from the scheduler that the pipeline is calling three notebooks in order, and they can look inside those notebooks and see that we’re using Delta tables to pass information between the first two, and then we’re using MLflow to pass information between the training and the scoring step.
And this is the first page they see when they click into the task at the top or the runs at the bottom. It guides the reader towards what they need to know. As far as pipeline handovers go, I think this is pretty good. It’s one of the great advantages of using a platform tool like Databricks or MLflow. They make it so much easier to discover work across your business and make it much easier to decouple the processes from the people that create them.
From a user perspective, it’s really easy to discover the pipeline as well. So not the maintainer, but the user who turns up and discovers your table, because every row in the forecast table has a URL, which guides you towards the MLflow record that generated that row. The MLflow record gives you a link to the notebook in the pipeline, as well as to the model binary, the table name and the table version, which in turn gives you access to the versioned data set that was used to train the model. So in the end, it’s not just reproducible. It’s also inherently discoverable, which is a real asset for downstream consumers. And especially for those downstream undeclared consumers who stumble across your tables and start using them for some business critical process without ever speaking to you. They can discover how to find you, they can discover how to find the pipeline, they can discover how to debug for themselves, even if they’ve never told you that they’re a consumer of your table.
And you can see here what that looks like. If someone was to stumble across this table, it’s immediately clear where to go if you have questions about where the data came from.
So once again, I’m never ashamed to reuse a meme, and I think this pipeline has been a great success. We’ve been running it for six months now without any human intervention, and it’s producing consistently high quality stable forecasts every month. If we hear anything that suggests that the pipeline is starting to drift in future, it’s a fairly painless process to extend our usage of MLflow, to start tracking performance metrics or adding debugging plots or whatever we need. And I think that’s the key takeaway from this talk. If every project looks like a full blown engineering project, it’s hard for businesses to take a risk on new initiatives, but by empowering data scientists to explore and build and deploy their own complex solutions using a managed platform, I was able to quickly build a forecasting pipeline that was good enough for the needs of the business without engaging any external engineering help.
And that’s it for me today. Thank you for listening to my talk. I’m looking forward to catching up with all of you throughout the rest of the summit.
Perry is a Senior Data Scientist with Atlassian, using the latest machine learning algorithms to help Atlassian's Support teams scale effectively whilst maintaining the legendary service Atlassian is known for. Prior to joining the Atlassian team, Perry was building artificial intelligence solutions to help improve the financial wellbeing of over 10 million customers of the Commonwealth Bank of Australia. Perry is also a lecturer at the University of Technology Sydney, teaching the Data Science Programming course as part of the Master of Data Science and Innovation program.