CLV Part 1: Customer Lifetimes (Python)

Calculating the Probability of Future Customer Engagement

In non-subscription retail models, customers come and go with no long-term commitments, making it very difficult to determine whether a customer will return in the future. Determining the probability that a customer will re-engage is critical to the design of effective marketing campaigns. Different messaging and promotions may be required to incentivize customers who have likely dropped out to return to our stores. Engaged customers may be more responsive to marketing that encourages them to expand the breadth and scale of purchases with us. Understanding where our customers land with regard to the probability of future engagement is critical to tailoring our marketing efforts to them.

The Buy 'til You Die (BTYD) models popularized by Peter Fader and others leverage two basic customer metrics, i.e. the recency of a customer's last engagement and the frequency of repeat transactions over a customer's lifetime, to derive a probability of future re-engagement. This is done by fitting customer history to curves describing the distribution of purchase frequencies and engagement drop-off following a prior purchase. The math behind these models is fairly complex but thankfully it's been encapsulated in the lifetimes library, making it much easier for traditional enterprises to employ. The purpose of this notebook is to examine how these models may be applied to customer transaction history and how they may be deployed for integration in marketing processes.

Step 1: Setup the Environment

To run this notebook, you need to attach it to a cluster running the Databricks ML Runtime, version 6.5 or higher. This version of the Databricks runtime provides access to many of the pre-configured libraries used here. Still, there are additional Python libraries which you will need to install and attach to your cluster. These are:

  • xlrd
  • lifetimes==0.10.1
  • nbconvert

To install these libraries in your Databricks workspace, follow the standard workspace library installation steps, using the PyPI library source in combination with the library names listed above. Once installed, please be sure to attach these libraries to the cluster with which you are running this notebook.
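If you prefer notebook-scoped libraries to cluster-attached ones, the same packages can also be installed from within the notebook itself. This is a minimal sketch of that alternative; the cluster-attached approach described above works just as well:

# notebook-scoped alternative to attaching the libraries to the cluster
dbutils.library.installPyPI('xlrd')
dbutils.library.installPyPI('lifetimes', version='0.10.1')
dbutils.library.installPyPI('nbconvert')

# restart the Python interpreter so the newly installed libraries are picked up
dbutils.library.restartPython()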

With the libraries installed, let's load a sample dataset with which we can examine the BTYD models. The dataset we will use is the Online Retail Data Set available from the UCI Machine Learning Repository. This dataset is made available as a Microsoft Excel workbook (XLSX). Having downloaded this XLSX file to our local system, we can load it into our Databricks environment by following the steps provided here. Please note when performing the file import, you don't need to select the Create Table with UI or the Create Table in Notebook options to complete the import process. Also, the name of the XLSX file will be modified upon import as it includes an unsupported space character. As a result, we will need to programmatically locate the new name for the file assigned by the import process.

Assuming we've uploaded the XLSX to the /FileStore/tables/online_retail/, we can access it as follows:

import pandas as pd
import numpy as np

# identify name of xlsx file (which will change when uploaded)
xlsx_filename = dbutils.fs.ls('file:///dbfs/FileStore/tables/online_retail')[0][0]

# schema of the excel spreadsheet data range
orders_schema = {
  'InvoiceNo':np.str,
  'StockCode':np.str,
  'Description':np.str,
  'Quantity':np.int64,
  'InvoiceDate':np.datetime64,
  'UnitPrice':np.float64,
  'CustomerID':np.str,
  'Country':np.str  
  }

# read spreadsheet to pandas dataframe
# the xlrd library must be installed for this step to work 
orders_pd = pd.read_excel(
  xlsx_filename, 
  sheet_name='Online Retail',
  header=0, # first row is header
  dtype=orders_schema
  )

# display first few rows from the dataset
orders_pd.head(10)

The data in the workbook are organized as a range in the Online Retail spreadsheet. Each record represents a line item in a sales transaction. The fields included in the dataset are:

Field | Description
InvoiceNo | A 6-digit integral number uniquely assigned to each transaction
StockCode | A 5-digit integral number uniquely assigned to each distinct product
Description | The product (item) name
Quantity | The quantity of each product (item) per transaction
InvoiceDate | The invoice date and time in mm/dd/yy hh:mm format
UnitPrice | The per-unit product price in pounds sterling (£)
CustomerID | A 5-digit integral number uniquely assigned to each customer
Country | The name of the country where each customer resides

Of these fields, the ones of particular interest for our work are InvoiceNo which identifies the transaction, InvoiceDate which identifies the date of that transaction, and CustomerID which uniquely identifies the customer across multiple transactions. (In a separate notebook, we will examine the monetary value of the transactions through the UnitPrice and Quantity fields.)

Step 2: Explore the Dataset

To enable the exploration of the data using SQL statements, let's convert the pandas DataFrame into a Spark DataFrame and persist it as a temporary view:

# convert pandas DF to Spark DF
orders = spark.createDataFrame(orders_pd)

# present Spark DF as queriable view
orders.createOrReplaceTempView('orders') 

Examining the transaction activity in our dataset, we can see the first transaction occurs on December 1, 2010 and the last on December 9, 2011, making this a dataset that spans a little more than one year. The daily transaction count shows there is quite a bit of volatility in daily activity for this online retailer:

%sql -- unique transactions by date

SELECT 
  TO_DATE(InvoiceDate) as InvoiceDate,
  COUNT(DISTINCT InvoiceNo) as Transactions
FROM orders
GROUP BY TO_DATE(InvoiceDate)
ORDER BY InvoiceDate;
[Chart: count of unique transactions by InvoiceDate]
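As a quick sanity check on those start and end dates, a simple min/max query against the orders view (a minimal sketch) confirms the bounds of the dataset:

# first and last invoice dates in the dataset
display(
  spark.sql('''
    SELECT
      MIN(TO_DATE(InvoiceDate)) as first_transaction_date,
      MAX(TO_DATE(InvoiceDate)) as last_transaction_date
    FROM orders
    ''')
  )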

We can smooth this out a bit by summarizing activity by month. It's important to keep in mind that December 2011 includes only 9 days of data, so the sales decline graphed for the last month should most likely be ignored:

NOTE We will hide the SQL behind each of the following result sets for ease of viewing. To view this code, simply click the Show code item above each of the following charts.

Show code
[Chart: count of unique transactions by InvoiceMonth]
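The query behind this chart is hidden, but a roughly equivalent monthly rollup (a sketch; the hidden code may differ in its details) looks like this:

# unique transactions by month
display(
  spark.sql('''
    SELECT
      TRUNC(TO_DATE(InvoiceDate), 'MM') as InvoiceMonth,
      COUNT(DISTINCT InvoiceNo) as Transactions
    FROM orders
    GROUP BY TRUNC(TO_DATE(InvoiceDate), 'MM')
    ORDER BY InvoiceMonth
    ''')
  )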

For the little more than one year for which we have data, we see over four thousand unique customers. These customers generated about twenty-two thousand unique transactions:

Show code
UniqueCustomers | UniqueTransactions
4372 | 22190

A little quick math may lead us to estimate that, on average, each customer is responsible for about 5 transactions, but this would not provide an accurate representation of customer activity.
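That rough estimate is simply the ratio of unique transactions to unique customers; as a quick sketch:

# rough (and, as noted, potentially misleading) transactions-per-customer ratio
display(
  spark.sql('''
    SELECT
      COUNT(DISTINCT InvoiceNo) / COUNT(DISTINCT CustomerID) as avg_transactions_per_customer
    FROM orders
    WHERE CustomerID IS NOT NULL
    ''')
  )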

Instead, if we count the unique transactions by customer and then examine the frequency of these values, we see that many of the customers have engaged in a single transaction. The distribution of the count of repeat purchases declines from there in a manner that we may describe as a negative binomial distribution (which is the basis of the NBD acronym included in the name of most BTYD models):

Show code
[Chart: Occurrences (number of customers) by Transactions (count of unique transactions per customer)]
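The code for this chart is hidden as well; a query along these lines (a sketch) produces the frequency-of-frequencies view shown above:

# number of customers (Occurrences) by their count of unique transactions (Transactions)
display(
  spark.sql('''
    SELECT
      x.Transactions,
      COUNT(*) as Occurrences
    FROM (
      SELECT CustomerID, COUNT(DISTINCT InvoiceNo) as Transactions
      FROM orders
      WHERE CustomerID IS NOT NULL
      GROUP BY CustomerID
      ) x
    GROUP BY x.Transactions
    ORDER BY x.Transactions
    ''')
  )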

If we alter our last analysis to group a customer's transactions that occur on the same date into a single transaction - a pattern that aligns with metrics we will calculate later - we see that a few more customers are identified as non-repeat customers but the overall pattern remains the same:

Show code
[Chart: Occurrences (number of customers) by Transactions (count of unique purchase dates per customer)]
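The same-date variant simply counts distinct purchase dates rather than distinct invoice numbers; again, a sketch rather than the hidden code itself:

# number of customers (Occurrences) by their count of unique purchase dates (Transactions)
display(
  spark.sql('''
    SELECT
      x.Transactions,
      COUNT(*) as Occurrences
    FROM (
      SELECT CustomerID, COUNT(DISTINCT TO_DATE(InvoiceDate)) as Transactions
      FROM orders
      WHERE CustomerID IS NOT NULL
      GROUP BY CustomerID
      ) x
    GROUP BY x.Transactions
    ORDER BY x.Transactions
    ''')
  )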

Focusing on customers with repeat purchases, we can examine the distribution of the days between purchase events. What's important to note here is that most customers return to the site within 2 to 3 months of a prior purchase. Longer gaps do occur, but significantly fewer customers have longer gaps between returns. This is important to understand in the context of our BTYD models in that the time since we last saw a customer is a critical factor in determining whether they will ever come back, with the probability of return dropping as more and more time passes since a customer's last purchase event:

Show code
[Chart: density of AvgDaysBetween (average days between purchase events)]
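One way to derive this distribution (a sketch, not necessarily the hidden query) is to compute the gap between each purchase date and the customer's prior purchase date with a window function, then average those gaps for customers with at least one repeat purchase:

from pyspark.sql.window import Window
from pyspark.sql.functions import to_date, datediff, lag, avg, count

# distinct purchase dates per customer
purchase_dates = (
  orders
    .where(orders.CustomerID.isNotNull())
    .select(orders.CustomerID, to_date(orders.InvoiceDate).alias('transaction_at'))
    .distinct()
  )

# days between each purchase date and the customer's prior purchase date
win = Window.partitionBy('CustomerID').orderBy('transaction_at')
gaps = purchase_dates.withColumn(
  'days_since_prior',
  datediff('transaction_at', lag('transaction_at', 1).over(win))
  )

# average gap per customer, limited to customers with repeat purchase dates
avg_days_between = (
  gaps
    .groupBy('CustomerID')
    .agg(
      avg('days_since_prior').alias('AvgDaysBetween'),
      count('days_since_prior').alias('repeat_purchases')
      )
    .where('repeat_purchases > 0')
    .select('CustomerID', 'AvgDaysBetween')
  )

display(avg_days_between)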


Step 3: Calculate Customer Metrics

The dataset with which we are working consists of raw transactional history. To apply the BTYD models, we need to derive several per-customer metrics:

  • Frequency - the number of dates on which a customer made a purchase subsequent to the date of the customer's first purchase
  • Age (T) - the number of time units, e.g. days, since the date of a customer's first purchase to the current date (or last date in the dataset)
  • Recency - the age of the customer (as previously defined) at the time of their last purchase

It's important to note that when calculating metrics such as customer age that we need to consider when our dataset terminates. Calculating these metrics relative to today's date can lead to erroneous results. Given this, we will identify the last date in the dataset and define that as today's date for all calculations.

To get started with these calculations, let's take a look at how they are performed using the built-in functionality of the lifetimes library:

import lifetimes

# set the last transaction date as the end point for this historical dataset
current_date = orders_pd['InvoiceDate'].max()

# calculate the required customer metrics
metrics_pd = (
  lifetimes.utils.summary_data_from_transaction_data(
    orders_pd,
    customer_id_col='CustomerID',
    datetime_col='InvoiceDate',
    observation_period_end = current_date, 
    freq='D'
    )
  )

# display first few rows
metrics_pd.head(10)

The lifetimes library, like many Python libraries, is single-threaded. Using this library to derive customer metrics on larger transactional datasets may overwhelm your system or simply take too long to complete. For this reason, let's examine how these metrics can be calculated using the distributed capabilities of Apache Spark.

As SQL is frequently employed for complex data manipulation, we'll start with a Spark SQL statement. In this statement, we first assemble each customer's order history consisting of the customer's ID, the date of their first purchase (first_at), the date on which a purchase was observed (transaction_at) and the current date (using the last date in the dataset for this value). From this history, we can count the number of repeat transaction dates (frequency), the days between the last and first transaction dates (recency), and the days between the current date and first transaction (T) on a per-customer basis:

# sql statement to derive summary customer stats
sql = '''
  SELECT
    a.customerid as CustomerID,
    CAST(COUNT(DISTINCT a.transaction_at) - 1 as float) as frequency,
    CAST(DATEDIFF(MAX(a.transaction_at), a.first_at) as float) as recency,
    CAST(DATEDIFF(a.current_dt, a.first_at) as float) as T
  FROM ( -- customer order history
    SELECT DISTINCT
      x.customerid,
      z.first_at,
      TO_DATE(x.invoicedate) as transaction_at,
      y.current_dt
    FROM orders x
    CROSS JOIN (SELECT MAX(TO_DATE(invoicedate)) as current_dt FROM orders) y                                -- current date (according to dataset)
    INNER JOIN (SELECT customerid, MIN(TO_DATE(invoicedate)) as first_at FROM orders GROUP BY customerid) z  -- first order per customer
      ON x.customerid=z.customerid
    WHERE x.customerid IS NOT NULL
    ) a
  GROUP BY a.customerid, a.current_dt, a.first_at
  ORDER BY CustomerID
  '''

# capture stats in dataframe 
metrics_sql = spark.sql(sql)

# display stats
display(metrics_sql)  
CustomerID | frequency | recency | T
12346 | 0 | 0 | 325
12347 | 6 | 365 | 367
12348 | 3 | 283 | 358
12349 | 0 | 0 | 18
12350 | 0 | 0 | 310
... (remaining rows omitted)
Showing the first 1000 rows.

Of course, Spark SQL does not require the DataFrame to be accessed exclusively using a SQL statement. We may derive this same result using the Programmatic SQL API, which may align better with some data scientists' preferences. The code in the next cell is purposely assembled to mirror the structure of the previous SQL statement for purposes of comparison:

from pyspark.sql.functions import to_date, datediff, max, min, countDistinct, count, sum, when
from pyspark.sql.types import *

# valid customer orders
x = orders.where(orders.CustomerID.isNotNull())

# calculate last date in dataset
y = (
  orders
    .groupBy()
    .agg(max(to_date(orders.InvoiceDate)).alias('current_dt'))
  )

# calculate first transaction date by customer
z = (
  orders
    .groupBy(orders.CustomerID)
    .agg(min(to_date(orders.InvoiceDate)).alias('first_at'))
  )

# combine customer history with date info 
a = (x
    .crossJoin(y)
    .join(z, x.CustomerID==z.CustomerID, how='inner')
    .select(
      x.CustomerID.alias('customerid'), 
      z.first_at, 
      to_date(x.InvoiceDate).alias('transaction_at'), 
      y.current_dt
      )
     .distinct()
    )

# calculate relevant metrics by customer
metrics_api = (a
           .groupBy(a.customerid, a.current_dt, a.first_at)
           .agg(
             (countDistinct(a.transaction_at)-1).cast(FloatType()).alias('frequency'),
             datediff(max(a.transaction_at), a.first_at).cast(FloatType()).alias('recency'),
             datediff(a.current_dt, a.first_at).cast(FloatType()).alias('T')
             )
           .select('customerid','frequency','recency','T')
           .orderBy('customerid')
          )

display(metrics_api)
CustomerID | frequency | recency | T
12346 | 0 | 0 | 325
12347 | 6 | 365 | 367
12348 | 3 | 283 | 358
... (remaining rows omitted; the values match the SQL-derived results above)

Showing the first 1000 rows.
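As a sanity check (a sketch, assuming the metrics_pd and metrics_api DataFrames from the cells above are still in scope), we can confirm that the Spark-derived metrics line up with the values produced by the lifetimes library; a mismatch count of zero indicates the two approaches agree:

# bring the lifetimes-derived metrics into Spark for comparison
metrics_lifetimes = spark.createDataFrame(metrics_pd.reset_index())

# count customer records where the two calculations disagree
mismatches = (
  metrics_api
    .join(metrics_lifetimes, metrics_api.customerid == metrics_lifetimes.CustomerID, how='inner')
    .where(
      (metrics_api.frequency != metrics_lifetimes.frequency) |
      (metrics_api.recency != metrics_lifetimes.recency) |
      (metrics_api.T != metrics_lifetimes.T)
      )
    .count()
  )

print('customer records with mismatched metrics: {0}'.format(mismatches))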