CLV Part 1: Customer Lifetimes (Python)

Calculating the Probability of Future Customer Engagement

In non-subscription retail models, customers come and go with no long-term commitments, making it very difficult to determine whether a customer will return in the future. Determining the probability that a customer will re-engage is critical to the design of effective marketing campaigns. Different messaging and promotions may be required to incentivize customers who have likely dropped out to return to our stores. Engaged customers may be more responsive to marketing that encourages them to expand the breadth and scale of purchases with us. Understanding where our customers land with regard to the probability of future engagement is critical to tailoring our marketing efforts to them.

The Buy 'til You Die (BTYD) models popularized by Peter Fader and others leverage two basic customer metrics, i.e. the recency of a customer's last engagement and the frequency of repeat transactions over a customer's lifetime, to derive a probability of future re-engagement. This is done by fitting customer history to curves describing the distribution of purchase frequencies and engagement drop-off following a prior purchase. The math behind these models is fairly complex but thankfully it's been encapsulated in the lifetimes library, making it much easier for traditional enterprises to employ. The purpose of this notebook is to examine how these models may be applied to customer transaction history and how they may be deployed for integration in marketing processes.

Step 1: Setup the Environment

To run this notebook, you need to attach it to a cluster running the Databricks ML Runtime, version 6.5 or higher. This runtime provides access to many of the pre-configured libraries used here. Still, there are additional Python libraries which you will need to install and attach to your cluster. These are:

  • xlrd
  • lifetimes==0.10.1
  • nbconvert

To install these libraries in your Databricks workspace, please follow these steps, using the PyPI library source together with the library names bulleted above. Once installed, please be sure to attach these libraries to the cluster with which you are running this notebook.
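On more recent Databricks runtimes that support notebook-scoped libraries, a quicker alternative is to install the packages directly from the notebook with the %pip magic command. This is only a sketch of that option; the workspace library approach described above is what the remainder of this notebook assumes:

%pip install xlrd lifetimes==0.10.1 nbconvert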

With the libraries installed, let's load a sample dataset with which we can examine the BTYD models. The dataset we will use is the Online Retail Data Set available from the UCI Machine Learning Repository. This dataset is made available as a Microsoft Excel workbook (XLSX). Having downloaded this XLSX file to our local system, we can load it into our Databricks environment by following the steps provided here. Please note that when performing the file import, you don't need to select the Create Table with UI or the Create Table in Notebook options to complete the import process. Also, the name of the XLSX file will be modified upon import as it includes an unsupported space character. As a result, we will need to programmatically locate the new name assigned to the file by the import process.

Assuming we've uploaded the XLSX to /FileStore/tables/online_retail/, we can access it as follows:

import pandas as pd
import numpy as np

# identify name of xlsx file (which will change when uploaded)
xlsx_filename = dbutils.fs.ls('file:///dbfs/FileStore/tables/online_retail')[0][0]

# schema of the excel spreadsheet data range
orders_schema = {
  'InvoiceNo':np.str,
  'StockCode':np.str,
  'Description':np.str,
  'Quantity':np.int64,
  'InvoiceDate':np.datetime64,
  'UnitPrice':np.float64,
  'CustomerID':np.str,
  'Country':np.str  
  }

# read spreadsheet to pandas dataframe
# the xlrd library must be installed for this step to work 
orders_pd = pd.read_excel(
  xlsx_filename, 
  sheet_name='Online Retail',
  header=0, # first row is header
  dtype=orders_schema
  )

# display first few rows from the dataset
orders_pd.head(10)
Out[1]:

The data in the workbook are organized as a range in the Online Retail spreadsheet. Each record represents a line item in a sales transaction. The fields included in the dataset are:

  • InvoiceNo - A 6-digit integral number uniquely assigned to each transaction
  • StockCode - A 5-digit integral number uniquely assigned to each distinct product
  • Description - The product (item) name
  • Quantity - The quantity of each product (item) per transaction
  • InvoiceDate - The invoice date and time in mm/dd/yy hh:mm format
  • UnitPrice - The per-unit product price in pound sterling (£)
  • CustomerID - A 5-digit integral number uniquely assigned to each customer
  • Country - The name of the country where each customer resides

Of these fields, the ones of particular interest for our work are InvoiceNo which identifies the transaction, InvoiceDate which identifies the date of that transaction, and CustomerID which uniquely identifies the customer across multiple transactions. (In a separate notebook, we will examine the monetary value of the transactions through the UnitPrice and Quantity fields.)

Step 2: Explore the Dataset

To enable the exploration of the data using SQL statements, let's flip the pandas DataFrame into a Spark DataFrame and persist it as a temporary view:

# convert pandas DF to Spark DF
orders = spark.createDataFrame(orders_pd)

# present Spark DF as queriable view
orders.createOrReplaceTempView('orders') 

Examining the transaction activity in our dataset, we can see the first transaction occurs on December 1, 2010 and the last on December 9, 2011, making this dataset a little more than one year in duration. The daily transaction count shows there is quite a bit of volatility in daily activity for this online retailer:

%sql -- unique transactions by date

SELECT 
  TO_DATE(InvoiceDate) as InvoiceDate,
  COUNT(DISTINCT InvoiceNo) as Transactions
FROM orders
GROUP BY TO_DATE(InvoiceDate)
ORDER BY InvoiceDate;
[Chart: unique transactions by date (InvoiceDate vs. Transactions)]

We can smooth this out a bit by summarizing activity by month. It's important to keep in mind that December 2011 includes only 9 days of data, so the sales decline graphed for the last month should most likely be ignored:

NOTE We will hide the SQL behind each of the following result sets for ease of viewing. To view this code, simply click the Show code item above each of the following charts.

Show code
[Chart: unique transactions by month (InvoiceMonth vs. Transactions)]

For the little more than 1-year period for which we have data, we see over four-thousand unique customers. These customers generated about twenty-two thousand unique transactions:

Show code
Unique customers: 4372     Unique transactions: 22190

A little quick math may lead us to estimate that, on average, each customer is responsible for about 5 transactions, but this would not provide an accurate representation of customer activity.

Instead, if we count the unique transactions by customer and then examine the frequency of these values, we see that many of the customers have engaged in only a single transaction. The distribution of the count of repeat purchases declines from there in a manner we might describe as a negative binomial distribution (which is the basis of the NBD acronym included in the name of most BTYD models):

Show code
[Chart: distribution of per-customer transaction counts (Transactions vs. Occurrences)]
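For reference, the hidden query behind the chart above likely resembles the following sketch, which counts distinct invoices per customer (assuming the orders view registered earlier) and then tallies how many customers fall at each transaction count:

# sketch: distribution of unique transaction counts per customer
transaction_dist = spark.sql('''
  SELECT
    Transactions,
    COUNT(*) as Occurrences
  FROM (
    SELECT
      CustomerID,
      COUNT(DISTINCT InvoiceNo) as Transactions
    FROM orders
    WHERE CustomerID IS NOT NULL
    GROUP BY CustomerID
    ) x
  GROUP BY Transactions
  ORDER BY Transactions
  ''')

display(transaction_dist)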

If we alter our last analysis to group a customer's transactions that occur on the same date into a single transaction - a pattern that aligns with metrics we will calculate later - we see that a few more customers are identified as non-repeat customers but the overall pattern remains the same:

Show code
[Chart: distribution of per-customer purchase-date counts (Transactions vs. Occurrences)]

Focusing on customers with repeat purchases, we can examine the distribution of the days between purchase events. What's important to note here is that most customers return to the site within 2 to 3 months of a prior purchase. Longer gaps do occur, but significantly fewer customers have longer gaps between returns. This is important to understand in the context of our BTYD models in that the time since we last saw a customer is a critical factor in determining whether they will ever come back, with the probability of return dropping as more and more time passes since a customer's last purchase event:

Show code
[Chart: density of average days between purchases (AvgDaysBetween vs. Density)]
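A sketch of how the gap between purchase dates might be derived for repeat customers is shown below; the hidden cell may differ in its details, but a window-function query against the orders view is a common way to express it:

# sketch: average days between purchase dates for customers with repeat purchases
avg_gap = spark.sql('''
  SELECT
    CustomerID,
    AVG(DATEDIFF(transaction_at, prior_transaction_at)) as AvgDaysBetween
  FROM (
    SELECT
      CustomerID,
      transaction_at,
      LAG(transaction_at) OVER (PARTITION BY CustomerID ORDER BY transaction_at) as prior_transaction_at
    FROM (
      SELECT DISTINCT CustomerID, TO_DATE(InvoiceDate) as transaction_at
      FROM orders
      WHERE CustomerID IS NOT NULL
      ) y
    ) x
  WHERE prior_transaction_at IS NOT NULL
  GROUP BY CustomerID
  ''')

display(avg_gap)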


Step 3: Calculate Customer Metrics

The dataset with which we are working consists of raw transactional history. To apply the BTYD models, we need to derive several per-customer metrics:

  • Frequency - the number of dates on which a customer made a purchase subsequent to the date of the customer's first purchase
  • Age (T) - the number of time units, e.g. days, since the date of a customer's first purchase to the current date (or last date in the dataset)
  • Recency - the age of the customer (as previously defined) at the time of their last purchase

It's important to note that when calculating metrics such as customer age, we need to consider when our dataset terminates. Calculating these metrics relative to today's date can lead to erroneous results. Given this, we will identify the last date in the dataset and define that as today's date for all calculations.
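As a quick illustration of these definitions, consider a hypothetical customer who first purchases on January 1, 2011 and makes repeat purchases on January 10 and February 1, with the dataset ending on March 1. The lifetimes utility introduced below should report a frequency of 2, a recency of 31 days and an age (T) of 59 days for this customer:

import pandas as pd
from lifetimes.utils import summary_data_from_transaction_data

# hypothetical toy history to illustrate frequency, recency and T
toy = pd.DataFrame({
  'CustomerID': ['A', 'A', 'A'],
  'InvoiceDate': pd.to_datetime(['2011-01-01', '2011-01-10', '2011-02-01'])
  })

summary_data_from_transaction_data(
  toy,
  customer_id_col='CustomerID',
  datetime_col='InvoiceDate',
  observation_period_end=pd.to_datetime('2011-03-01'),
  freq='D'
  )
# expected: frequency=2.0, recency=31.0, T=59.0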

To get started with these calculations, let's take a look at how they are performed using the built-in functionality of the lifetimes library:

import lifetimes

# set the last transaction date as the end point for this historical dataset
current_date = orders_pd['InvoiceDate'].max()

# calculate the required customer metrics
metrics_pd = (
  lifetimes.utils.summary_data_from_transaction_data(
    orders_pd,
    customer_id_col='CustomerID',
    datetime_col='InvoiceDate',
    observation_period_end = current_date, 
    freq='D'
    )
  )

# display first few rows
metrics_pd.head(10)
Out[3]:

The lifetimes library, like many Python libraries, is single-threaded. Using this library to derive customer metrics on larger transactional datasets may overwhelm your system or simply take too long to complete. For this reason, let's examine how these metrics can be calculated using the distributed capabilities of Apache Spark.

As SQL is frequently employed for complex data manipulation, we'll start with a Spark SQL statement. In this statement, we first assemble each customer's order history consisting of the customer's ID, the date of their first purchase (first_at), the date on which a purchase was observed (transaction_at) and the current date (using the last date in the dataset for this value). From this history, we can count the number of repeat transaction dates (frequency), the days between the last and first transaction dates (recency), and the days between the current date and first transaction (T) on a per-customer basis:

# sql statement to derive summary customer stats
sql = '''
  SELECT
    a.customerid as CustomerID,
    CAST(COUNT(DISTINCT a.transaction_at) - 1 as float) as frequency,
    CAST(DATEDIFF(MAX(a.transaction_at), a.first_at) as float) as recency,
    CAST(DATEDIFF(a.current_dt, a.first_at) as float) as T
  FROM ( -- customer order history
    SELECT DISTINCT
      x.customerid,
      z.first_at,
      TO_DATE(x.invoicedate) as transaction_at,
      y.current_dt
    FROM orders x
    CROSS JOIN (SELECT MAX(TO_DATE(invoicedate)) as current_dt FROM orders) y                                -- current date (according to dataset)
    INNER JOIN (SELECT customerid, MIN(TO_DATE(invoicedate)) as first_at FROM orders GROUP BY customerid) z  -- first order per customer
      ON x.customerid=z.customerid
    WHERE x.customerid IS NOT NULL
    ) a
  GROUP BY a.customerid, a.current_dt, a.first_at
  ORDER BY CustomerID
  '''

# capture stats in dataframe 
metrics_sql = spark.sql(sql)

# display stats
display(metrics_sql)  
[Output table: CustomerID, frequency, recency, T per customer]

Showing the first 1000 rows.

Of course, Spark SQL does not require the DataFrame to be accessed exclusively using a SQL statement. We may derive this same result using the Programmatic SQL API, which may align better with some Data Scientists' preferences. The code in the next cell is purposely assembled to mirror the structure of the previous SQL statement for the purposes of comparison:

from pyspark.sql.functions import to_date, datediff, max, min, countDistinct, count, sum, when
from pyspark.sql.types import *

# valid customer orders
x = orders.where(orders.CustomerID.isNotNull())

# calculate last date in dataset
y = (
  orders
    .groupBy()
    .agg(max(to_date(orders.InvoiceDate)).alias('current_dt'))
  )

# calculate first transaction date by customer
z = (
  orders
    .groupBy(orders.CustomerID)
    .agg(min(to_date(orders.InvoiceDate)).alias('first_at'))
  )

# combine customer history with date info 
a = (x
    .crossJoin(y)
    .join(z, x.CustomerID==z.CustomerID, how='inner')
    .select(
      x.CustomerID.alias('customerid'), 
      z.first_at, 
      to_date(x.InvoiceDate).alias('transaction_at'), 
      y.current_dt
      )
     .distinct()
    )

# calculate relevant metrics by customer
metrics_api = (a
           .groupBy(a.customerid, a.current_dt, a.first_at)
           .agg(
             (countDistinct(a.transaction_at)-1).cast(FloatType()).alias('frequency'),
             datediff(max(a.transaction_at), a.first_at).cast(FloatType()).alias('recency'),
             datediff(a.current_dt, a.first_at).cast(FloatType()).alias('T')
             )
           .select('customerid','frequency','recency','T')
           .orderBy('customerid')
          )

display(metrics_api)
[Output table: CustomerID, frequency, recency, T per customer]

Showing the first 1000 rows.

Let's take a moment to compare the data in these different metrics datasets, just to confirm the results are identical. Instead of doing this record by record, let's calculate summary statistics across each dataset to verify their consistency:

NOTE You may notice means and standard deviations vary slightly in the hundred-thousandths and millionths decimal places. This is a result of slight differences in data types between the pandas and Spark DataFrames, but these differences do not affect our results in a meaningful way.

# summary data from lifetimes
metrics_pd.describe()
Out[6]:
# summary data from SQL statement
metrics_sql.toPandas().describe()
Out[7]:
# summary data from pyspark.sql API
metrics_api.toPandas().describe()
Out[8]:

The metrics we've calculated represent summaries of a time series of data. To support model validation and avoid overfitting, a common pattern with time series data is to train models on an earlier portion of the time series (known as the calibration period) and validate against a later portion of the time series (known as the holdout period). In the lifetimes library, the derivation of per-customer metrics using calibration and holdout periods is done through a simple method call. Because our dataset consists of a limited range of data, we will instruct this library method to use the last 90 days of data as the holdout period. A simple parameter, implemented as a widget on the Databricks platform, makes this setting easily configurable:

NOTE To change the number of days in the holdout period, look for the textbox widget by scrolling to the top of your Databricks notebook after running this next cell

# define a notebook parameter making holdout days configurable (90-days default)
dbutils.widgets.text('holdout days', '90')
from datetime import timedelta

# set the last transaction date as the end point for this historical dataset
current_date = orders_pd['InvoiceDate'].max()

# define end of calibration period
holdout_days = int(dbutils.widgets.get('holdout days'))
calibration_end_date = current_date - timedelta(days = holdout_days)

# calculate the required customer metrics
metrics_cal_pd = (
  lifetimes.utils.calibration_and_holdout_data(
    orders_pd,
    customer_id_col='CustomerID',
    datetime_col='InvoiceDate',
    observation_period_end = current_date,
    calibration_period_end=calibration_end_date,
    freq='D'    
    )
  )

# display first few rows
metrics_cal_pd.head(10)
Out[10]:

As before, we may leverage Spark SQL to derive this same information. Again, we'll examine this through both a SQL statement and the programmatic SQL API.

To understand the SQL statement, first recognize that it's divided into two main parts. In the first, we calculate the core metrics, i.e. recency, frequency and age (T), per customer for the calibration period, much like we did in the previous query example. In the second part of the query, we calculate the number of purchase dates in the holdout period for each customer. This value (frequency_holdout) represents the incremental value to be added to the frequency for the calibration period (frequency_cal) when we examine a customer's entire transaction history across both calibration and holdout periods.

To simplify our logic, a common table expression (CTE) named CustomerHistory is defined at the top of the query. This query extracts the relevant dates that make up a customer's transaction history and closely mirrors the logic at the center of the last SQL statement we examined. The only difference is that we include the number of days in the holdout period (duration_holdout):

sql = '''
WITH CustomerHistory 
  AS (
    SELECT  -- nesting req'ed b/c can't SELECT DISTINCT on widget parameter
      m.*,
      getArgument('holdout days') as duration_holdout
    FROM (
      SELECT DISTINCT
        x.customerid,
        z.first_at,
        TO_DATE(x.invoicedate) as transaction_at,
        y.current_dt
      FROM orders x
      CROSS JOIN (SELECT MAX(TO_DATE(invoicedate)) as current_dt FROM orders) y                                -- current date (according to dataset)
      INNER JOIN (SELECT customerid, MIN(TO_DATE(invoicedate)) as first_at FROM orders GROUP BY customerid) z  -- first order per customer
        ON x.customerid=z.customerid
      WHERE x.customerid IS NOT NULL
    ) m
  )
SELECT
    a.customerid as CustomerID,
    a.frequency as frequency_cal,
    a.recency as recency_cal,
    a.T as T_cal,
    COALESCE(b.frequency_holdout, 0.0) as frequency_holdout,
    a.duration_holdout
FROM ( -- CALIBRATION PERIOD CALCULATIONS
    SELECT
        p.customerid,
        CAST(p.duration_holdout as float) as duration_holdout,
        CAST(DATEDIFF(MAX(p.transaction_at), p.first_at) as float) as recency,
        CAST(COUNT(DISTINCT p.transaction_at) - 1 as float) as frequency,
        CAST(DATEDIFF(DATE_SUB(p.current_dt, p.duration_holdout), p.first_at) as float) as T
    FROM CustomerHistory p
    WHERE p.transaction_at < DATE_SUB(p.current_dt, p.duration_holdout)  -- LIMIT THIS QUERY TO DATA IN THE CALIBRATION PERIOD
    GROUP BY p.customerid, p.first_at, p.current_dt, p.duration_holdout
  ) a
LEFT OUTER JOIN ( -- HOLDOUT PERIOD CALCULATIONS
  SELECT
    p.customerid,
    CAST(COUNT(DISTINCT p.transaction_at) as float) as frequency_holdout
  FROM CustomerHistory p
  WHERE 
    p.transaction_at >= DATE_SUB(p.current_dt, p.duration_holdout) AND  -- LIMIT THIS QUERY TO DATA IN THE HOLDOUT PERIOD
    p.transaction_at <= p.current_dt
  GROUP BY p.customerid
  ) b
  ON a.customerid=b.customerid
ORDER BY CustomerID
'''

metrics_cal_sql = spark.sql(sql)
display(metrics_cal_sql)
[Output table: CustomerID, frequency_cal, recency_cal, T_cal, frequency_holdout, duration_holdout per customer]

Showing the first 1000 rows.

And here is the equivalent Programmatic SQL API logic:

from pyspark.sql.functions import avg, date_sub, coalesce, lit, expr

# valid customer orders
x = orders.where(orders.CustomerID.isNotNull())

# calculate last date in dataset
y = (
  orders
    .groupBy()
    .agg(max(to_date(orders.InvoiceDate)).alias('current_dt'))
  )

# calculate first transaction date by customer
z = (
  orders
    .groupBy(orders.CustomerID)
    .agg(min(to_date(orders.InvoiceDate)).alias('first_at'))
  )

# combine customer history with date info (CUSTOMER HISTORY)
p = (x
    .crossJoin(y)
    .join(z, x.CustomerID==z.CustomerID, how='inner')
    .withColumn('duration_holdout', lit(int(dbutils.widgets.get('holdout days'))))
    .select(
      x.CustomerID.alias('customerid'),
      z.first_at, 
      to_date(x.InvoiceDate).alias('transaction_at'), 
      y.current_dt, 
      'duration_holdout'
      )
     .distinct()
    )

# calculate relevant metrics by customer
# note: date_sub requires a single integer value unless employed within an expr() call
a = (p
       .where(p.transaction_at < expr('date_sub(current_dt, duration_holdout)')) 
       .groupBy(p.customerid, p.current_dt, p.duration_holdout, p.first_at)
       .agg(
         (countDistinct(p.transaction_at)-1).cast(FloatType()).alias('frequency_cal'),
         datediff( max(p.transaction_at), p.first_at).cast(FloatType()).alias('recency_cal'),
         datediff( expr('date_sub(current_dt, duration_holdout)'), p.first_at).cast(FloatType()).alias('T_cal')
       )
    )

b = (p
      .where((p.transaction_at >= expr('date_sub(current_dt, duration_holdout)')) & (p.transaction_at <= p.current_dt) )
      .groupBy(p.customerid)
      .agg(
        countDistinct(p.transaction_at).cast(FloatType()).alias('frequency_holdout')
        )
   )

metrics_cal_api = (a
                 .join(b, a.customerid==b.customerid, how='left')
                 .select(
                   a.customerid.alias('CustomerID'),
                   a.frequency_cal,
                   a.recency_cal,
                   a.T_cal,
                   coalesce(b.frequency_holdout, lit(0.0)).alias('frequency_holdout'),
                   a.duration_holdout
                   )
                 .orderBy('CustomerID')
              )

display(metrics_cal_api)
[Output table: CustomerID, frequency_cal, recency_cal, T_cal, frequency_holdout, duration_holdout per customer]

Showing the first 1000 rows.

Using summary stats, we can again verify these different units of logic are returning the same results:

# summary data from lifetimes
metrics_cal_pd.describe()
Out[13]:
# summary data from SQL statement
metrics_cal_sql.toPandas().describe()
Out[14]:
# summary data from pyspark.sql API
metrics_cal_api.toPandas().describe()
Out[15]:

Our data prep is nearly done. The last thing we need to do is exclude customers for which we have no repeat purchases, i.e. those for which frequency or frequency_cal is 0. The Pareto/NBD and BG/NBD models we will use focus exclusively on performing calculations on customers with repeat transactions. A modified BG/NBD model, i.e. MBG/NBD, which allows for customers with no repeat transactions, is supported by the lifetimes library. However, to stick with the two most popular of the BTYD models in use today, we will limit our data to align with their requirements:

NOTE We are showing how both the pandas and Spark DataFrames are filtered simply to be consistent with side-by-side comparisons earlier in this section of the notebook. In a real-world implementation, you would simply choose to work with pandas or Spark DataFrames for data preparation.

# remove customers with no repeats (complete dataset)
filtered_pd = metrics_pd[metrics_pd['frequency'] > 0]
filtered = metrics_api.where(metrics_api.frequency > 0)

## remove customers with no repeats in calibration period
filtered_cal_pd = metrics_cal_pd[metrics_cal_pd['frequency_cal'] > 0]
filtered_cal = metrics_cal_api.where(metrics_cal_api.frequency_cal > 0)

Step 4: Train the Model

To ease into the training of a model, let's start with a simple exercise using a Pareto/NBD model, the original BTYD model. We'll use the calibration-holdout dataset constructed in the last section of this notebook, fitting the model to the calibration data and later evaluating it using the holdout data:

from lifetimes.fitters.pareto_nbd_fitter import ParetoNBDFitter
from lifetimes.fitters.beta_geo_fitter import BetaGeoFitter

# load spark dataframe to pandas dataframe
input_pd = filtered_cal.toPandas()

# fit a model
model = ParetoNBDFitter(penalizer_coef=0.0)
model.fit( input_pd['frequency_cal'], input_pd['recency_cal'], input_pd['T_cal'])
Out[17]: <lifetimes.ParetoNBDFitter: fitted with 2163 subjects, alpha: 96.96, beta: 3014.97, r: 1.99, s: 0.84>

With our model now fit, let's make some predictions for the holdout period. We'll grab the actuals for that same period to enable comparison in a subsequent step:

# get predicted frequency during holdout period
frequency_holdout_predicted = model.predict( input_pd['duration_holdout'], input_pd['frequency_cal'], input_pd['recency_cal'], input_pd['T_cal'])

# get actual frequency during holdout period
frequency_holdout_actual = input_pd['frequency_holdout']
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py:86: RuntimeWarning: invalid value encountered in reduce
  return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

With actual and predicted values in hand, we can calculate some standard evaluation metrics. Let's wrap those calculations in a function call to make evaluation easier in future steps:

import numpy as np

def score_model(actuals, predicted, metric='mse'):
  # make sure metric name is lower case
  metric = metric.lower()
  
  # Mean Squared Error and Root Mean Squared Error
  if metric=='mse' or metric=='rmse':
    val = np.sum(np.square(actuals-predicted))/actuals.shape[0]
    if metric=='rmse':
        val = np.sqrt(val)
  
  # Mean Absolute Error
  elif metric=='mae':
    val = np.sum(np.abs(actuals-predicted))/actuals.shape[0]
  
  else:
    val = None
  
  return val

# score the model
print('MSE: {0}'.format(score_model(frequency_holdout_actual, frequency_holdout_predicted, 'mse')))
MSE: 3.102822341084317
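The same helper can report the other supported metrics; for example:

# report additional metrics using the same helper
print('RMSE: {0}'.format(score_model(frequency_holdout_actual, frequency_holdout_predicted, 'rmse')))
print('MAE: {0}'.format(score_model(frequency_holdout_actual, frequency_holdout_predicted, 'mae')))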

While the internals of the Pareto/NBD model are quite complex, in a nutshell the model calculates a double integral of two curves, one which describes the frequency of customer purchases within a population and another which describes customer survivorship following a prior purchase event. All of the calculation logic is thankfully hidden behind a simple method call.

As simple as training a model may be, we have two models that we could use here: the Pareto/NBD model and the BG/NBD model. The BG/NBD model simplifies the math involved in calculating customer lifetime and is the model that popularized the BTYD approach. Both models work off the same customer features and employ the same constraints. (The primary difference between the two models is that the BG/NBD model maps the survivorship curve to a beta-geometric distribution instead of a Pareto distribution.) To achieve the best fit possible, it is worthwhile to compare the results of both models with our dataset.
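As a quick, manual version of that comparison (a sketch reusing the variables and helper defined above, with the regularization parameter again left at 0), we could fit a BG/NBD model on the same calibration data and compare its holdout MSE against the Pareto/NBD result:

# sketch: fit a BG/NBD model on the same calibration data for comparison
bgnbd_model = BetaGeoFitter(penalizer_coef=0.0)
bgnbd_model.fit(input_pd['frequency_cal'], input_pd['recency_cal'], input_pd['T_cal'])

# predict holdout-period purchases and score against actuals
bgnbd_predicted = bgnbd_model.predict(
  input_pd['duration_holdout'],
  input_pd['frequency_cal'],
  input_pd['recency_cal'],
  input_pd['T_cal']
  )

print('BG/NBD MSE: {0}'.format(score_model(input_pd['frequency_holdout'], bgnbd_predicted, 'mse')))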

Each model leverages an L2-norm regularization parameter which we've arbitrarily set to 0 in the previous training cycle. In addition to exploring which model works best, we should consider which value (between 0 and 1) works best for this parameter. This gives us a pretty broad search space to explore with some hyperparameter tuning.

To assist us with this, we will make use of hyperopt. Hyperopt allows us to parallelize the training and evaluation of models against a hyperparameter search space. This can be done leveraging the multiprocessor resources of a single machine or across the broader resources provided by a Spark cluster. With each model iteration, a loss function is calculated. Using various optimization algorithms, hyperopt navigates the search space to locate the best available combination of parameter settings to minimize the value returned by the loss function.

To make use of hyperopt, let's define our search space and re-write our model training and evaluation logic to provide a single function call which will return a loss function measure:

from hyperopt import hp, fmin, tpe, rand, SparkTrials, STATUS_OK, STATUS_FAIL, space_eval

# define search space
search_space = hp.choice('model_type',[
                  {'type':'Pareto/NBD', 'l2':hp.uniform('pareto_nbd_l2', 0.0, 1.0)},
                  {'type':'BG/NBD'    , 'l2':hp.uniform('bg_nbd_l2', 0.0, 1.0)}  
                  ]
                )

# define function for model evaluation
def evaluate_model(params):
  
  # access replicated input_pd dataframe
  data = inputs.value
  
  # retrieve incoming parameters
  model_type = params['type']
  l2_reg = params['l2']
  
  # instantiate and configure the model
  if model_type == 'BG/NBD':
    model = BetaGeoFitter(penalizer_coef=l2_reg)
  elif model_type == 'Pareto/NBD':
    model = ParetoNBDFitter(penalizer_coef=l2_reg)
  else:
    return {'loss': None, 'status': STATUS_FAIL}
  
  # fit the model
  model.fit(data['frequency_cal'], data['recency_cal'], data['T_cal'])
  
  # evaluate the model
  frequency_holdout_actual = data['frequency_holdout']
  frequency_holdout_predicted = model.predict(data['duration_holdout'], data['frequency_cal'], data['recency_cal'], data['T_cal'])
  mse = score_model(frequency_holdout_actual, frequency_holdout_predicted, 'mse')
  
  # return score and status
  return {'loss': mse, 'status': STATUS_OK}

Notice that the evaluate_model function retrieves its data from a variable named inputs. Inputs is defined in the next cell as a broadcast variable containing the input_pd DataFrame used earlier. As a broadcast variable, a complete stand-alone copy of the dataset used by the model is replicated to each worker in the Spark cluster. This limits the amount of data that must be sent from the cluster driver to the workers with each hyperopt iteration. For more information on this and other hyperopt best practices, please refer to this document.

With everything in place, let's perform our hyperparameter tuning over 100 iterations in order to identify the best model type and L2 settings for our dataset:

import mlflow

# replicate input_pd dataframe to workers in Spark cluster
inputs = sc.broadcast(input_pd)

# configure hyperopt settings to distribute to all executors on workers
spark_trials = SparkTrials(parallelism=2)

# select optimization algorithm
algo = tpe.suggest

# perform hyperparameter tuning (logging iterations to mlflow)
argmin = fmin(
  fn=evaluate_model,
  space=search_space,
  algo=algo,
  max_evals=100,
  trials=spark_trials
  )

# release the broadcast dataset
inputs.unpersist()
Hyperopt with SparkTrials will automatically track trials in MLflow. To view the MLflow experiment associated with the notebook, click the 'Runs' icon in the notebook context bar on the upper right. There, you can view all runs. To view logs from trials, please check the Spark executor logs.
[Trial progress output truncated]
Total Trials: 80: 80 succeeded, 0 failed, 0 cancelled. Best loss: 3.540445199381716

When used with the Databricks ML Runtime, the individual runs that make up the search space evaluation are automatically tracked with mlflow, the experiment tracking repository built into the platform. For more information on how to review the models generated by hyperopt using the Databricks mlflow interface, please check out this document.
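If you prefer to inspect the tracked runs programmatically rather than through the UI, a sketch like the following uses mlflow.search_runs to pull the logged trials for the notebook's active experiment into a pandas DataFrame (the metrics.loss column name assumes the trial loss is logged under a metric named loss):

# sketch: retrieve the hyperopt trial runs logged to the notebook's experiment
runs_df = mlflow.search_runs()

# inspect the best (lowest loss) trials; column name assumes the loss metric is logged as 'loss'
runs_df.sort_values('metrics.loss').head(5)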

The optimal hyperparameter settings observed during the hyperopt iterations are captured in the argmin variable. Using the space_eval function, we can obtain a friendly representation of which settings performed best:

# print optimum hyperparameter settings
print(space_eval(search_space, argmin))
{'l2': 0.9975590906220992, 'type': 'BG/NBD'}

Now that we know our best parameter settings, let's train the model with these to enable us to perform some more in-depth model evaluation:

NOTE Because of how search spaces are searched, different hyperopt runs may yield slightly different results.

# get hyperparameter settings
params = space_eval(search_space, argmin)
model_type = params['type']
l2_reg = params['l2']

# instantiate and configure model
if model_type == 'BG/NBD':
  model = BetaGeoFitter(penalizer_coef=l2_reg)
elif model_type == 'Pareto/NBD':
  model = ParetoNBDFitter(penalizer_coef=l2_reg)
else:
  raise Exception('Unrecognized model type')
  
# train the model
model.fit(input_pd['frequency_cal'], input_pd['recency_cal'], input_pd['T_cal'])
Out[23]: <lifetimes.BetaGeoFitter: fitted with 2163 subjects, a: 0.01, alpha: 18.28, b: 0.07, r: 0.45>

Step 5: Evaluate the Model

Using a method defined in the last section of this notebook, we can calculate the MSE for our newly trained model:

# score the model
frequency_holdout_actual = input_pd['frequency_holdout']
frequency_holdout_predicted = model.predict(input_pd['duration_holdout'], input_pd['frequency_cal'], input_pd['recency_cal'], input_pd['T_cal'])
mse = score_model(frequency_holdout_actual, frequency_holdout_predicted, 'mse')

print('MSE: {0}'.format(mse))
MSE: 3.540430294872508

While important for comparing models, the MSE metric is a bit more challenging to interpret in terms of the overall goodness of fit of any individual model. To provide more insight into how well our model fits our data, let's visualize the relationships between some actual and predicted values.

To get started, we can examine how purchase frequencies in the calibration period relate to actual (frequency_holdout) and predicted (model_predictions) frequencies in the holdout period:

from lifetimes.plotting import plot_calibration_purchases_vs_holdout_purchases

plot_calibration_purchases_vs_holdout_purchases(
  model, 
  input_pd, 
  n=90, 
  **{'figsize':(8,8)}
  )

display()

What we see here is that a higher number of purchases in the calibration period predicts a higher average number of purchases in the holdout period, but the actual values diverge sharply from model predictions when we consider customers with a large number of purchases (>60) in the calibration period. Thinking back to the charts in the data exploration section of this notebook, you might recall that there are very few customers with such a large number of purchases, so this divergence may be a result of a very limited number of instances at the higher end of the frequency range. More data may bring the predicted and actual values back together at this higher end of the curve. If this divergence persists, it may indicate a range of customer engagement frequency above which we cannot make reliable predictions.

Using the same method call, we can visualize time since last purchase relative to the average number of purchases in the holdout period. This visualization illustrates that as the time since the last purchase increases, the number of purchases in the holdout period decreases. In other words, those customers we haven't seen in a while aren't likely coming back anytime soon:

NOTE As before, we will hide the code in the following cells to focus on the visualizations. Use Show code to see the associated Python logic.

Show code

Plugging the age of the customer at the time of the last purchase into the chart shows that the timing of the last purchase in a customer's lifecycle doesn't seem to have a strong influence on the number of purchases in the holdout period until a customer becomes quite old. This would indicate that the customers that stick around a long while are likely to be more frequently engaged:

Show code
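The hidden cells above most likely reuse the same lifetimes plotting helper with a different kind argument (the library also accepts values such as 'time_since_last_purchase' and 'recency_cal'); a sketch, with the n argument chosen arbitrarily:

# sketch: time since last purchase vs. average purchases in the holdout period
# (the second hidden chart likely swaps in kind='recency_cal')
plot_calibration_purchases_vs_holdout_purchases(
  model, 
  input_pd, 
  kind='time_since_last_purchase',
  n=30, 
  **{'figsize':(8,8)}
  )

display()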

From a quick visual inspection, it's fair to say our model isn't perfect but there are some useful patterns that it captures. Using these patterns, we might calculate the probability a customer remains engaged:

# add a field with the probability a customer is currently "alive"
filtered_pd['prob_alive']=model.conditional_probability_alive(
    filtered_pd['frequency'], 
    filtered_pd['recency'], 
    filtered_pd['T']
    )

filtered_pd.head(10)
Out[28]:

The prediction of the customer's probability of being alive could be very interesting for the application of the model to our marketing processes. But before exploring model deployment, let's take a look at how this probability changes as customers re-engage by looking at the history of a single customer with modest activity in the dataset, CustomerID 12383:

from lifetimes.plotting import plot_history_alive
import matplotlib.pyplot as plt

# clear past visualization instructions
plt.clf()

# customer of interest
CustomerID = '12383'

# grab customer's metrics and transaction history
cmetrics_pd = input_pd[input_pd['CustomerID']==CustomerID]
trans_history = orders_pd.loc[orders_pd['CustomerID'] == CustomerID]

# calculate age at end of dataset
days_since_birth = 400

# plot history of being "alive"
plot_history_alive(
  model, 
  days_since_birth, 
  trans_history, 
  'InvoiceDate'
  )

display()

From this chart, we can see this customer made his or her first purchase in January 2011 followed by a repeat purchase later that month. There was about a 1-month lull in activity during which the probability of the customer being alive declined slightly, but with purchases in March, April and June of that year, the customer sent repeated signals that he or she was engaged. Since that last June purchase, the customer hasn't been seen in our transaction history, and our belief that the customer remains engaged has been dropping, though at a moderate pace given the signals previously sent.

How does the model arrive at these probabilities? The exact math is tricky but by plotting the probability of being alive as a heatmap relative to frequency and recency, we can understand the probabilities assigned to the intersections of these two values:

from lifetimes.plotting import plot_probability_alive_matrix

# set figure size
plt.subplots(figsize=(12, 8))

plot_probability_alive_matrix(model)

display()

In addition to predicting the probability that a customer is still alive, we can calculate the number of purchases expected from a customer over a given future time interval, such as the next 30 days:

Show code
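The hidden visualization is most likely a frequency/recency matrix of expected purchases; a sketch of how it might be produced with the lifetimes plotting helpers, assuming a 30-day horizon:

from lifetimes.plotting import plot_frequency_recency_matrix
import matplotlib.pyplot as plt

# set figure size
plt.subplots(figsize=(12, 8))

# sketch: expected purchases over the next 30 days by frequency and recency
plot_frequency_recency_matrix(model, T=30)

display()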

As before, we can calculate this value for each customer based on their current metrics:

filtered_pd['purchases_next30days']=(
  model.conditional_expected_number_of_purchases_up_to_time(
    30, 
    filtered_pd['frequency'], 
    filtered_pd['recency'], 
    filtered_pd['T']
    )
  )

filtered_pd.head(10)
Out[32]:

Step 6: Deploy the Model for Predictions

There are numerous ways we might make use of the trained BTYD model. We may wish to understand the probability a customer is still engaged. We may also wish to predict the number of purchases expected from the customer over some number of days. All we need to make these predictions is our trained model and values of frequency, recency and age (T) for the customer as demonstrated here:

frequency = 6
recency = 255
T = 300
t = 30

print('Probability of Alive: {0}'.format( model.conditional_probability_alive(frequency, recency, T) ))
print('Expected Purchases in next {0} days: {1}'.format(t, model.conditional_expected_number_of_purchases_up_to_time(t, frequency, recency, T) ))
Probability of Alive: 0.9949186328353091
Expected Purchases in next 30 days: 0.6048476679280559

The challenge now is to package our model into something we could re-use for this purpose. Earlier, we used mlflow in combination with hyperopt to capture model runs during the hyperparameter tuning exercise. As a platform, mlflow is designed to solve a wide range of challenges that come with model development and deployment, including the deployment of models as functions and microservice applications.

MLFlow tackles deployment challenges out of the box for a number of popular model types. However, lifetimes models are not one of these. To use mlflow as our deployment vehicle, we need to write a custom wrapper class which translates the standard mlflow API calls into logic which can be applied against our model.

To illustrate this, we've implemented a wrapper class for our lifetimes model which maps the mlflow predict() method to multiple prediction calls against our model. Typically, we'd map predict() to a single prediction but we've bumped up the complexity of the returned result to show one of many ways the wrapper may be employed to implement custom logic:

import mlflow
import mlflow.pyfunc

# create wrapper for lifetimes model
class _lifetimesModelWrapper(mlflow.pyfunc.PythonModel):
  
    def __init__(self, lifetimes_model):
        self.lifetimes_model = lifetimes_model

    def predict(self, context, dataframe):
      
      # access input series
      frequency = dataframe.iloc[:,0]
      recency = dataframe.iloc[:,1]
      T = dataframe.iloc[:,2]
      
      # calculate probability currently alive
      results = pd.DataFrame( 
                  self.lifetimes_model.conditional_probability_alive(frequency, recency, T),
                  columns=['alive']
                  )
      # calculate expected purchases for provided time period
      results['purch_15day'] = self.lifetimes_model.conditional_expected_number_of_purchases_up_to_time(15, frequency, recency, T)
      results['purch_30day'] = self.lifetimes_model.conditional_expected_number_of_purchases_up_to_time(30, frequency, recency, T)
      results['purch_45day'] = self.lifetimes_model.conditional_expected_number_of_purchases_up_to_time(45, frequency, recency, T)
      
      return results[['alive', 'purch_15day', 'purch_30day', 'purch_45day']]

We now need to register our model with mlflow. As we do this, we inform it of the wrapper that maps its expected API to the model's functionality. We also provide environment information to instruct it as to which libraries it needs to install and load for our model to work:

NOTE We would typically train and log our model as a single step, but in this notebook we've separated the two actions in order to focus here on custom model deployment. To examine more common patterns of mlflow implementation, please refer to this and other examples available online.

# add lifetimes to conda environment info
conda_env = mlflow.pyfunc.get_default_conda_env()
conda_env['dependencies'][1]['pip'] += ['lifetimes==0.10.1'] # version should match version installed at top of this notebook

# save model run to mlflow
with mlflow.start_run(run_name='deployment run') as run:
  mlflow.pyfunc.log_model(
    'model', 
    python_model=_lifetimesModelWrapper(model), 
    conda_env=conda_env
    )

Now that our model along with its dependency information and class wrapper have been recorded, let's use mlflow to convert the model into a function we can employ against a Spark DataFrame:

from pyspark.sql.types import ArrayType, FloatType

# define the schema of the values returned by the function
result_schema = ArrayType(FloatType())

# define function based on mlflow recorded model
probability_alive_udf = mlflow.pyfunc.spark_udf(
  spark, 
  'runs:/{0}/model'.format(run.info.run_id), 
  result_type=result_schema
  )

# register the function for use in SQL
_ = spark.udf.register('probability_alive', probability_alive_udf)

Assuming we have access to customer metrics for frequency, recency and age, we can now use our function to generate some predictions:

# create a temp view for SQL demonstration (next cell)
filtered.createOrReplaceTempView('customer_metrics')

# demonstrate function call on Spark DataFrame
display(
  filtered
    .withColumn(
      'predictions', 
      probability_alive_udf(filtered.frequency, filtered.recency, filtered.T)
      )
    .selectExpr(
      'customerid', 
      'predictions[0] as prob_alive', 
      'predictions[1] as purch_15day', 
      'predictions[2] as purch_30day', 
      'predictions[3] as purch_45day'
      )
  )
[Output table: customerid, prob_alive, purch_15day, purch_30day, purch_45day per customer]

Showing the first 1000 rows.

%sql -- predict probabilities a customer is alive and will return in 15, 30 & 45 days

SELECT
  x.CustomerID,
  x.prediction[0] as prob_alive,
  x.prediction[1] as purch_15day,
  x.prediction[2] as purch_30day,
  x.prediction[3] as purch_45day
FROM (
  SELECT
    CustomerID,
    probability_alive(frequency, recency, T) as prediction
  FROM customer_metrics
  ) x;
[Output table: CustomerID, prob_alive, purch_15day, purch_30day, purch_45day per customer]

Showing the first 1000 rows.

With our lifetimes model now registered as a function, we can incorporate customer lifetime probabilities into our ETL batch, streaming and interactive query workloads. Leveraging additional model deployment capabilities of mlflow, our model may also be deployed as a standalone web service using AzureML or AWS SageMaker.
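As a final illustration (a sketch, assuming the run recorded above is still in scope), the same logged model can also be loaded back as a generic Python function and applied directly to a pandas DataFrame of customer metrics:

import mlflow.pyfunc

# sketch: load the logged model back as a generic python function
loaded_model = mlflow.pyfunc.load_model('runs:/{0}/model'.format(run.info.run_id))

# score a small pandas dataframe of frequency, recency and age (T) values
sample_metrics = filtered_pd[['frequency', 'recency', 'T']].head(5)
loaded_model.predict(sample_metrics)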