import pandas as pd
import numpy as np

# identify name of the xlsx file (which will change when uploaded)
xlsx_filename = dbutils.fs.ls('file:///dbfs/FileStore/tables/online_retail')[0][0]

# schema of the excel spreadsheet data range
# (InvoiceDate is parsed natively by read_excel, so it is not cast here;
#  np.str is deprecated in numpy, so the built-in str is used instead)
orders_schema = {
  'InvoiceNo': str,
  'StockCode': str,
  'Description': str,
  'Quantity': np.int64,
  'UnitPrice': np.float64,
  'CustomerID': str,
  'Country': str
  }

# read spreadsheet to pandas dataframe
# the xlrd library must be installed for this step to work
orders_pd = pd.read_excel(
  xlsx_filename,
  sheet_name='Online Retail',
  header=0,  # first row is header
  dtype=orders_schema
  )

# display first few rows from the dataset
orders_pd.head(10)
import lifetimes

# set the last transaction date as the end point for this historical dataset
current_date = orders_pd['InvoiceDate'].max()

# calculate the required customer metrics
metrics_pd = (
  lifetimes.utils.summary_data_from_transaction_data(
    orders_pd,
    customer_id_col='CustomerID',
    datetime_col='InvoiceDate',
    observation_period_end=current_date,
    freq='D'
    )
  )

# display first few rows
metrics_pd.head(10)
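Before fitting any models, it can be useful to reserve part of the transaction history for validation. The lifetimes library provides calibration_and_holdout_data for exactly this purpose; the sketch below is one possible approach, assuming a hypothetical 90-day holdout window rather than a value prescribed by the dataset.

from lifetimes.utils import calibration_and_holdout_data

# split the transaction history into calibration and holdout periods
# (the 90-day holdout is a hypothetical choice; adjust for your data)
holdout_days = 90
calibration_end = current_date - pd.Timedelta(days=holdout_days)

metrics_cal_pd = (
  calibration_and_holdout_data(
    orders_pd,
    customer_id_col='CustomerID',
    datetime_col='InvoiceDate',
    calibration_period_end=calibration_end,  # end of calibration window
    observation_period_end=current_date,     # end of holdout window
    freq='D'
    )
  )

# display first few rows
metrics_cal_pd.head(10)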
# sql statement to derive summary customer stats
sql = '''
  SELECT
    a.customerid as CustomerID,
    CAST(COUNT(DISTINCT a.transaction_at) - 1 as float) as frequency,
    CAST(DATEDIFF(MAX(a.transaction_at), a.first_at) as float) as recency,
    CAST(DATEDIFF(a.current_dt, a.first_at) as float) as T
  FROM ( -- customer order history
    SELECT DISTINCT
      x.customerid,
      z.first_at,
      TO_DATE(x.invoicedate) as transaction_at,
      y.current_dt
    FROM orders x
    CROSS JOIN (SELECT MAX(TO_DATE(invoicedate)) as current_dt FROM orders) y  -- current date (according to dataset)
    INNER JOIN (SELECT customerid, MIN(TO_DATE(invoicedate)) as first_at FROM orders GROUP BY customerid) z  -- first order per customer
      ON x.customerid=z.customerid
    WHERE x.customerid IS NOT NULL
    ) a
  GROUP BY a.customerid, a.current_dt, a.first_at
  ORDER BY CustomerID
  '''

# capture stats in dataframe
metrics_sql = spark.sql(sql)

# display stats
display(metrics_sql)
from pyspark.sql.functions import to_date, datediff, max, min, countDistinct, count, sum, when
from pyspark.sql.types import *

# valid customer orders
x = orders.where(orders.CustomerID.isNotNull())

# calculate last date in dataset
y = (
  orders
    .groupBy()
    .agg(max(to_date(orders.InvoiceDate)).alias('current_dt'))
  )

# calculate first transaction date by customer
z = (
  orders
    .groupBy(orders.CustomerID)
    .agg(min(to_date(orders.InvoiceDate)).alias('first_at'))
  )

# combine customer history with date info
a = (x
    .crossJoin(y)
    .join(z, x.CustomerID==z.CustomerID, how='inner')
    .select(
      x.CustomerID.alias('customerid'),
      z.first_at,
      to_date(x.InvoiceDate).alias('transaction_at'),
      y.current_dt
      )
    .distinct()
    )

# calculate relevant metrics by customer
metrics_api = (a
    .groupBy(a.customerid, a.current_dt, a.first_at)
    .agg(
      (countDistinct(a.transaction_at)-1).cast(FloatType()).alias('frequency'),
      datediff(max(a.transaction_at), a.first_at).cast(FloatType()).alias('recency'),
      datediff(a.current_dt, a.first_at).cast(FloatType()).alias('T')
      )
    .select('customerid','frequency','recency','T')
    .orderBy('customerid')
    )

display(metrics_api)
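Since the SQL statement and the programmatic API are intended to produce identical results, a quick cross-check is possible. The sketch below is one way to do this, assuming both dataframes are in scope; exceptAll compares rows positionally and requires Spark 2.4 or later.

# rows produced by the API version but absent from the SQL version,
# and vice versa; both counts should be zero if the logic matches
print(metrics_api.exceptAll(metrics_sql).count())
print(metrics_sql.exceptAll(metrics_api).count())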
Calculating the Probability of Future Customer Engagement
In non-subscription retail models, customers come and go with no long-term commitments, making it very difficult to determine whether a customer will return in the future. Determining the probability that a customer will re-engage is critical to the design of effective marketing campaigns. Different messaging and promotions may be required to incentivize customers who have likely dropped out to return to our stores, while engaged customers may be more responsive to marketing that encourages them to expand the breadth and scale of their purchases with us. Understanding where each customer lands with regard to the probability of future engagement is therefore essential to tailoring our marketing efforts.
The Buy 'til You Die (BTYD) models popularized by Peter Fader and others leverage two basic customer metrics, namely the recency of a customer's last engagement and the frequency of repeat transactions over a customer's lifetime, to derive a probability of future re-engagement. This is done by fitting customer history to curves describing the distribution of purchase frequencies and of engagement drop-off following a prior purchase. The math behind these models is fairly complex, but thankfully it has been encapsulated in the lifetimes library, making it much easier for traditional enterprises to employ. The purpose of this notebook is to examine how these models may be applied to customer transaction history and how they may be deployed for integration into marketing processes.
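To make this concrete, the metrics computed above can be fed directly to one of the lifetimes fitters. The sketch below fits the BG/NBD model (one member of the BTYD family) and scores each customer's probability of still being "alive"; the penalizer_coef value is a hypothetical starting point, not a tuned setting.

from lifetimes import BetaGeoFitter

# fit the BG/NBD model to the frequency/recency/T metrics derived earlier
bgf = BetaGeoFitter(penalizer_coef=0.0)  # hypothetical value; tune as needed
bgf.fit(metrics_pd['frequency'], metrics_pd['recency'], metrics_pd['T'])

# probability that each customer remains engaged, given their history
metrics_pd['prob_alive'] = bgf.conditional_probability_alive(
  metrics_pd['frequency'],
  metrics_pd['recency'],
  metrics_pd['T']
  )

# display first few rows
metrics_pd.head(10)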