Survival Analysis 02: Exploratory Analysis(Python)
Loading...

With our subscription data prepared, we can now begin examining patterns of customer attrition observed with the KKBox music service. In this notebook, we'll get oriented to general patterns of customer dropout in preparation for the more detailed work taking place in the next notebook.

Note This notebook has been revised as of July 20, 2020

Step 1: Prepare the Environment

The techniques we'll use in this and subsequent notebooks come from the domain of Survival Analysis. While there are several notebooks available in Python that support these techniques, we'll leverage lifelines, the most popular of the survival analysis libraries currently available. To do this, we'll first need to install and load the library to our cluster:

NOTE The next cell assumes you are running this notebook on a Databricks cluster that does not make use of the ML runtime. If using an ML runtime, please follow these alternative steps to load the lifelines library to your environment.

dbutils.library.installPyPI('lifelines')
dbutils.library.restartPython()
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

from lifelines import KaplanMeierFitter
from lifelines.utils import median_survival_times
from lifelines.statistics import pairwise_logrank_test

Step 2: Examine Population-Level Survivorship

Using the full subscription dataset, we can take a look at how members dropout of the subscription service over time. To do this, we will derive a Kaplan-Meier curve which will identify the probability of survival at a point in time (in terms of days since subscription inception) using some simple statistical calculations derived from the observed data:

# retrieve subscription data to pandas DF
subscriptions_pd = spark.table('kkbox.subscriptions').toPandas()
subscriptions_pd.head()
Out[2]:
kmf = KaplanMeierFitter(alpha=0.05) # calculate a 95% confidence interval
kmf.fit(subscriptions_pd['duration_days'], subscriptions_pd['churned'])
Out[3]: <lifelines.KaplanMeierFitter:"KM_estimate", fitted with 3.13028e+06 total observations, 1.22578e+06 right-censored observations>

The output of the previous step tells us we fit the KM model using nearly 3.1 million subscription records of which about 1.3 million were still active as of April 1, 2017. (The term right-censored tells us that the event of interest, i.e. churn, has not occurred within our observation window.) Using this model, we can now calculate the median survival time for any given subscription:

median_ = kmf.median_survival_time_
median_
Out[4]: 184.0

Per documentation associated with the dataset, KKBox members typically subscribe to the service on a 30-day cycle. The median survival time of 184-days would indicate that most customers sign-up for an initial 30-day term and renew subscriptions month over month for 6-months on average before dropping out.

Passing the model various values that correspond with 30-day initial registration, a 1-year commitment, and a 2-year renewal, we can calculate the probability a customer continues with the service, i.e. survives past that point in time:

kmf.predict([30, 365, 730])
Out[5]: 30 0.778921 365 0.424545 730 0.298931 Name: KM_estimate, dtype: float64

Graphing this out, we can see how customer drop-out varies as subscriptions age:

# plot attributes
plt.figure(figsize=(12,8))
plt.title('All Subscriptions', fontsize='xx-large')

# y-axis
plt.ylabel('Survival Rate', fontsize='x-large')
plt.ylim((0.0, 1.0))
plt.yticks(np.arange(0.0,1.0,0.05))
plt.axhline(0.5, color='red', alpha=0.75, linestyle=':') # median line in red

# x-axis
plt.xlabel('Timeline (Days)', fontsize='x-large')
plt.xticks(range(0,800,30))
plt.axvline(30, color='gray', alpha=0.5, linestyle=':')  # 30-day gray dashed line
plt.axvline(180, color='gray', alpha=0.5, linestyle=':')  # 30-day gray dashed line
plt.axvline(365, color='gray', alpha=0.5, linestyle=':') # 1-year gray dashed line
plt.axvline(365*2, color='gray', alpha=0.5, linestyle=':') # 2-year gray dashed line

kmf.plot_survival_function()

Step 3: Examine How Survivorship Varies

The overall pattern of survivorship tells a pretty compelling story about customer attrition at KKBox, but it can be interesting to examine how this pattern varies by subscription attributes. By focusing on attributes of the subscription known at the time of its creation, we may be able to identify variables that tell us something about the long-term retention probability on an account at a time when we may be offering the greatest incentives to acquire a customer. With this in mind, we'll examine the registration channel associated with the subscription along with the initial payment method and payment plan days selected at initial registration:

%sql  -- identify registration channel, initial payment method and initial payment plan days by subscription
DROP VIEW IF EXISTS subscription_attributes;

CREATE TEMP VIEW subscription_attributes 
AS
  WITH transaction_attributes AS ( -- get payment method and plan days for each subscriber's transaction date
      SELECT
        a.msno,
        a.trans_at,
        FIRST(b.payment_method_id) as payment_method_id,
        FIRST(b.payment_plan_days) as payment_plan_days
      FROM (  -- base transaction dates
        SELECT 
          msno,
          transaction_date as trans_at,
          MAX(membership_expire_date) as expires_at
        FROM kkbox.transactions
        GROUP BY msno, transaction_date
        ) a
      INNER JOIN kkbox.transactions b
        ON  a.msno=b.msno AND
            a.trans_at=b.transaction_date AND 
            a.expires_at=b.membership_expire_date
      WHERE b.payment_plan_days > 0
      GROUP BY
        a.msno,
        a.trans_at
      )
  SELECT
    m.*,
    n.payment_method_id as init_payment_method_id,
    n.payment_plan_days as init_payment_plan_days,
    COALESCE(CAST(o.registered_via as String), 'Unknown') as registered_via
  FROM kkbox.subscriptions m
  INNER JOIN transaction_attributes n
    ON  m.msno=n.msno AND      -- match on customer
        m.starts_at=n.trans_at -- and transaction date at start of transaction
  LEFT OUTER JOIN kkbox.members o  -- membership info (assume stable across subscription)
    ON m.msno=o.msno
  ORDER BY m.subscription_id
OK
# capture output to Spark DataFrame
subscriptions = spark.table('subscription_attributes')

# capture output to pandas DataFrame
subscriptions_pd = subscriptions.toPandas()
subscriptions_pd.head()
Out[7]:

With attributes now associated with our subscriptions, let's examine the registration channels by which customers subscribe to the service:

%sql

SELECT
  registered_via,
  COUNT(DISTINCT msno) as members
FROM subscription_attributes
GROUP BY registered_via
ORDER BY members DESC
7677745
9631896
Unknown436029
3399673
4269265
134899
109
161

We don't know anything about the numbered channels but it's clear that channels 7 & 9 are the most popular by far. Several other channels are pretty popular while there are a few which have a nominal number of subscribers associated with them.

To keep our analysis simple, let's eliminate channels 13, 10 & 16 which combined are associated with less than 0.3% of our unique subscribers. Doing this, we can now revisit our survival chart, presenting separate curves for each of the remaining channels:

# eliminate nominal channels
channels_pd = subscriptions_pd[~subscriptions_pd['registered_via'].isin(['10','13','16'])]

# configure the plot
plt.figure(figsize=(12,8))
ax = plt.subplot(111)
plt.title('By Registration Channel', fontsize='xx-large')

# configure the x-axis
plt.xlabel('Timeline (Days)', fontsize='x-large')
plt.xticks(range(0,800,30))

# configure the y-axis
plt.ylabel('Survival Rate', fontsize='x-large')
plt.ylim((0.0, 1.0))
plt.yticks(np.arange(0.0,1.0,0.05))

# graph each curve on the plot
for name, grouped_pd in channels_pd.groupby('registered_via'):
    kmf = KaplanMeierFitter(alpha=0.05)
    kmf.fit(
      grouped_pd['duration_days'], 
      grouped_pd['churned'], 
      label='Channel {0}'.format(name)
      )
    kmf.plot(ax=ax)

Before attempting to interpret these different curves, it's important we evaluate whether they are statistically different from one another. Comparing each curve to the others, we can calculate the probability these curves do not differ from one another using the log-rank test:

NOTE By adding an argument for t_0 to the call below, you can calculate the same metrics for each curve at a specific point in time, instead of across all times as shown here.

log_rank = pairwise_logrank_test(channels_pd['duration_days'], channels_pd['registered_via'], channels_pd['churned'])
log_rank.summary
Out[9]:
log_rank = pairwise_logrank_test(channels_pd['duration_days'], channels_pd['registered_via'], channels_pd['churned'], t_0=184)
log_rank.summary
Out[10]:

Overall and specifically at day 184, the median survival date as identified above, most of these curves is significantly different from one another (as indicated by nearly all p-values being < 0.05). This tells us that the different representations in the chart above are meaningful. But how? Without additional information regarding the numbered channels, it's hard to tell a compelling story as to why some customers see higher attrition than others. Still, KKBox may want to explore why some channels seem to have higher retention rates and examine differences in cost associated with each channel in order to maximize the effectiveness of their customer acquisition efforts.

Now, let's do this same analysis for the payment method used when the subscription was created:

%sql

SELECT
  init_payment_method_id,
  COUNT(DISTINCT msno) as customers
FROM subscription_attributes
GROUP BY init_payment_method_id
ORDER BY customers DESC
411006879
35459019
38350692
40136553
36119644
39106264
32104970
3755817
3438976
2934254
3033063
3327362
2820044
2215887
2015305
3115049
258147
175193
274266
134192
123413
242968
232964
191736
161629
211426
181237
26834
7810
14744
8457
6365
5359
15136
3117
11105
1066
231
415

The number of payment methods in the dataset is quite large. Unlike with registration channels where there was a clear delineation between the popular and the unpopular channels, the decline in subscription counts with payment methods is more gradual. With this in mind, we'll set an arbitrary cutoff of 10,000 members associated with a payment method for it to be included in our analysis:

payment_methods_pd = spark.sql('''
  SELECT
    duration_days,
    churned,
    init_payment_method_id
  FROM subscription_attributes
  WHERE init_payment_method_id IN (
    SELECT
      init_payment_method_id
    FROM subscription_attributes
    GROUP BY init_payment_method_id
    HAVING COUNT(DISTINCT msno)>10000
    )''').toPandas()
# configure the plot
plt.figure(figsize=(12,8))
ax = plt.subplot(111)
plt.title('By Initial Payment Method', fontsize='xx-large')   

# configure the y-axis
plt.ylabel('Survival Rate', fontsize='x-large')
plt.ylim((0.0, 1.0))
plt.yticks(np.arange(0.0,1.0,0.05))

# configure the x-axis
plt.xlabel('Timeline (Days)', fontsize='x-large')
plt.xticks(range(0,800,30))

# calculate the surival rates 
for name, grouped_pd in payment_methods_pd.groupby('init_payment_method_id'):
    kmf = KaplanMeierFitter(alpha=0.05)
    kmf.fit(
      grouped_pd['duration_days'], 
      grouped_pd['churned'], 
      label='Method {0}'.format(name)
      )
    _ = kmf.plot(ax=ax)
    _.legend(loc='upper right')

Even when focusing just on the more popular methods, the chart above is quite a bit busier than the one before. This is a chart for which we should carefully consider the statistical differences between any two payment methods before drawing too many hard conclusions. But in the interest of space, we'll interpret this chart now, addressing the pairwise statistical comparisons in the cell below.

Without knowledge with which to speculate why, it's very interesting that some methods have very different drop-off rates. For example, compare method 34 at the top of the chart and method 35 at the bottom. Do these different payment methods indicate differences in ability to pay for a service over the long-term, e.g. credit cards associated with higher earners or those with higher credit scores vs. cards associated with lower earners or those with lower scores? Alternatively, could some of these initial payment methods be tied to vouchers where the customer somehow is foregoing payment or receiving the service at a discount during an initial 30-day window. When the customer is then asked to pay the regular price, the customer may be dropping out of the service as they weren't terribly invested in the service from the outset. (It's important to remember we only know the initial payment method used and not subsequent payment methods employed, though that data is presumably in our transaction log dataset.) Again, without more business knowledge, we can only speculate, but given the large magnitude differences show here, this would be an aspect of customer acquisition worth exploring in more detail.

Here is the pairwise comparison, limited to those curves that don't quite differ enough from one another to be considered statistically different:

log_rank = pairwise_logrank_test(payment_methods_pd['duration_days'], payment_methods_pd['init_payment_method_id'], payment_methods_pd['churned'])
summary = log_rank.summary
summary[summary['p']>=0.05]
Out[13]:

From the Log Rank test results, it would appear that payment methods 22 and 32 are statistically undifferentiated (as are 20 and 40).

Next, let's examine the days configured for the payment plan at the initiation of a subscription. This is an odd attribute of the subscriptions as it could be viewed as either continuous or discrete. Let's treat it as descrete here to see how we might handle it in later analysis.

At this point, I assume the pattern for this analysis is familiar:

%sql

SELECT
  init_payment_plan_days,
  COUNT(DISTINCT msno) as customers
FROM subscription_attributes
GROUP BY init_payment_plan_days
ORDER BY customers DESC
301723084
7476867
31226559
41083708
19555299
18023476
1015544
9013218
10012510
3959158
145376
2003958
3603805
603400
1203041
4501814
2401402
3651259
4001164
4151159
451102
1916
21671
270453
35252
70227
80144
99103
6665
23063
249
335
11027
1517
4251
951
payment_plan_days_pd = spark.sql('''
  SELECT
    duration_days,
    churned,
    init_payment_plan_days
  FROM subscription_attributes
  WHERE init_payment_plan_days IN (
    SELECT
      init_payment_plan_days
    FROM subscription_attributes
    GROUP BY init_payment_plan_days
    HAVING COUNT(DISTINCT msno)>10000
    )''').toPandas()
   
plt.figure(figsize=(12,8))
ax = plt.subplot(111)
plt.title('By Initial Payment Plan Days', fontsize='xx-large')

plt.ylabel('Survival Rate', fontsize='x-large')
plt.ylim((0.0, 1.0))
plt.yticks(np.arange(0.0,1.0,0.05))

plt.xlabel('Timeline (Days)', fontsize='x-large')
plt.xticks(range(0,800,30))

# calculate the surival rates 
for name, grouped_pd in payment_plan_days_pd.groupby('init_payment_plan_days'):
    kmf = KaplanMeierFitter(alpha=0.05)
    kmf.fit(
      grouped_pd['duration_days'], 
      grouped_pd['churned'], 
      label='Days {0}'.format(name)
      )
    _ = kmf.plot(ax=ax)
    _.legend(loc='upper right')

It appears as if most customers make it to their first renewal and then there is a sizeable drop. Not every plan sees the same rate of drop-out at that time with the 7-day and 10-days plans having massive drop-out at the first renewal compared to the 30-day plan which KKBox indicates is the traditional plan. Also interesting is that the 100-day plan sees a steep drop in customers at it's first renewal point (with it's survival rate following below that of the 30-day subscribers) while the 90-day subscribers follow a very different trajectory until later in the subscription lifecycle.

While the confidence intervals on these curves are more visible than some of the earlier ones, statistically, all the curves are significant:

log_rank = pairwise_logrank_test(payment_plan_days_pd['duration_days'], payment_plan_days_pd['init_payment_plan_days'], payment_plan_days_pd['churned'])
summary = log_rank.summary
summary[summary['p']>=0.05]
Out[15]:

Lastly, let's consider whether subscribers who previously churned might follow different paths from those who have only had a single subscription:

plt.figure(figsize=(12,8))
ax = plt.subplot(111)
plt.title('By Prior Subscription Count', fontsize='xx-large')

plt.ylabel('Survival Rate', fontsize='x-large')
plt.ylim((0.0, 1.0))
plt.yticks(np.arange(0.0,1.0,0.05))

plt.xlabel('Timeline (Days)', fontsize='x-large')
plt.xticks(range(0,800,30))
 
for name, grouped_pd in subscriptions_pd.groupby('prior_subscriptions'):
    kmf = KaplanMeierFitter(alpha=0.05)
    kmf.fit(
      grouped_pd['duration_days'], 
      grouped_pd['churned'], 
      label='Prior Subscriptions {0}'.format(name)
      )
    kmf.plot(ax=ax)
log_rank = pairwise_logrank_test(subscriptions_pd['duration_days'], subscriptions_pd['prior_subscriptions'], subscriptions_pd['churned'])
summary = log_rank.summary
summary[summary['p']>=0.05]
Out[17]:

As the number of prior subscription increases, the number of subscribers falling into each category declines leading to larger and larger confidence intervals. There does appear to be a general pattern of subscribers with a few prior subscriptions being retained at higher rates but only up to a point. And then it appears the number of priors doesn't really offer any protection from drop-out. The lack of statistical significance would encourage us not to put too much emphasis on this narrative but still it might be interesting for KKBox to examine.