Simple yet effective RFM

Looking for a quick and dirty way to segment your customer data that does not require machine learning? In this post we will take a look at a python implementation of a popular technique used by marketers to divide customers into buckets of high, medium, and low value customers. The technique is called RFM.

What is RFM?

RFM quantitively ranks and groups customers based on three factors:

How recent was the customer's last purchase. The more recently a customer made a purchase, the more likely that customer will make a purchase again as he/she would likely still have the product on their mind.

How frequently did the customer purchase products. Frequent patrons are likely to purchase again in the future and are therefore valuable to the business.

The amount of money the customer spent. Customers who spent a lot of money during a particular period of time are very valuable to the business, as they may make another purchase in the future.

As you may have guessed RFM is short for recency, frequency, and monetary. Here's how it works:

  1. First, rate each customer on each of the three factors. Generally a scale of 1 to 5 is used, with 5 being the highest possible score. However, you are free to choose a different scaling that makes better sense for your problem.
  2. Next step, average each score together to compute an RFM score. The higher the score is for a customer the more valuable the customer is. Alternatively, you can compute a weighted average of the three scores if certain factors in your problem are more important than others.
  3. Lastly, group each customer into high, medium, and low value buckets based on the RFM score.

RFM in action

Now that you're acquainted with RFM, we'll apply it to some customer data so you can see it in action. The dataset we'll use is from an UK based online retail store.

We'll start by reading the provided CSV file and inspecting the data attributes.

transactions_df = pd.read_csv('data.csv',encoding='ISO-8859-1')
transactions_df.info()

Here's what you'll see as output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB

For this exercise, we will be focusing on the InvoiceNo, Quantity, InvoiceDate, UnitPrice, and CustomerID attributes.

The CustomerID attribute uniquely identifies an online store customer.

The InvoiceNo, and InvoiceDate attributes uniquely identifies a transaction performed by a customer and the date it was performed.

A transaction in this dataset is composed of one or more items purchased at various quantities. The Quantity and UnitPrice attributes contains the amount of an item purchased and the cost per item.

Looking at the output above, it appears that the CustomerID attribute contains missing values. We will proceed by discarding the rows with the missing values from the dataset.

transactions_df = transactions_df[transactions_df['CustomerID'].notnull()].copy()

Next, we will convert the InvoiceDate attribute into a datetime attribute to make it easier to work with.

transactions_df['InvoiceDate'] = pd.to_datetime(transactions_df['InvoiceDate'])

Looking at the transaction data, it looks like there's some transactions with negative quantities. These are likely refunds made to the customer. We will discard these transactions from our analysis as well.

transactions_df = transactions_df[transactions_df['Quantity'] > 0]

Next up, we will create three new pandas dataframes containing computed recency, frequency, and monetary values for each customer.

We will make use of the newly converted InvoiceDate attribute to obtain the most recent transaction for each customer and compute the number of days since the latest transaction in the entire dataset.

# Get the most recent transaction in the entire dataset
most_recent_transaction = transactions_df['InvoiceDate'].max()

# Compute the lastest transaction for each user
latest_transactions_per_user = transactions_df.groupby('CustomerID')['InvoiceDate'].max()

recency_df = latest_transactions_per_user.reset_index()
recency_df['recency'] = recency_df['InvoiceDate'].apply(lambda date: (most_recent_transaction - date).days)

# Inspect the results
recency_df.head()
recency values per customer

Next we will create a dataframe containing the number of reqeat transactions made by each customer. The code snippet below accomplishes this by computing the number of unique invoices produced for each customer.

# Calculate the number of invididual invoices generated for each customer
num_transactions_per_user = transactions_df.groupby('CustomerID')['InvoiceNo'].unique().apply(lambda lst: len(lst))

frequency_df = num_transactions_per_user.reset_index().rename(columns={'InvoiceNo':'frequency'})

# Inspect the results
frequency_df
frequency values per customer

Finally we will calculate the total spend for each customer.

# Create a new column containing the total amount spent for each item
transactions_df['Total'] = transactions_df['Quantity'] * transactions_df['UnitPrice']

# Now, aggregate total spend by customer
monetary_df = transactions_df.groupby('CustomerID')['Total'].sum().reset_index().rename(columns={'Total':'monetary'})

monetary_df
monetary value per customer

Let's merge the recency, frequency, and monetary dataframes into one dataframe.

rfm_df = pd.merge(pd.merge(recency_df, frequency_df, on='CustomerID'), monetary_df, on='CustomerID').drop('InvoiceDate',axis=1)
combined recency, frequency, and monetary values

We'll now rank the recency, frequency, and monetary values on a scale of 1 to 5. We will use the pandas cut function can be used to bin each value into 5 evenly spaced intervals. You can also specify your own binning intervals you'd like cut to use as well.

rfm_df['recency_score'] = pd.cut(rfm_df['recency'], bins=5, labels=[5,4,3,2,1]).astype(int)
rfm_df['frequency_score'] = pd.cut(rfm_df['frequency'], bins=5, labels=[1,2,3,4,5]).astype(int)
rfm_df['monetary_score'] = pd.cut(rfm_df['monetary'], bins=5, labels=[1,2,3,4,5]).astype(int)
rankings per customer

Using the recency, frequency, and monetary ratings we will create an RFM score. We will calculate scores for each user by computing a weighed average of the three rankings.

rfm_df['rfm_score'] = (0.35*rfm_df['recency_score'] + 0.35*rfm_df['frequency_score'] + 0.40*rfm_df['monetary_score'])
rfm_df['rfm_score'] = rfm_df['rfm_score'].round(2)
RFM scores per user

Now that the RFM scores are computed for each customer, let's form three segments of high, medium, and low value customers.

  • We'll define high tier customers as customers with RFM scores greater than 3
  • Medium tier customers will be defined as customers with RFM scores between 2 and 3
  • Customers with RFM scores less than 2 will be placed in the low tier bucket.
def segment_customer(score):
    if score >= 3:
        return 'High'
    elif score >= 2 and score < 3:
        return 'Medium'
    else:
        return 'Low'

rfm_df['segment'] = rfm_df['rfm_score'].apply(segment_customer)
Customer segments

We got our segments! We can aggregate our customers by the segments by see what what it looks like.

rfm_df.groupby('segment')['CustomerID'].count()

The output...

segment
High        16
Low       1041
Medium    3282
Name: CustomerID, dtype: int64

That's all folks!

RFM is a simple and quick way to segment your customers into buckets that you can quickly take action on. If you're looking for segmentation techniques that utilize additional customer attributes, check out some of the other methods I discuss on the blog such as K-means. You can find find the entire code for this post here. Until next time!