Starbucks Reward Data Analysis

Today I was delighted to submit my capstone project for the Machine Learning Engineer Nanodegree at Udacity. The dataset I analyzed was the Starbucks rewards dataset.


The Starbucks rewards dataset contains three types of data. This project combines all three to analyse how differences among customers and among offer types affect whether offers are viewed (not just received) and whether they are completed (not just viewed). The analysis covers many parameters: the age of the customers, their membership start date, their income, the type of offer being circulated, the effort required to complete an offer (measured by difficulty and duration), and so on. Hopefully it surfaces some interesting trends for the people or advertisers who circulate these offers.

Instead of walking through the whole code, we will discuss the techniques that helped me in the analysis. The full code can be found here.

Let's begin by understanding and doing some basic pre-processing of our data.

Portfolio data: Contains information regarding different types of offers that can be sent to the users of Starbucks rewards application.

  • Channels: The medium through which offers are sent
  • Difficulty: How much a customer needs to spend in order to avail the offer
  • Duration: Number of days the offer will be open to the customers
  • Id: The offer ID, to uniquely identify what characteristics an offer has.
  • Offer_type: Whether the offer is Buy One Get One (BOGO) or informational or discount
  • Reward: The amount that the customer will receive after completing the offer

Profile data: Contains information related to the customers.

  • Age: The age of the customer
  • Gender: Male/Female
  • Became_member_on: The joining date of the customer on the application.
  • Id: Customer ID, each customer has a unique customer ID

Transcript data: This is a record directory of the events that take place in the platform. Offers being received, viewed, and completed are recorded along with the paid transactions that happen through the application.

  • Event: Offer received/ Offer viewed/ Offer Completed/ Transaction
  • Person: The customer ID that was involved in the event
  • Time: Number of hours that have passed since the beginning of the test
  • Value: For 'offer received' and 'offer viewed' events, value contains the offer id. For 'offer completed' events, it contains the offer id along with the reward value. For transactions, it contains the amount that was transacted.

So, let's dive in to see how I approached this analysis. I have used Pandas and NumPy throughout the project, so make sure you import them:
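Both libraries are used in every step below, so the imports come first:

```python
import pandas as pd
import numpy as np
```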

I began by reading the JSON files:

portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)

Now, we rename certain columns to avoid confusion and be clearer. For example, we can rename the id column in portfolio to make it clear that it holds the offer ID and not a customer ID, which takes a simple rename.
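A minimal sketch of that rename; the target name offerid matches the column used later in this post, and the two-row frame here is just a stand-in for the real portfolio data:

```python
import pandas as pd

# stand-in rows for the portfolio frame read above (illustrative values only)
portfolio = pd.DataFrame({'id': ['o1', 'o2'],
                          'offer_type': ['bogo', 'discount']})

# rename 'id' to 'offerid' so it cannot be confused with a customer id
portfolio = portfolio.rename(columns={'id': 'offerid'})
```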


If a column takes multiple but a fixed set of values, you can use the pandas function get_dummies to generate one-hot encoded columns. This avoids having to work with string values in the rows. For example, I generated 3 columns out of offer_type: bogo (Buy One Get One), discount, and informational.

offertype_data = pd.get_dummies(portfolio['offer_type'])

To deal with columns in which more than one value is present in a single cell, we use a multi-label binariser, for example:

from sklearn.preprocessing import MultiLabelBinarizer
binariser = MultiLabelBinarizer()
binariser.fit(portfolio['channels']) # channels[0] = [email, mobile, social]
channels_data = pd.DataFrame(binariser.transform(portfolio['channels']), columns=binariser.classes_)

Concatenate all the columns that we created:

portfolio = pd.concat([portfolio, offertype_data,channels_data], axis=1)

and drop the unnecessary columns:

portfolio = portfolio.drop(columns=['offer_type','channels'])

To generate the final data frame, you can keep an explicit column list (including the encoded columns we just created):

columns_list = ['offerid', 'difficulty', 'duration', 'reward', 'bogo', 'discount', 'informational', 'email', 'mobile', 'social', 'web']

portfolio = portfolio[columns_list]

So, this simple processing can transform the portfolio data as shown here:



Similarly, basic pre-processing and cleaning is applied to profile and transcript data.

To begin with, we can clean up rows where values are missing or not useful. For example, in the profile data, we don't want rows with missing income values or 'O' (other) as the gender.

profile = profile[profile['income'].notnull()]
profile = profile[profile['gender'] != 'O']

In profile, LabelBinarizer was used instead of the multi-label version: it transforms the binary gender data (male/female) into 0 and 1, and numbers are easier to handle.

from sklearn.preprocessing import LabelBinarizer
binariser = LabelBinarizer()
profile['gender'] = binariser.fit_transform(profile['gender'])

Along with this, we transformed the age and joining-date columns into grouped columns that are aggregated and more informative.

As an example, we divided age into four groups: old, matured, millennial, and teen.

def define_age_group(age):
    """
    :param age: each element of the age column of the dataframe
    :return: transformed age group: old/matured/millenial/teen
    """
    if age > 70:
        return "old_age"
    elif age > 30:
        return "matured_age"
    elif age > 20:
        return "millenial_age"
    return "teen_age"

profile['age'] = profile['age'].apply(define_age_group)
age_desc = pd.get_dummies(profile['age'])

This is then one-hot encoded, as shown above, and appended in the columns as done previously with portfolio data.

In the transcript data, we extract the offer ids from the value column and rename that column to offerid. We also remove rows whose event is a transaction, because those rows have no associated offer id.
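A sketch of that step on a toy transcript frame. The dict keys used here ('offer id' in received/viewed rows, 'offer_id' in completed rows) are assumptions about how the value column is laid out, so check them against your own copy of the data:

```python
import pandas as pd

# toy transcript rows mimicking the structure described above
transcript = pd.DataFrame({
    'person': ['c1', 'c1', 'c2'],
    'event': ['offer received', 'transaction', 'offer completed'],
    'time': [0, 6, 24],
    'value': [{'offer id': 'o1'}, {'amount': 9.5},
              {'offer_id': 'o1', 'reward': 5}],
})

# transactions carry no offer id, so drop those rows
transcript = transcript[transcript['event'] != 'transaction'].copy()

# pull the offer id out of the 'value' dict, whichever key it lives under
transcript['offerid'] = transcript['value'].apply(
    lambda v: v.get('offer id', v.get('offer_id')))

# rename 'person' to 'customerid' and drop the raw 'value' column
transcript = (transcript.rename(columns={'person': 'customerid'})
              .drop(columns=['value']))
```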

So transformed profile data is now:


and transformed transcript data is:


To properly analyse the influence of customer characteristics and offer types on viewing and completion, we need to join the data frames. We will join them on the offer ids and customer ids:

combine_profile = transcript.join(profile.set_index(['customerid']), on=['customerid'])
combine_all = combine_profile.join(portfolio.set_index(['offerid']), on=['offerid'])

Our data frame will then have the following columns:

['event', 'customerid', 'offerid', 'gender', 'income', 'established_customers', 'old_customers', 'recent_customers', 'matured_age', 'millenial_age', 'old_age', 'teen_age', 'difficulty', 'duration', 'reward', 'bogo', 'discount', 'informational', 'email', 'mobile', 'social', 'web']

Our analysis is divided into two parts:

  • Receiving vs Viewing Offers: This will show the offers that have been received and the offers that have been viewed after being received
  • Viewing vs Completing the offers: This will show the offers that have been viewed and the offers that have been completed after being viewed

We construct two data frames for this purpose. We first make another column, id, containing customerid + offerid to uniquely identify an offer event. We then see which offers have been viewed by the customers and how many have been received but not yet viewed. Similarly, for the second data frame, we see how many customers completed a particular offer, provided that they viewed it first.

# we need to have a unique id: customer+offer id to be able to properly create dataframes
combine_all['id'] = combine_all['customerid']+combine_all['offerid']
viewed_offers = combine_all[combine_all['event']=='offer viewed']
received_offers = combine_all[combine_all['event']=='offer received']
completed_offers = combine_all[combine_all['event']=='offer completed']

only_received = received_offers[~received_offers['id'].isin(viewed_offers['id'])]
only_viewed = viewed_offers[~viewed_offers['id'].isin(completed_offers['id'])]
viewed_completed = completed_offers[completed_offers['id'].isin(viewed_offers['id'])]

So the viewed_received_df contains offers that are only received and not viewed and the offers that are viewed after being received, and

the completed_viewed_df contains offers that are only viewed and not completed and the offers that are completed after being viewed.

# DataFrame.append has been removed in recent pandas; pd.concat does the same job
viewed_received_df = pd.concat([viewed_offers, only_received], ignore_index=True)
completed_viewed_df = pd.concat([viewed_completed, only_viewed], ignore_index=True)

We can now carry out our analysis.

Analysis Part 1- Impact of different factors on offer views.

There are various factors, such as gender and age, that can be explored as part of this analysis. To give you a head start on analysing the different factors that influence the views an offer gets, here is an example.

We will take a look at which channels generate more views. First, we need a count, per channel, of how many offers were received but not viewed and how many were viewed. To do this, we proceed as follows.

Take a count of each of the column values (0/1, as they are one-hot encoded):

email_count = viewed_received_df.groupby(["event", "email"]).size()
mobile_count = viewed_received_df.groupby(["event", "mobile"]).size()
social_count = viewed_received_df.groupby(["event", "social"]).size()
web_count = viewed_received_df.groupby(["event", "web"]).size()

As these are one-hot encoded columns, we are only interested in rows where the value is 1, so for each channel we build a two-element list: the first element counts offers sent through that channel that were only received (event 0), the second counts offers that were viewed as well (event 1).

email_num = [email_count[0][1], email_count[1][1]]
mobile_num = [mobile_count[0][1], mobile_count[1][1]]
social_num = [social_count[0][1], social_count[1][1]]
web_num = [web_count[0][1], web_count[1][1]]

events = ["Received","Received & Viewed"]

We then use plt.bar with bottom values to generate stacked bars:

import matplotlib.pyplot as plt
plt.bar(events, email_num, bottom=mobile_num, color='grey', label='Email')
plt.bar(events, mobile_num, bottom=social_num, color='blue', label='Mobile')
plt.bar(events, social_num, bottom=web_num, color='green', label='Social')
plt.bar(events, web_num, bottom=None, color='purple', label='Web')
plt.legend()

The plot generated is shown below:


Interesting analysis from this plot:

  1. Email is not as effective as other channels for sending the offers
  2. Mobile is the channel where almost every offer is viewed
  3. Social platforms generated good views, so it is worth experimenting with more offers there.

Similarly, we can analyse the impact of other factors (features) in our data frame to understand their influence. The plots for other factors can be found here.

Analysis Part 2: Impact of different factors on offer completion (after being viewed)

Similar to the previous analysis, let's take the example of gender in this case.

#gender impact on completion of offer
gender_count = completed_viewed_df.groupby(["event", "gender"]).size()
female_stats = [gender_count[0][0],gender_count[1][0]]
male_stats = [gender_count[0][1],gender_count[1][1]]
events = ["Viewed", "Viewed & Completed"]
plt.bar(events, female_stats, bottom=None, color='pink', label='F')
plt.bar(events, male_stats, bottom=female_stats, color='blue', label='M')
plt.legend()

The generated plot is shown:


It can be seen that males are more likely than females to complete an offer after viewing it.

You can go forward and run your own analysis of the different factors. As part of my project, I was able to draw the following conclusions:

  1. The millennial age group never showed a decrease in offers viewed relative to offers received, or in offers completed relative to offers viewed; if certain offers are to be sent to a limited audience, people between 20 and 30 years of age should be targeted.
  2. Mobile platforms have shown the highest ratio of offer completion and should always be used for publishing offers.
  3. To enable maximum viewing, social platforms should be utilized.
  4. In order to boost the completion rates of the offer, offers with less difficulty and duration should be published more often.
