Will you redeem a cup of coffee?

Mar 24, 2021

Project Introduction

The Starbucks on the street corner might be one of the most ordinary yet important places in our daily lives. Most of us have walked into a Starbucks for a cup of coffee at some point, and for some people it is a daily must. Many coffee lovers have become Starbucks members and receive offers to redeem a cup of coffee or earn a reward for their next visit. However, not every offer is completed. Other people may have no interest in becoming a member or receiving offers, yet they still buy coffee frequently. So, what factors predict whether a customer will redeem an offer? And among those who buy coffee without an offer, what factors predict their average transaction amount? For my Capstone Challenge in Udacity’s Data Scientist Nanodegree, I obtained simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one, get one free). Some users might not receive any offer during certain weeks, and not all users receive the same offer.

Data Description

The current project includes three datasets:

portfolio.json

· id (string) — offer id

· offer_type (string) — type of offer, i.e., BOGO, discount, informational

· difficulty (int) — minimum required spend to complete an offer

· reward (int) — reward given for completing an offer

· duration (int) — time for offer to be open, in days

· channels (list of strings)

profile.json

· age (int) — age of the customer

· became_member_on (int) — date when customer created an app account

· gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)

· id (str) — customer id

· income (float) — customer’s income

transcript.json

· event (str) — record description (i.e., transaction, offer received, offer viewed, etc.)

· person (str) — customer id

· time (int) — time in hours since start of test. The data begins at time t=0

· value (dict of strings) — either an offer id or a transaction amount, depending on the record
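As a minimal sketch of how one of these files could be loaded with pandas: the example below assumes the files are stored as newline-delimited JSON records, which the descriptions above do not confirm. A hypothetical two-row sample stands in for portfolio.json so the snippet runs on its own.

```python
import io

import pandas as pd

# Hypothetical two-row sample standing in for portfolio.json.
# Assumption: the real files are newline-delimited JSON records.
sample = io.StringIO(
    '{"id": "offer_a", "offer_type": "bogo", "difficulty": 10, '
    '"reward": 10, "duration": 7, "channels": ["email", "mobile"]}\n'
    '{"id": "offer_b", "offer_type": "informational", "difficulty": 0, '
    '"reward": 0, "duration": 3, "channels": ["web", "email"]}\n'
)

# For a file on disk, pass its path instead of the StringIO object.
portfolio = pd.read_json(sample, orient="records", lines=True)
print(portfolio.dtypes)
```

The channels column comes back as Python lists, which is why it needs special handling before modeling.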

Project goals and hypothesis

The dataset opens up many possibilities, and many different analyses could be performed on it. After reviewing the data, I chose to explore the following three questions:

Q1. How does the number of new members change over time?

· My hypothesis is that the number of new members increases every year.

Q2. For customers who did not receive an offer but still purchased coffee, which demographic factor predicts their average transaction amount?

· My hypothesis is that annual income is the most important predictor.

Q3. For the offer-related data, which factors predict whether an offer was eventually completed?

· To be honest, it is hard to form a precise hypothesis for this question because there are so many predictors, but I still hypothesize that the reward and the offer type are the most important ones.

Cleaning and visualizing the data

· Portfolio dataset

The portfolio dataset provides detailed information on the 10 different offers (i.e., reward, channels, difficulty, duration, offer type, offer id).

This dataset is simple, but I needed to create dummy columns for the categorical variables for further analysis. The processed data is shown in Figure 1:

Figure 1. Screenshot of the portfolio dataset
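The dummy-column step could look roughly like the sketch below. The rows are hypothetical and the exact column handling is my assumption, not necessarily what produced Figure 1.

```python
import pandas as pd

# Hypothetical rows with the same columns as portfolio.json.
portfolio = pd.DataFrame({
    "id": ["offer_a", "offer_b"],
    "offer_type": ["bogo", "discount"],
    "channels": [["email", "mobile", "social"], ["web", "email"]],
})

# One 0/1 column per channel: join each list into a delimited string,
# then let pandas split it back into indicator columns.
channel_dummies = portfolio["channels"].str.join("|").str.get_dummies(sep="|")

# One 0/1 column per offer type.
type_dummies = pd.get_dummies(portfolio["offer_type"], dtype=int)

portfolio_clean = pd.concat(
    [portfolio.drop(columns=["channels", "offer_type"]), channel_dummies, type_dummies],
    axis=1,
)
print(portfolio_clean)
```

Joining the lists first is just one convenient way to handle a list-valued column; exploding the column and pivoting would work as well.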

· Profile dataset

The profile dataset provides detailed information about the customers (i.e., age, gender, annual income, and the date they became a member).

First, I dropped the rows with null values. Next, I ran a brief exploratory analysis to visualize the demographic distributions (Figure 2).

Next, I grouped age and income into categories for further analysis.

Finally, I extracted the year from became_member_on for further analysis.
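A rough sketch of the grouping and year-extraction steps is shown here. The bin edges and labels are assumptions chosen for illustration, and I am also assuming that became_member_on holds integers in YYYYMMDD form.

```python
import pandas as pd

# Hypothetical rows with the same columns as profile.json.
profile = pd.DataFrame({
    "age": [25, 47, 68],
    "income": [38000.0, 72000.0, 105000.0],
    "became_member_on": [20150712, 20170304, 20180126],
})

# Assumed bin edges; intervals are closed on the right, e.g. (40, 60].
profile["age_group"] = pd.cut(
    profile["age"], bins=[17, 40, 60, 120], labels=["18-40", "40-60", "60+"]
)
profile["income_group"] = pd.cut(
    profile["income"], bins=[0, 50000, 90000, 200000], labels=["low", "medium", "high"]
)

# Assumption: the integer encodes a date as YYYYMMDD, e.g. 20170304.
profile["member_year"] = pd.to_datetime(
    profile["became_member_on"].astype(str), format="%Y%m%d"
).dt.year
print(profile)
```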

After cleaning the profile dataset, I can explore my first question:

Q1. How does the number of new members change over time (Figure 3)?

Figure 3. The number of new members by year and age

As shown in Figure 3, the number of new members increased across all age groups from 2013 to 2018. (The 2018 data do not cover the full year, so I assumed the upward trend continued in 2018.) Specifically, the 40–60 age group has the most new members.
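Counts like those in Figure 3 can be tabulated with a cross-tabulation. This is a small sketch on made-up rows; the column names member_year and age_group are hypothetical.

```python
import pandas as pd

# Made-up members; the columns mirror the derived year and age group.
profile = pd.DataFrame({
    "member_year": [2016, 2016, 2017, 2017, 2017],
    "age_group": ["18-40", "40-60", "40-60", "40-60", "60+"],
})

# Rows are sign-up years, columns are age groups, cells are member counts.
new_members = pd.crosstab(profile["member_year"], profile["age_group"])
print(new_members)
```

Calling new_members.plot(kind="bar") on such a table would give a grouped bar chart by year.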

· Transcript dataset

This dataset is the most complicated one. To clean it, I mainly created dummy columns for the offer type and offer values, and filled the NaN values with 0 for further analysis (Figure 4).
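One way to unpack the value dictionaries and build indicator columns is sketched below. The dictionary keys "offer id" and "amount" are assumptions about the raw records, and the rows are hypothetical.

```python
import pandas as pd

# Hypothetical events shaped like transcript.json records.
# Assumption: the dict keys are "offer id" and "amount".
transcript = pd.DataFrame({
    "person": ["p1", "p1", "p2"],
    "event": ["offer received", "transaction", "offer viewed"],
    "time": [0, 6, 12],
    "value": [{"offer id": "offer_a"}, {"amount": 4.5}, {"offer id": "offer_a"}],
})

# Expand each dict into its own columns; missing keys become NaN.
values = pd.DataFrame(transcript["value"].tolist(), index=transcript.index)
values = values.rename(columns={"offer id": "offer_id"})

# One 0/1 column per event type, with spaces replaced by underscores.
events = pd.get_dummies(transcript["event"], dtype=int)
events.columns = [name.replace(" ", "_") for name in events.columns]

transcript_clean = pd.concat(
    [transcript.drop(columns=["value", "event"]), values, events], axis=1
)
# Replace missing amounts with 0 so the column is fully numeric.
transcript_clean["amount"] = transcript_clean["amount"].fillna(0)
print(transcript_clean)
```

In this sketch only the amount column is filled with 0; rows without an offer keep a missing offer_id.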

Merging data

After cleaning the three datasets, I merged them by customer_id and offer_id and dropped the columns that were not needed.
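The merge can be sketched as two chained joins. The left joins and the tiny frames below are assumptions for illustration; the id columns are assumed to have already been renamed to customer_id and offer_id.

```python
import pandas as pd

# Tiny hypothetical versions of the three cleaned tables.
transcript = pd.DataFrame({
    "customer_id": ["p1", "p2"],
    "offer_id": ["offer_a", "offer_b"],
    "time": [0, 6],
})
profile = pd.DataFrame({"customer_id": ["p1", "p2"], "gender": ["F", "M"]})
portfolio = pd.DataFrame({"offer_id": ["offer_a", "offer_b"], "reward": [10, 2]})

# Left joins keep every event even when a lookup has no match (assumed choice).
merged = (
    transcript
    .merge(profile, on="customer_id", how="left")
    .merge(portfolio, on="offer_id", how="left")
)
print(merged)
```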

Machine learning analysis

Now I can investigate Questions 2 and 3.

Q2. For customers who did not receive an offer but still purchased coffee, which demographic factor predicts their average transaction amount?

For Q2, I am only interested in the transaction records, so I extracted that subset from the merged data (Figure 5).

Figure 5. Screenshot of the customer_transaction dataset

Metrics:

The average transaction amount is the response variable (y), while the demographic variables (i.e., age_group, income_group, and gender) are the predictors (X).

ML model:

A linear regression model was applied to explore this question.
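A setup along these lines can be sketched with scikit-learn. The data below are synthetic and the split ratio and dummy encoding are my assumptions, so the scores it prints are unrelated to the results reported in this post.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; none of these numbers come from the real dataset.
rng = np.random.default_rng(0)
n = 200
customers = pd.DataFrame({
    "gender": rng.choice(["F", "M", "O"], size=n),
    "age_group": rng.choice(["18-40", "40-60", "60+"], size=n),
    "income_group": rng.choice(["low", "medium", "high"], size=n),
})
customers["avg_transaction"] = (
    rng.normal(10, 2, size=n) + 3 * (customers["income_group"] == "high")
)

# Encode the categorical predictors as 0/1 columns, dropping one level each.
X = pd.get_dummies(
    customers[["gender", "age_group", "income_group"]], drop_first=True, dtype=float
)
y = customers["avg_transaction"]

# Assumed 70/30 split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
# Rank predictors by the absolute size of their coefficients.
coefficients = pd.Series(model.coef_, index=X.columns).sort_values(
    key=abs, ascending=False
)
print(r2_train, r2_test)
print(coefficients)
```

Because every predictor is a 0/1 dummy, the coefficient magnitudes are on a comparable scale, which is what makes ranking them meaningful.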

Results:

The R-squared on the training data was 0.252, and the R-squared on the test data was 0.237. The coefficient values are shown below:

Interestingly, income is not the most important predictor of the average transaction amount; gender is. Given this, I wanted to see which gender group has the highest average transaction. Since I was still interested in the influence of income, I examined both factors together (Figure 6).

Figure 6. Average transaction by income and gender
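A table behind a chart like Figure 6 can be built with a pivot table. The rows and column names here are made up for illustration.

```python
import pandas as pd

# Made-up customers with an average transaction amount each.
customers = pd.DataFrame({
    "gender": ["F", "F", "M", "M", "O"],
    "income_group": ["low", "high", "low", "high", "high"],
    "avg_transaction": [10.0, 20.0, 8.0, 16.0, 18.0],
})

# Mean transaction amount for every income group and gender combination.
by_group = customers.pivot_table(
    index="income_group", columns="gender", values="avg_transaction", aggfunc="mean"
)
print(by_group)
```

Combinations with no customers show up as NaN, which is worth checking before reading too much into a bar.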

Summary:

There is a clear trend that female customers have higher average transactions than the other two gender groups;
Although there are small differences between the gender groups, overall, customers with higher incomes have higher average transactions.

Q3. For the offer-related data, which factors predict whether an offer was eventually completed?

The most complicated part of the offer data is categorizing how each offer ended. After inspecting the data and its description, I categorized the offers into three validity types:

· Expired offer: The offer was not opened or completed during the valid duration

· Invalid offer: Some offers were recorded as “completed” within the duration, but the offer was not actually viewed

· Valid offer: The received offers were viewed and completed during the valid duration
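The three definitions above can be sketched as a small labeling function. The function name and arguments are hypothetical, and two details are my assumptions: an offer that is viewed but never completed counts as expired, and a view that happens only after completion does not make the offer valid.

```python
def label_offer(received, duration_days, viewed=None, completed=None):
    """Label one received offer as 'valid', 'invalid', or 'expired'.

    Times are hours since the start of the test; None means the event
    never happened. The duration is given in days.
    """
    deadline = received + duration_days * 24
    viewed_in_time = viewed is not None and received <= viewed <= deadline
    completed_in_time = completed is not None and received <= completed <= deadline

    # Valid: viewed first, then completed, both inside the window (assumed order).
    if completed_in_time and viewed_in_time and viewed <= completed:
        return "valid"
    # Invalid: completed inside the window without a prior view.
    if completed_in_time:
        return "invalid"
    # Expired: not completed inside the window.
    return "expired"


print(label_offer(0, 7, viewed=6, completed=30))
print(label_offer(0, 7, completed=30))
print(label_offer(0, 7, viewed=6))
```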

Metrics:

Offer validity (i.e., expired, invalid, valid) is the response variable (y); the other variables are the predictors X (i.e., duration, channel type, offer type, discount, complete_reward, difficulty, offer_reward, gender, age, income).

ML models:

I applied three different models (i.e., logistic regression, RandomForest, and DecisionTree) and compared the results.
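A comparison of this kind can be sketched with scikit-learn as follows. The data are synthetic, and the hyperparameters and the choice of accuracy and weighted F1 as scores are assumptions, so the numbers it prints say nothing about the real offers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic three-class data standing in for the offer table
# (classes play the role of expired, invalid, valid).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForestClassifier": RandomForestClassifier(n_estimators=100, random_state=42),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

# Fit each model on the same split and score it on the held-out rows.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "f1_weighted": f1_score(y_test, predictions, average="weighted"),
    }

for name, result in scores.items():
    print(name, result)
```

Scoring all models on the same held-out split is what makes the comparison fair.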

Results:

As shown in Table 2 above, the DecisionTreeClassifier model provides the best results. The top 5 predictors are listed below:

Figure 7. Top 5 predictors of the DecisionTree model
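A ranking like this can be read from a fitted tree's feature_importances_ attribute. In the sketch below the feature names are borrowed from the predictor list, but the data are synthetic, so the resulting order is meaningless and only the mechanics matter.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Names borrowed from the predictor list; the values are synthetic.
feature_names = [
    "duration", "difficulty", "offer_reward", "complete_reward",
    "income", "age", "social", "web",
]
X, y = make_classification(
    n_samples=400, n_features=len(feature_names), n_informative=4,
    n_classes=3, random_state=0,
)

tree = DecisionTreeClassifier(random_state=42).fit(X, y)

# Impurity-based importances sum to 1; keep the five largest.
importances = pd.Series(tree.feature_importances_, index=feature_names)
top5 = importances.sort_values(ascending=False).head(5)
print(top5)
```

Impurity-based importances can favor features with many distinct values, so permutation importance is a reasonable cross-check.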

Conclusions:

For transactions:

There is a clear trend that female customers have higher average transactions than the other two gender groups;
Although there are small differences between the gender groups, overall, the higher income groups have higher average transactions.

For offers:

All three models provide fair to good results, but the DecisionTree model generated the best result;
According to the results above, the reward given for completing the offer is the most important predictor of offer completion.
Among the three offer types (i.e., informational, discount, BOGO), informational offers are more likely to be completed.
The social channel might have a stronger effect on offer completion.
A longer offer duration might result in more completions.

Limitations:

· The cut points used to separate the income and age groups were somewhat arbitrary, which might influence the regression results;

· Further analysis should include more statistical tests (e.g., whether transactions differ significantly between income groups, and whether there is an interaction effect between income and gender).

· A recommendation analysis could be considered in the future.