How much will you spend on Starbucks products?

Reza Dwi Utomo
Dec 6, 2021

Data Scientist Nanodegree Capstone Project

Photo by RR Abrot on Unsplash

This article is for the Starbucks Capstone Challenge of Udacity's Data Scientist Nanodegree. The data provided is simulated data that mimics customer behavior on the Starbucks rewards mobile app, i.e. whether a customer would buy a Starbucks product when given an informational or discount offer.

Business Understanding

Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offers during certain weeks. Every offer has a validity period before the offer expires.

Based on the information above, we can infer that a transaction made by a customer is what we ultimately want. No matter which offer you provide to customers, what you are after in the end is a transaction. Therefore, this project will try to predict how much money a customer will spend.

To solve the problem, we will use all datasets provided.

We will employ several machine learning algorithms. As this will be a regression problem, we will evaluate the algorithms based on several regression metrics, i.e. mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), and R2 score.

For more explanation of why those metrics were selected, I copied the descriptions from this TDS article.

  • Mean squared error

Mean Square Error is an absolute measure of the goodness of fit. It gives you an absolute number indicating how much your predicted results deviate from the actual numbers. You cannot interpret many insights from one single result, but it gives you a real number to compare against other model results and helps you select the best regression model.

  • Root mean square error

Root Mean Square Error is the square root of MSE. It is used more commonly than MSE because firstly sometimes MSE value can be too big to compare easily. Secondly, MSE is calculated by the square of error, and thus square root brings it back to the same level of prediction error and makes it easier for interpretation.

  • Mean absolute error

Mean Absolute Error is similar to Mean Square Error. However, instead of the sum of squared errors in MSE, MAE takes the sum of the absolute values of the errors. Compared to MSE or RMSE, MAE is a more direct representation of the sum of error terms. MSE penalizes big prediction errors more heavily by squaring them, while MAE treats all errors the same.

  • R2 Score

R Square measures how much of the variability in the dependent variable can be explained by the model. The R Square value is between 0 and 1, and a bigger value indicates a better fit between prediction and actual value. R Square is a good measure to determine how well the model fits the dependent variable. However, it does not take the overfitting problem into consideration.
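
For reference, writing $y_i$ for the actual values, $\hat{y}_i$ for the predictions, $\bar{y}$ for the mean of the actual values, and $n$ for the number of samples, these metrics are defined as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert, \qquad \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$$

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}.$$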

Data Understanding

The data is contained in three files:

  • portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
  • profile.json — demographic data for each customer
  • transcript.json — records for transactions, offers received, offers viewed, and offers completed

Here is the schema and explanation of each variable in the files:

portfolio.json

  • id (string) — offer id
  • offer_type (string) — type of offer, i.e. BOGO, discount, informational
  • difficulty (int) — minimum required spend to complete an offer
  • reward (int) — reward given for completing an offer
  • duration (int) — time for offer to be open, in days
  • channels (list of strings)

profile.json

  • age (int) — age of the customer
  • became_member_on (int) — date when customer created an app account
  • gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
  • id (str) — customer id
  • income (float) — customer’s income

transcript.json

  • event (str) — record description (i.e. transaction, offer received, offer viewed, etc.)
  • person (str) — customer id
  • time (int) — time in hours since start of test. The data begins at time t=0
  • value — (dict of strings) — either an offer id or transaction amount depending on the record

portfolio dataset

Let's see the information from the portfolio dataset.

We can see above that the portfolio dataset consists of 10 rows and 6 columns. No null values were found.

For each row in the channels column, there's a list value. We're going to split each of them and make one-hot encoding for each.

def split_channels(row):
    """used to split value in the channels col"""
    channel_list = ['email', 'mobile', 'social', 'web']
    output = []
    for channel in channel_list:
        if channel in row:
            output.append(1)
        else:
            output.append(0)
    return pd.Series(output)

# split the channels col
tmp_channels = portfolio.channels.apply(split_channels) \
                                 .rename(columns={0: 'email',
                                                  1: 'mobile',
                                                  2: 'social',
                                                  3: 'web'})
# concat the split values to the main df then drop the previous one
portfolio = pd.concat([portfolio, tmp_channels], axis=1) \
                .drop(columns='channels') \
                .rename(columns={'id': 'offer_id'})
portfolio

Nice. It’s clean for now, at least. Let’s move on to the profile dataset.

profile dataset

The profile dataset contains many more rows than the portfolio dataset and also contains many NaNs. The id column name is inconsistent with the other datasets and could be ambiguous, so let's rename it.
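
The rename itself isn't shown as a separate cell here; a one-liner like the following (assuming the target name customer_id used in the later merges) would do it:

# rename 'id' to 'customer_id' so it matches the other datasets
profile.rename(columns={'id': 'customer_id'}, inplace=True)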

Let's look at the dataset's null values in more detail.

There's something interesting here: the percentage of null values in the gender column is exactly equal to that of the income column. Perhaps every time gender is null, income is null as well. Let's check it.
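
The exact cell isn't reproduced here, but a minimal check could look like this (counting rows where gender is missing while income is not):

# how many rows have a null gender but a non-null income? expected: 0
(profile.gender.isnull() & profile.income.notnull()).sum()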

Correct! The value of 0 above means that there is not a single row where gender is NaN but income is not. In other words, the gender and income columns are NaN at the same time.

  • Check the gender column

Males account for the largest share.

  • Check the became_member_on column

Looking at the data types, the became_member_on column is not stored in the correct data type. The column should hold dates, not integer values.

# change data type of 'became_member_on' col
profile['became_member_on'] = pd.to_datetime(profile.became_member_on, format="%Y%m%d")

From the date data type, let's derive other features.

profile['registered_year'] = profile.became_member_on.dt.year
profile['registered_month'] = profile.became_member_on.dt.month
profile['registered_day'] = profile.became_member_on.dt.day
profile['registered_weekday'] = profile.became_member_on.dt.weekday
profile['registered_week'] = profile.became_member_on.dt.week
# then drop the original one
profile.drop(columns='became_member_on', inplace=True)

  • Check the age column

There are huge outliers, with ages approaching 120.

There's something fishy here. The number of ages equal to 118 is 2175. Earlier, we found that the number of NaNs in each of the gender and income columns is also 2175. Maybe they are related to each other. Let's check it.
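
One way to verify this (the original cell isn't shown; this sketch assumes the profile dataframe from above):

# rows with the suspicious default age
suspect = profile[profile.age == 118]
print(suspect.shape[0])                 # expected: 2175
print(suspect.gender.isnull().all())    # expected: True
print(suspect.income.isnull().all())    # expected: True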

Every time a value in the gender column is None, the corresponding values in the age and income columns are 118 and NaN, respectively. This pattern might not be a coincidence. It might indicate users who registered on the Starbucks app without providing gender, age, or income info. In other words, those values are default values for users with unavailable data.

Therefore, we should get rid of those NaNs.

transcript dataset

The transcript dataset doesn't have any missing values. The value column holds dict-like values; we will break it down a little bit later. For the person column, to make it more consistent, we're going to rename it to customer_id.
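
The rename isn't shown as a separate cell; it's presumably something like:

# rename 'person' to 'customer_id' for consistency with the profile dataset
transcript.rename(columns={'person': 'customer_id'}, inplace=True)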

# check unique keys in 'value' col
tmp_value = transcript.value.apply(lambda row: list(row.keys())) \
                            .apply(lambda row: row[0] if len(row) == 1 else ', '.join(row))
tmp_value.unique()

There are 4 unique keys in the value column. But there should be just three, as offer id and offer_id must refer to the same thing.

def split_value_col(dict_row):
    """used to split the 'value' col"""
    # get the dict and replace the ' ' in the 'offer id' key
    dict_row = {k.replace(' ', '_'): v for k, v in dict_row.items()}
    available_keys = ['offer_id', 'amount', 'reward']  # set the keys we want
    # loop through each key to check which keys are available
    output = []
    for key in available_keys:
        if key in dict_row.keys():
            output.append(dict_row[key])
        else:
            output.append(np.nan)
    return pd.Series(output)

# split the 'value' col
tmp_value = transcript.value.apply(split_value_col) \
                            .rename(columns={0: 'offer_id',
                                             1: 'amount',
                                             2: 'reward'})
# and concat it
transcript = pd.concat([transcript, tmp_value], axis=1) \
               .drop(columns='value')

Next, check offer_id column where the value is missing.

An offer_id that is not listed in the portfolio dataset (i.e. indicated by NaN) only occurs when the event equals transaction and the amount has a value. This is understandable, since a transaction made by a customer does not necessarily mean an offer was involved.

Join all datasets

Now, let’s join all three datasets.

First, let’s join both profile and transcript datasets. The key used here is the customer_id column.

# join both datasets
tmp_df = pd.merge(transcript, profile, how='left', on='customer_id')
tmp_df.head()

Next, let's join the previous result with the portfolio dataset.

# change 'reward' col name in order to distinguish it with the 'reward' col in 'transcript' dataset
portfolio.rename(columns={'reward': 'reward_std'}, inplace=True)
# join then sort its values
df = pd.merge(tmp_df, portfolio, how='left', on='offer_id')
df.sort_values(['customer_id', 'time'], inplace=True)

Preprocessing

As explained earlier, we assume that an age equal to 118 is a default value for unavailable data. Therefore, we're going to remove those rows.

# remove incomplete age data
df = df[df.age < 118]

Next, we’re going to transform the categorical columns. We will implement two kinds of categorical encoding, i.e. the label encoding and one-hot encoding. Our strategy is as follows:

  • label encoding: event and offer_type cols
  • one-hot encoding: gender col

The description about when we should use either of them over the other is explained clearly in this analyticsvidhya article.

The main reason we apply label encoding to the offer_type column is that, as explained at the beginning, the highest-ranked offer is BOGO and the lowest is informational, while the NaN values belong to rows of the transaction event with no offer attached. Therefore, we have the following ordinal ranking:

  • high: BOGO
  • medium: discount
  • low: informational
  • unwanted: NaN

For the event column, we're going to apply the same approach as for the offer_type column. The order, from highest to lowest, would be offer completed, transaction, offer viewed, and offer received.

# create the encode/decode dict
offer_type_dict_encode = {np.nan: 0,
                          'informational': 1,
                          'discount': 2,
                          'bogo': 3}
offer_type_dict_decode = {v: k for k, v in offer_type_dict_encode.items()}
# create the encode/decode dict
event_dict_decode = pd.Series(df.event.unique()).to_dict()
event_dict_encode = {v: k for k, v in event_dict_decode.items()}

XY = df.copy()
# implement label encoding
XY['offer_type'] = XY.offer_type.map(offer_type_dict_encode)
XY['event'] = XY.event.map(event_dict_encode)
# one hot encoding
XY = pd.concat([XY, pd.get_dummies(XY.gender)], axis=1) \
       .drop(columns='gender')
XY.head()

For modelling, we don't need the identification columns, as they would introduce bias.

cols2drop = ['customer_id', 'offer_id', 'reward_std']
XY.drop(columns=cols2drop, inplace=True)

Our target for modelling will be the amount column. As this column still has missing values, we're going to remove them.

# remove missing values
XY = XY[XY.amount.notnull()]

For the remaining columns that still have NaNs, we're going to check the number of unique values of each. If any column has only one unique value, we will drop the column.

unimportant_cols = []
for col in XY.columns:
    # if only 1 unique value
    if XY[col].nunique() < 2:
        print(col, XY[col].unique())
        unimportant_cols.append(col)
    else:  # if >= 2 unique values
        print(col, XY[col].nunique())

As seen above, we don’t need event, reward, difficulty, offer_type, email, mobile, social, and web columns.

XY.drop(columns=unimportant_cols, inplace=True)

Modelling

Preparing

Let’s see the target distribution.

There's a huge gap in the distribution. Let's see it in a boxplot.

A lot of outliers are identified. We can also see them by using the quantiles. There’s a big gap between the median and the max value.

# set the upper value of IQR
Q3 = XY.amount.describe()['75%']
Q1 = XY.amount.describe()['25%']
IQR = Q3 - Q1
upper = Q3 + (1.5*IQR)
# see the proportion of outliers
XY[XY.amount > upper].amount.shape[0] / XY.shape[0] * 100

Fortunately, the proportion is small, i.e. less than 1%, so it's not a problem to remove the outliers. Let's remove them.

# remove outliers
XY = XY[XY.amount <= upper]

Now, it’s much better. Next, let’s see the correlation among all features.

# Generate a mask for the upper triangle
XY_corr = XY.corr()
mask = np.triu(np.ones_like(XY_corr, dtype=bool))
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(XY_corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Nice. No multicollinearity is found in the dataset, except for the correlation between M and F. This occurs because both come from the same one-hot-encoded gender column, so whenever one of them is 1 the other is 0.

However, what makes this challenging is that none of the correlations between amount and the other columns is strong. This will make it hard for our models to produce accurate predictions.

Next, divide the final dataset into the input features and target label. Then, split those into train set and test set.

X = XY.drop(columns='amount').values
y = XY.amount.values
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.075,
                                                    random_state=0)
# check the proportion
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
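
The later cells refer to X_train_scaled and X_test_scaled, but the scaling step itself isn't shown in this write-up. Here is a minimal sketch, assuming a standard scaler fitted on the train set only (the original notebook may have used a different scaler):

from sklearn.preprocessing import StandardScaler

# fit the scaler on the train set only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)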

Build models

We will use 4 types of models, i.e. Random Forest, Gradient Boosting, Auto Sklearn, and a Neural Network. For the first two models, we're going to use the scikit-learn library. For the third one, the Auto Sklearn library will be used; it automates many scikit-learn algorithms in order to find the best composed model for your problem. Finally, we will implement a moderately deep neural network using PyTorch.

  • Random Forest

# instantiate the model. just use the default parameters
regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train_scaled, y_train)  # train the model

  • Gradient Boosting

# instantiate the model. just use the default parameters
regressor_GB = GradientBoostingRegressor(random_state=0)
regressor_GB.fit(X_train_scaled, y_train)  # train the model

  • Auto Sklearn

automl = AutoSklearnRegressor(time_left_for_this_task=3600,
                              per_run_time_limit=600,
                              resampling_strategy='cv',
                              resampling_strategy_arguments={'folds': 4},
                              seed=0,
                              tmp_folder='/tmp/autosklearn_regression_tmp')
# this will take up to 1 hour as time_left_for_this_task equals 3600
automl.fit(X_train_scaled, y_train, dataset_name='starbucks_predict_amount')

Afterwards, we can see the best model found by Auto Sklearn.

print(automl.leaderboard())

Unfortunately, Auto Sklearn found only 1 best model. If you want Auto Sklearn to explore other model possibilities, you could tune the arguments of the AutoSklearnRegressor() class above.

  • Neural Nets

In order to use a PyTorch NN model, we need to set up quite a few things. If you want to train the NN model on a GPU but don't have a CUDA-capable GPU installed on your local computer, you could use Google Colab as I did.
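
A minimal device-selection snippet (standard PyTorch, nothing project-specific) looks like this; on Colab it will pick the GPU when one is attached:

import torch

# use a CUDA GPU when available, otherwise fall back to CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)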

We're going to use an NN model with the following architecture:

  • ReLU(n_input, 16)
  • ReLU(16, 64)
  • ReLU(64, 8)
  • Linear(8, n_output)

This architecture is fairly arbitrary. You could define a better one if you want to.

# define the model
class LinearRegression(nn.Module):
    def __init__(self, n_feature: int, n_output: int):
        super(LinearRegression, self).__init__()
        self.hidden1 = torch.nn.Linear(n_feature, 16)  # hidden layer 1
        self.hidden2 = torch.nn.Linear(16, 64)         # hidden layer 2
        self.hidden3 = torch.nn.Linear(64, 8)          # hidden layer 3
        self.predict = torch.nn.Linear(8, n_output)    # output layer

    def forward(self, x):
        # set activation function for hidden layers
        x = nn.functional.relu(self.hidden1(x))
        x = nn.functional.relu(self.hidden2(x))
        x = nn.functional.relu(self.hidden3(x))
        y_pred = self.predict(x)  # linear output
        return y_pred

# training loop
def train(model: nn.Module,
          train_loader: torch.utils.data.DataLoader,
          val_loader: torch.utils.data.DataLoader,
          n_epochs: int,
          lr: float) -> nn.Module:
    # loss and optimizer
    criterion = nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # train the weights
    for epoch in range(n_epochs):
        for inputs, labels in train_loader:
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward propagation
            pred_y = model(inputs)
            # compute loss
            loss = criterion(pred_y, labels)
            # backward propagation to compute gradients
            loss.backward()
            # update parameters
            optimizer.step()
        # evaluate model on validation data
        mse_val = 0
        for inputs, labels in val_loader:
            mse_val += torch.sum(torch.pow(labels - model(inputs), 2)).item()
        mse_val /= len(val_loader.dataset)
        print(f'Epoch: {epoch + 1}: Val MSE: {mse_val}')
    return model
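
A minimal sketch of how this training loop might be called (the tensor conversion, batch size, number of epochs, and learning rate below are my assumptions, not the article's exact settings):

from torch.utils.data import TensorDataset, DataLoader

# wrap the numpy arrays into tensors and data loaders
X_train_t = torch.tensor(X_train_scaled, dtype=torch.float32)
y_train_t = torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1)
X_test_t = torch.tensor(X_test_scaled, dtype=torch.float32)
y_test_t = torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1)

train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=64, shuffle=True)
val_loader = DataLoader(TensorDataset(X_test_t, y_test_t), batch_size=64)

# instantiate and train the model
model = LinearRegression(n_feature=X_train_t.shape[1], n_output=1)
model = train(model, train_loader, val_loader, n_epochs=20, lr=1e-3)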

Model Evaluation and Validation

As explained earlier, this is a regression problem. Therefore, the metrics we used to evaluate all models are those for regression problems, i.e. the MAE, MSE, RMSE, and R2 score described earlier.
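
The exact evaluation cells aren't reproduced here; as a rough sketch (the helper below and its use of the fitted regressors and the scaled test set are my assumptions), each model could be scored like this:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def evaluate(model, X, y_true):
    """print the four regression metrics for a fitted model (hypothetical helper)"""
    y_pred = model.predict(X)
    mae = mean_absolute_error(y_true, y_pred)
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_true, y_pred)
    print(f'MAE: {mae:.3f} | MSE: {mse:.3f} | RMSE: {rmse:.3f} | R2: {r2:.3f}')

# e.g. for the Random Forest and Gradient Boosting models trained earlier
evaluate(regressor, X_test_scaled, y_test)
evaluate(regressor_GB, X_test_scaled, y_test)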

The results from the four models above do not show good performance. Therefore, we should carry out some model improvement.

As Random Forest performs the best so far, it is chosen for model improvement. We're going to use scikit-learn's GridSearchCV() class, which optimizes an estimator's hyperparameters via cross-validated grid search over a parameter grid.

In this case, I only specify combinations for 2 hyperparameters, i.e. n_estimators and bootstrap.

PS. I actually tried the grid search training several times with combinations of three, five, and six hyperparameters, but my local computer hangs whenever I run a grid search with many parameter combinations. I tried a few solutions for this issue, i.e. this solution, this solution, and this solution. I also reduced the n_jobs argument of GridSearchCV() to only 1. But the issue persisted, so I decided to reduce the number of hyperparameters to only two. Thankfully, the issue was solved when I used combinations of only 2 hyperparameters with n_jobs equal to 1 and the Pipeline() class.

If you have a machine with more computing resources, you could run many more hyperparameter combinations, and even more regressors, as sketched after the grid search code below.

# instantiate model
random_forest = RandomForestRegressor(random_state=1)

class DummyEstimator(BaseEstimator):
    """used to make a dummy estimator for Pipeline"""
    def fit(self): pass
    def score(self): pass

# set pipe
pipeline = Pipeline([
    ('regressor', DummyEstimator())
])
# set parameter grid
parameters = {'regressor': [random_forest],
              'regressor__bootstrap': (True, False),
              'regressor__n_estimators': (100, 150)
              }
# instantiate grid search
cv = GridSearchCV(pipeline, param_grid=parameters, verbose=2)
# see all parameters defined
cv.__dict__
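
As mentioned above, with more computing resources a broader search space could be explored. A hypothetical example (not what I actually ran) could add more hyperparameters and even a second regressor to the same pipeline:

# hypothetical larger search space; feasible only with more CPU power
parameters_big = [
    {'regressor': [RandomForestRegressor(random_state=1)],
     'regressor__n_estimators': (100, 200, 400),
     'regressor__max_depth': (None, 10, 20),
     'regressor__bootstrap': (True, False)},
    {'regressor': [GradientBoostingRegressor(random_state=1)],
     'regressor__n_estimators': (100, 200),
     'regressor__learning_rate': (0.05, 0.1)},
]
cv_big = GridSearchCV(pipeline, param_grid=parameters_big, n_jobs=1, verbose=2)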

From the hyperparameter combinations defined for the grid search, we expect to obtain a better model for our problem in this project. Let's train the grid search model.

# train the grid search model
cv.fit(X_train_scaled, y_train)

In case you’d like to save and load the models trained using Grid Search:

import joblib

# save all models from grid search
joblib.dump(cv, 'model/random_forest_cv.pkl')
# or just save the best model
joblib.dump(cv.best_estimator_,
            'model/random_forest_cv_best.pkl',
            compress=1)
# load model for further usage
joblib.load("model/random_forest_cv.pkl")
joblib.load("model/random_forest_cv_best.pkl")

The GridSearchCV is trained on the train set. As shown in the training output above, 20 fits are executed: 4 parameter candidates times 5 folds. Let's see the grid search results, ranked from best to worst.

cv_results = pd.DataFrame(cv.cv_results_)
cv_results = cv_results.sort_values('rank_test_score') \
                       .reset_index(drop=True)
cv_results.head()

print('Best Random Forest by GridSearchCV:', end='\n\n')
best_one = cv_results.loc[0, :]
for col, val in best_one.iteritems():
    if isinstance(val, dict):
        continue
    else:
        print(f"{col}: {val}")

Then, let's test the model on the test set.
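
A minimal sketch of that test-set check (assuming the scaled test set from earlier; the variable names are mine):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# evaluate the tuned model on the held-out test set
y_pred_best = cv.best_estimator_.predict(X_test_scaled)
print('Test MAE :', mean_absolute_error(y_test, y_pred_best))
print('Test RMSE:', np.sqrt(mean_squared_error(y_test, y_pred_best)))
print('Test R2  :', r2_score(y_test, y_pred_best))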

Takeaway points

Looking at all the models, we can see that none of them performs well. Random Forest provides the best performance in this case so far. Even the NN model, which one might expect to perform better, is unable to deliver the best results.

After improving the Random Forest model using Grid Search backed by 5-fold cross-validation, resulting in 20 fits, we expected the model to perform much better. Sadly, however, its performance improved only slightly.

Conclusions

Reflection

I found this project challenging, mainly due to the structure of the data in the transcript dataset. My goal is to predict the amount that a customer will spend, so that decision-makers would know how customers behave when purchasing Starbucks products. Therefore, I need all rows where amount exists, i.e. the rows with the event column equal to transaction. However, using only this subset of the dataset, the models I've implemented can't deliver good performance. Even when checking the correlations among all columns, we are unable to capture any strong relationship between amount and the other columns.

Perhaps, in order to capture broader factors, we should consider the other events, i.e. offer received, offer viewed, and offer completed. But if we include those three, we will end up with many missing values. These two needs pull in opposite directions.

Potential improvements

Due to time constraints, I didn't have a chance to enhance the data or tune the models further. For example, I could do more experiments on feature engineering to see whether adding or subtracting features improves the model, or I could try other combinations of model hyperparameters to see whether they affect model performance.

In addition, the analysis and modelling should also consider customers who don't make any transaction, as that condition also provides more insight.

Last but not least, computing resources should also be considered. In my case, where I needed GPU computing, I had to go to Google Colab. But unfortunately, when I needed more CPU power for Grid Search, I found that Google Colab doesn't provide enough multiprocessing. Likewise, my local computer doesn't deliver sufficient multiprocessing performance for this case. I think, for handling real-world data at enterprises or startups, it wouldn't be excessive to rent some cloud computing.

Acknowledgments

Acknowledgment should go to Starbucks for providing the dataset. This project is a capstone project of the Data Scientist Nanodegree on Udacity.

To see more details on the analysis, you can find it on my GitHub repo here.
