How much will you spend on Starbucks products?
Data Scientist Nanodegree Capstone Project
This article is for the Starbucks Capstone Challenge of the Data Scientist Nanodegree at Udacity. The datasets provided are simulated data that mimic customer behavior on the Starbucks rewards mobile app, i.e. whether a customer will make a purchase when provided an informational or discount offer.
Business Understanding
Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offers during certain weeks. Every offer has a validity period before the offer expires.
Based on the information above, we can infer that what Starbucks ultimately wants from any offer is a transaction: no matter which offer is provided, the desired outcome is that the customer spends money. Therefore, this project will try to predict how much money a customer will spend.
To solve the problem, we will use all datasets provided.
We will employ several machine learning algorithms. As this will be a regression problem, we will evaluate the algorithms based on several regression metrics, i.e. mean absolute error (MAE), mean squared error (MSE), root mean square error (RMSE), and R2 score.
For more explanation of why those metrics were selected, the descriptions below are quoted from this TDS article.
- Mean squared error
Mean Square Error is an absolute measure of the goodness of fit. It gives you an absolute number for how much your predicted results deviate from the actual numbers. You cannot interpret many insights from one single result, but it gives you a real number to compare against other model results and helps you select the best regression model.
- Root mean square error
Root Mean Square Error is the square root of MSE. It is used more commonly than MSE because, firstly, the MSE value can sometimes be too big to compare easily. Secondly, MSE is calculated from the square of the error, so taking the square root brings it back to the same scale as the prediction error and makes it easier to interpret.
- Mean absolute error
Mean Absolute Error is similar to Mean Square Error. However, instead of the sum of squared errors as in MSE, MAE takes the sum of the absolute values of the errors. Compared to MSE or RMSE, MAE is a more direct representation of the error terms: MSE penalizes big prediction errors more heavily by squaring them, while MAE treats all errors the same.
- R2 Score
R Square measures how much of the variability in the dependent variable can be explained by the model. The R Square value is between 0 and 1, and a bigger value indicates a better fit between prediction and actual value. R Square is a good measure of how well the model fits the dependent variable. However, it does not take the overfitting problem into consideration.
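As a quick reference, all four metrics are available in scikit-learn. The snippet below is only an illustrative sketch with made-up arrays, not part of the project code.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# toy example values, purely for illustration
y_true = np.array([12.5, 3.0, 7.8, 21.4])
y_pred = np.array([10.9, 4.2, 8.1, 18.7])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)  # RMSE is just the square root of MSE
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.3f}, MSE: {mse:.3f}, RMSE: {rmse:.3f}, R2: {r2:.3f}")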
Data Understanding
The data is contained in three files:
- portfolio.json — containing offer ids and meta data about each offer (duration, type, etc.)
- profile.json — demographic data for each customer
- transcript.json — records for transactions, offers received, offers viewed, and offers completed
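For context, the three files are line-delimited JSON; a minimal loading sketch (the file paths are assumptions and may differ from the actual project layout) could look like this:
import pandas as pd

# read in the json files (paths assumed; each file holds one record per line)
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)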
Here is the schema and explanation of each variable in the files:
portfolio.json
- id (string) — offer id
- offer_type (string) — type of offer ie BOGO, discount, informational
- difficulty (int) — minimum required spend to complete an offer
- reward (int) — reward given for completing an offer
- duration (int) — time for offer to be open, in days
- channels (list of strings)
profile.json
- age (int) — age of the customer
- became_member_on (int) — date when customer created an app account
- gender (str) — gender of the customer (note some entries contain ‘O’ for other rather than M or F)
- id (str) — customer id
- income (float) — customer’s income
transcript.json
- event (str) — record description (ie transaction, offer received, offer viewed, etc.)
- person (str) — customer id
- time (int) — time in hours since start of test. The data begins at time t=0
- value — (dict of strings) — either an offer id or transaction amount depending on the record
portfolio dataset
Let's see the information from the portfolio dataset.
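The output appears as an image in the original post; a minimal sketch of the inspection calls behind it (assuming the dataframe is already loaded as portfolio) is:
portfolio.info()   # column dtypes and non-null counts
portfolio.head(10) # all 10 offers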
We can see above that the portfolio dataset consists of 10 rows and 6 columns. No null values were found.
Each row in the channels column holds a list value. We're going to split each list and one-hot encode the channels.
import pandas as pd

def split_channels(row):
    """Split the list in the 'channels' col into one-hot indicator values."""
    channel_list = ['email', 'mobile', 'social', 'web']
    output = []
    for channel in channel_list:
        if channel in row:
            output.append(1)
        else:
            output.append(0)
    return pd.Series(output)

# split the 'channels' col
tmp_channels = portfolio.channels.apply(split_channels) \
                                 .rename(columns={0: 'email',
                                                  1: 'mobile',
                                                  2: 'social',
                                                  3: 'web'})

# concat the split values to the main df, then drop the original column
portfolio = pd.concat([portfolio, tmp_channels], axis=1) \
              .drop(columns='channels') \
              .rename(columns={'id': 'offer_id'})
portfolio
Nice. It's clean for now, at least. Let's move on to the profile dataset.
profile dataset
The dataset contains far more rows than the portfolio dataset. The profile dataset also contains many NaNs. The id column name is inconsistent with the other datasets and could be ambiguous, so let's rename it to customer_id.
Let's look at the dataset's null values in more detail.
There's something interesting: the percentage of null values in the gender column is exactly equal to that of the income column. Perhaps every time gender is null, income is null as well. Let's check it.
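The original check is shown only as output in the article; a minimal sketch of one way to do it (assuming the dataframe is named profile) is:
# percentage of nulls per column
print(profile.isnull().mean() * 100)

# rows where gender is null but income is not, and vice versa (both expected to be 0)
print((profile.gender.isnull() & profile.income.notnull()).sum())
print((profile.gender.notnull() & profile.income.isnull()).sum())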
Correct! The 0 value above means that there is not a single row where gender is NaN but income is not. In other words, the gender and income columns are NaN at the same time.
- Check the gender column
Male customers dominate the distribution.
- Check the became_member_on column
Looking at the data types, the became_member_on column is not stored in the correct data type. The column should hold dates, not integer values.
# change data type of 'became_member_on' col
profile['became_member_on'] = pd.to_datetime(profile.became_member_on, format="%Y%m%d")
From the datetime data type, let's derive some additional features.
profile['registered_year'] = profile.became_member_on.dt.year
profile['registered_month'] = profile.became_member_on.dt.month
profile['registered_day'] = profile.became_member_on.dt.day
profile['registered_weekday'] = profile.became_member_on.dt.weekday
# .dt.week is deprecated in newer pandas; isocalendar().week is the replacement
profile['registered_week'] = profile.became_member_on.dt.isocalendar().week.astype(int)

# then drop the original column
profile.drop(columns='became_member_on', inplace=True)
- Check the age column
There is a huge group of outliers, with ages approaching 120.
Something is fishy here. The number of ages equal to 118 is 2,175. Earlier, we found that the number of NaNs in both the gender and income columns is also 2,175. Maybe they are related. Let's check it.
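Again, the check itself is shown only as output in the article; a sketch of one way to verify the relationship (names assumed as above) is:
# for rows with missing gender, which ages and incomes do we see?
print(profile[profile.gender.isnull()].age.unique())           # expected: [118]
print(profile[profile.gender.isnull()].income.isnull().all())  # expected: True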
Every time the value in the gender column is None, the corresponding values in the age and income columns are 118 and NaN, respectively. This pattern is unlikely to be a coincidence: it probably corresponds to users who registered on the Starbucks app without providing gender, age, and income info. In other words, those values are the defaults for users with unavailable data.
Therefore, we should get rid of those NaNs.
transcript dataset
The transcript dataset doesn't have any missing values. The value column holds dict-like values, which we will break down a little later. For the person column, to make it consistent with the other datasets, we're going to rename it to customer_id.
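The rename calls themselves aren't shown in the article; a minimal sketch of the two renames (profile's id and transcript's person, both to customer_id) could be:
# make the id columns consistent across datasets
profile.rename(columns={'id': 'customer_id'}, inplace=True)
transcript.rename(columns={'person': 'customer_id'}, inplace=True)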
# check unique keys in 'value' col
tmp_value = transcript.value.apply(lambda row: list(row.keys())) \
.apply(lambda row: row[0] if len(row) == 1 else ', '.join(row))
tmp_value.unique()
There are 4 unique keys in the value column. But there should be only three, as offer id and offer_id must refer to the same thing.
import numpy as np

def split_value_col(dict_row):
    """Split the 'value' col into 'offer_id', 'amount', and 'reward'."""
    # normalize the key name: 'offer id' -> 'offer_id'
    dict_row = {k.replace(' ', '_'): v for k, v in dict_row.items()}
    available_keys = ['offer_id', 'amount', 'reward']  # the keys we want

    # loop through each key to check which ones are present
    output = []
    for key in available_keys:
        if key in dict_row.keys():
            output.append(dict_row[key])
        else:
            output.append(np.nan)
    return pd.Series(output)

# split the 'value' col
tmp_value = transcript.value.apply(split_value_col) \
                            .rename(columns={0: 'offer_id',
                                             1: 'amount',
                                             2: 'reward'})

# and concat it
transcript = pd.concat([transcript, tmp_value], axis=1) \
               .drop(columns='value')
Next, check the rows where the offer_id column is missing.
An offer_id that is not listed in the portfolio dataset (i.e. indicated by NaN) only occurs when the event equals transaction and the amount has a value. This is understandable, since a transaction made by a customer does not necessarily mean an offer was provided.
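A quick way to confirm this claim (a sketch; the article only shows the filtered output):
# events where offer_id is missing
print(transcript[transcript.offer_id.isnull()].event.unique())          # expected: ['transaction']

# all of those rows should carry an amount
print(transcript[transcript.offer_id.isnull()].amount.notnull().all())  # expected: True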
Join all datasets
Now, let’s join all three datasets.
First, let's join the profile and transcript datasets. The key used here is the customer_id column.
# join both datasets
tmp_df = pd.merge(transcript, profile, how='left', on='customer_id')

tmp_df.head()
Next, let's join the previous result with the portfolio dataset.
# change the 'reward' col name to distinguish it from the 'reward' col in the 'transcript' dataset
portfolio.rename(columns={'reward': 'reward_std'}, inplace=True)

# join, then sort the values
df = pd.merge(tmp_df, portfolio, how='left', on='offer_id')
df.sort_values(['customer_id', 'time'], inplace=True)
Preprocessing
As explained earlier, we assume that an age of 118 is the default value for unavailable data. Therefore, we're going to remove those rows.
# remove incomplete age data
df = df[df.age < 118]
Next, we're going to transform the categorical columns. We will apply two kinds of categorical encoding, i.e. label encoding and one-hot encoding. Our strategy is as follows:
- label encoding: the event and offer_type columns
- one-hot encoding: the gender column
When to prefer one over the other is explained clearly in this analyticsvidhya article.
The main reason we apply label encoding to the offer_type column is that, as explained at the beginning, BOGO offers rank highest, informational offers rank lowest, and the NaN values correspond to plain transaction events with no offer. Therefore, we have an ordinal scale:
- high: BOGO
- medium: discount
- low: informational
- unwanted: NaN
For the event column, we're going to do the same as for the offer_type column. The descending order would be offer completed, transaction, offer viewed, and offer received.
# create the encode/decode dict for 'offer_type'
offer_type_dict_encode = {np.nan: 0,
                          'informational': 1,
                          'discount': 2,
                          'bogo': 3}
offer_type_dict_decode = {v: k for k, v in offer_type_dict_encode.items()}

# create the encode/decode dict for 'event'
event_dict_decode = pd.Series(df.event.unique()).to_dict()
event_dict_encode = {v: k for k, v in event_dict_decode.items()}

XY = df.copy()

# implement label encoding
XY['offer_type'] = XY.offer_type.map(offer_type_dict_encode)
XY['event'] = XY.event.map(event_dict_encode)

# one-hot encoding
XY = pd.concat([XY, pd.get_dummies(XY.gender)], axis=1) \
       .drop(columns='gender')

XY.head()
For modelling, we don't need the identification columns, as they would only introduce bias.
cols2drop = ['customer_id', 'offer_id', 'reward_std']
XY.drop(columns=cols2drop, inplace=True)
Our target for modelling will be the amount column. As this column still has missing values, we're going to remove those rows.
# remove missing values
XY = XY[XY.amount.notnull()]
For the remaining columns that still have NaNs, we're going to check the number of unique values in each. If any column has only one unique value, we will drop it.
unimportant_cols = []
for col in XY.columns:
    # if only 1 unique value
    if XY[col].nunique() < 2:
        print(col, XY[col].unique())
        unimportant_cols.append(col)
    else:  # if >= 2 unique values
        print(col, XY[col].nunique())
As seen above, we don't need the event, reward, difficulty, offer_type, email, mobile, social, and web columns.
XY.drop(columns=unimportant_cols, inplace=True)
Modelling
Preparing
Let’s see the target distribution.
There's a huge gap in the distribution. Let's look at it in a boxplot.
A lot of outliers are identified. We can also see them using the quantiles: there's a big gap between the median and the maximum value.
# set the upper value of IQR
Q3 = XY.amount.describe()['75%']
Q1 = XY.amount.describe()['25%']
IQR = Q3 - Q1
upper = Q3 + (1.5 * IQR)

# see the proportion of outliers
XY[XY.amount > upper].amount.shape[0] / XY.shape[0] * 100
Fortunately, the proportion is small, i.e. less than 1%, so it's not a problem to remove the outliers. Let's remove them.
# remove outliers
XY = XY[XY.amount <= upper]
Now, it’s much better. Next, let’s see the correlation among all features.
# Generate a mask for the upper triangle
XY_corr = XY.corr()
mask = np.triu(np.ones_like(XY_corr, dtype=bool))

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(230, 20, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(XY_corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})
Nice. No multicollinearity is found in the dataset, except for the correlation between M and F. This occurs because they are dummy columns derived from the same gender column: when one is 1, the other is almost always 0.
However, what makes things challenging here is that the correlation between amount and every other column is weak. This will make it hard for our models to produce accurate predictions.
Next, divide the final dataset into the input features and target label. Then, split those into train set and test set.
from sklearn.model_selection import train_test_split

X = XY.drop(columns='amount').values
y = XY.amount.values

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.075,
                                                    random_state=0)

# check the proportion
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
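The model-training code below refers to X_train_scaled, but the scaling step isn't shown in the article. A minimal sketch, assuming a StandardScaler fit on the training split only:
from sklearn.preprocessing import StandardScaler

# fit the scaler on the train split only, then apply it to both splits
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)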
Build models
We will use 4 types of models, i.e. Random Forest, Gradient Boosting, Auto Sklearn, and a Neural Network. For the first two models, we're going to use the scikit-learn library. For the third, the Auto Sklearn library will be used; it automates many scikit-learn algorithms in order to find the best composed pipeline for your problem. Finally, we will implement a moderately deep neural network using PyTorch.
- Random Forest
from sklearn.ensemble import RandomForestRegressor

# instantiate the model. just use the default parameters
regressor = RandomForestRegressor(random_state=0)
regressor.fit(X_train_scaled, y_train)  # train the model
- Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor

# instantiate the model. just use the default parameters
regressor_GB = GradientBoostingRegressor(random_state=0)
regressor_GB.fit(X_train_scaled, y_train)  # train the model
- Auto Sklearn
from autosklearn.regression import AutoSklearnRegressor

automl = AutoSklearnRegressor(time_left_for_this_task=3600, per_run_time_limit=600,
                              resampling_strategy='cv', resampling_strategy_arguments={'folds': 4},
                              seed=0, tmp_folder='/tmp/autosklearn_regression_tmp')
# this will take up to 1 hour since time_left_for_this_task is 3600 seconds
automl.fit(X_train_scaled, y_train, dataset_name='starbucks_predict_amount')
Afterwards, we can see the best model found by Auto Sklearn.
print(automl.leaderboard())
Unfortunately, Auto Sklearn found only one best model. If you want Auto Sklearn to explore other candidate models, you could tune the arguments of the AutoSklearnRegressor() class above.
- Neural Nets
In order to use a PyTorch NN model, we need to set up quite a few things. If you want to train the NN model on a GPU but don't have a CUDA GPU on your local computer, you could use Google Colab as I did.
We're going to use an NN model with the following architecture:
- Linear(n_input, 16) + ReLU
- Linear(16, 64) + ReLU
- Linear(64, 8) + ReLU
- Linear(8, n_output)
This architecture is fairly arbitrary. You could define a better one if you want to.
import torch
import torch.nn as nn

# define the network
class LinearRegression(nn.Module):

    def __init__(self, n_feature: int, n_output: int):
        super(LinearRegression, self).__init__()
        self.hidden1 = torch.nn.Linear(n_feature, 16)  # hidden layer 1
        self.hidden2 = torch.nn.Linear(16, 64)         # hidden layer 2
        self.hidden3 = torch.nn.Linear(64, 8)          # hidden layer 3
        self.predict = torch.nn.Linear(8, n_output)    # output layer

    def forward(self, x):
        # apply ReLU activation to the hidden layers
        x = nn.functional.relu(self.hidden1(x))
        x = nn.functional.relu(self.hidden2(x))
        x = nn.functional.relu(self.hidden3(x))
        y_pred = self.predict(x)  # linear output
        return y_pred


# training loop
def train(model: nn.Module,
          train_loader: torch.utils.data.DataLoader,
          val_loader: torch.utils.data.DataLoader,
          n_epochs: int,
          lr: float) -> nn.Module:

    # loss and optimizer
    criterion = nn.MSELoss(reduction='mean')
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    # train the weights
    for epoch in range(n_epochs):
        for inputs, labels in train_loader:
            # zero the parameter gradients
            optimizer.zero_grad()
            # forward propagation
            pred_y = model(inputs)
            # compute loss
            loss = criterion(pred_y, labels)
            # backward propagation to compute gradients
            loss.backward()
            # update parameters
            optimizer.step()

        # evaluate the model on validation data
        mse_val = 0
        for inputs, labels in val_loader:
            mse_val += torch.sum(torch.pow(labels - model(inputs), 2)).item()
        mse_val /= len(val_loader.dataset)
        print(f'Epoch: {epoch + 1}: Val MSE: {mse_val}')

    return model
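The article doesn't show how the data loaders are built or how train() is called. The following is only a sketch with assumed batch size, epoch count, and learning rate, using the held-out test split as a stand-in validation set:
from torch.utils.data import TensorDataset, DataLoader

# wrap the numpy arrays into tensors (targets reshaped to a column vector)
train_ds = TensorDataset(torch.tensor(X_train_scaled, dtype=torch.float32),
                         torch.tensor(y_train, dtype=torch.float32).reshape(-1, 1))
val_ds = TensorDataset(torch.tensor(X_test_scaled, dtype=torch.float32),
                       torch.tensor(y_test, dtype=torch.float32).reshape(-1, 1))

train_loader = DataLoader(train_ds, batch_size=64, shuffle=True)
val_loader = DataLoader(val_ds, batch_size=64)

# hyperparameters here are illustrative, not necessarily the ones used in the project
model = LinearRegression(n_feature=X_train_scaled.shape[1], n_output=1)
model = train(model, train_loader, val_loader, n_epochs=20, lr=1e-3)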
Model Evaluation and Validation
As explained earlier, this is a regression problem, so the metrics used to evaluate all models are the regression metrics described above: MAE, MSE, RMSE, and R2 score.
The results from the four models above do not show good performance. Therefore, we should carry out model improvement.
As Random Forest performs the best so far, it is chosen for improvement. We're going to use scikit-learn's GridSearchCV() class, which optimizes the hyperparameters of an estimator by cross-validated grid search over a parameter grid.
In this case, I only specify combinations for 2 hyperparameters, i.e. n_estimators and bootstrap.
PS. I actually tried grid-search training several times with combinations of three, five, and six hyperparameters, but my local computer hangs every time I run a large number of grid-search combinations. I tried a few solutions for this issue, i.e. this solution, this solution, and this solution, and I also reduced the n_jobs argument in GridSearchCV() to 1, but the problem persisted, so I decided to reduce the number of hyperparameters to two. Thankfully, the issue is resolved when I only use combinations of 2 hyperparameters with n_jobs=1 and a Pipeline().
If you have a machine with more computing resources, you could run many more hyperparameter combinations, and even more regressors.
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# instantiate the model
random_forest = RandomForestRegressor(random_state=1)

class DummyEstimator(BaseEstimator):
    """used to make a dummy estimator for the Pipeline"""
    def fit(self): pass
    def score(self): pass

# set the pipeline
pipeline = Pipeline([
    ('regressor', DummyEstimator())
])

# set the parameter grid
parameters = {'regressor': [random_forest],
              'regressor__bootstrap': (True, False),
              'regressor__n_estimators': (100, 150)}

# instantiate grid search
cv = GridSearchCV(pipeline, param_grid=parameters, verbose=2)

# see all the parameters defined
cv.__dict__
From the hyperparameter combinations we defined above, we expect to have a better model for our problem in this project. Let’s train the grid search model.
# train the grid search model
cv.fit(X_train_scaled, y_train)
In case you’d like to save and load the models trained using Grid Search:
import joblib

# save all models from grid search
joblib.dump(cv, 'model/random_forest_cv.pkl')

# or just save the best model
joblib.dump(cv.best_estimator_,
            'model/random_forest_cv_best.pkl',
            compress=1)

# load the models for further usage
joblib.load("model/random_forest_cv.pkl")
joblib.load("model/random_forest_cv_best.pkl")
The GridSearchCV is trained on the train set. As shown in the training output above, 20 fits are executed: 4 parameter combinations times 5 folds. Let's see the grid search results, ranked from best to worst.
cv_results = pd.DataFrame(cv.cv_results_)
cv_results = cv_results.sort_values('rank_test_score') \
                       .reset_index(drop=True)
cv_results.head()

print('Best Random Forest by GridSearchCV:', end='\n\n')

best_one = cv_results.loc[0, :]
for col, val in best_one.items():  # .iteritems() is removed in newer pandas
    if isinstance(val, dict):
        continue
    else:
        print(f"{col}: {val}")
Then, let's test the model on the test set.
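The evaluation code isn't reproduced in the article; a sketch of scoring the tuned model on the test split with the metrics from earlier (assuming the scaled features X_test_scaled from the preprocessing sketch) could be:
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_pred = cv.best_estimator_.predict(X_test_scaled)

print('MAE :', mean_absolute_error(y_test, y_pred))
print('MSE :', mean_squared_error(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R2  :', r2_score(y_test, y_pred))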
Takeaway points
Across all models, none performs particularly well. Random Forest provides the best performance so far; even the NN model, which one might expect to do better, is unable to deliver the best performance.
After improving the Random Forest model using Grid Search backed by 5-fold Cross-Validation (resulting in 20 fits), we expected the model to perform much better. Sadly, however, its performance improves only slightly.
Conclusions
Reflection
I found this project challenging, mainly due to the structure of the data in the transcript dataset. My goal is to predict the amount a customer will spend, so that decision-makers would know how customers behave when purchasing Starbucks products. Therefore, I need all rows where amount exists, i.e. the rows whose event column equals transaction. However, using only this subset of the dataset, the models I've implemented can't deliver good performance. Even when checking the correlations among all columns, we are unable to capture any strong relationship between amount and the other columns.
Perhaps, in order to capture broader factors, we should consider the other events, i.e. offer received, offer viewed, and offer completed. But if we include those three, we will have many missing values. This is the trade-off.
Potential improvements
Due to time constraints, I didn't have a chance to perform further enhancement of the data or model tuning. For example, I could do more experiments on feature engineering to see if adding/subtracting features improves the model, or I could try other combinations of model hyperparameters to see how they affect performance.
In addition, the analysis and modelling should also consider customers who don't make any transaction, as that condition also provides more insight.
Last but not least, the computing resources should also be considered. In my case, when I needed GPU computing, I had to go to Google Colab. Unfortunately, when I needed more CPU power for Grid Search, I found that Google Colab doesn't provide the multiprocessing I needed, and my local computer doesn't deliver sufficient multiprocessing performance for this case either. I think, for handling real-world data at enterprises or startups, it wouldn't be excessive to rent some cloud computing.
Acknowledgments
Acknowledgment should go to Starbucks for providing the dataset. This project is the capstone project of the Data Scientist Nanodegree on Udacity.