In this exercise, we are asked to “identify” the VIP customers in the decision set of 30,000 customers. Data for the decision set of customers are found in etots_decision.csv. We will do this based on a model that we develop using the data in etots_training.csv, which contains the same type of data for another random sample of 30,000 customers.
We must decide which of the 30,000 customers in the decision set to designate for VIP treatment.
Your score would be zero if no one is assigned for VIP treatment (i.e., no incremental revenue and no incremental cost). What would be the score if you designate everyone in the etots training set for VIP treatment? Describe your calculations.
Based on the data in the etots training set, what is the value of perfect information (i.e., the difference in scores between designating everyone for VIP treatment vs. designating only those who “deserved” such treatment given their response)? Describe your calculations.
Describe the model you used to identify customers for VIP treatment.
Our decisions will be scored as follows:
We will be charged $3.25 for each VIP treatment;
We will receive an incremental revenue of $9.80 from each VIP-treated customer who made a purchase during Holiday 2000 (i.e., HOL_REP = 1).
Customers receiving regular treatment will not affect our score.
The following table summarizes the above rules:

                      HOL_REP = 1 (purchaser)      HOL_REP = 0 (non-purchaser)
VIP treatment         $9.80 − $3.25 = +$6.55       −$3.25
Regular treatment     $0                           $0
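These rules are straightforward to encode as a small scoring helper that we can reuse later as a sanity check (a minimal sketch; score_decisions is our own name, not part of the assignment):

score_decisions <- function(vip, hol_rep) {
  # vip: logical, TRUE where a customer is designated for VIP treatment
  # hol_rep: 0/1 purchase indicator (HOL_REP)
  sum(vip * (-3.25 + 9.80 * hol_rep))
}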
Let’s look at the dataset:
library(tidyverse)
library(data.table)
library(corrplot)
library(DT)
library(caret)
training_data <- read_csv("/Users/eduardocarrascosa/Desktop/MS-Analytics/Marketing-Analytics/Final/etots_training.csv")
decision_data <- read_csv("/Users/eduardocarrascosa/Desktop/MS-Analytics/Marketing-Analytics/Final/etots_decision.csv")
Let’s first have a look at these files:
There are a total of 36 features per customer. You can scroll the x-axis to see all features.
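In the rendered notebook this preview is an interactive table; something along these lines (a sketch using the DT package loaded above) produces it:

datatable(head(training_data, 100), options = list(scrollX = TRUE))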
The feature names are (let’s face it) not easily interpretable. How should you know what “J_9711” means, for example? So I am going to rename them here. This will make working with the dataset a lot easier (more consistency, and shorter names):
training_data <- training_data %>%
rename(
holiday_2020_purchase = HOL_REP,
holiday_2020_total_spend = HOL_MVAL,
holiday_2019_average_order_size = S_99HOL,
holiday_2019_total_spend = M_VAL99H,
offseason_2020_average_order_size = S_00OFF,
offseason_2020_total_spend = M_VAL00O,
recency = REC,
tenure = TENURE,
self_index = S_IDX,
not_self_index = N_S_INDX,
nov_2007 = J_9711,
dec_2007 = J_9712,
jan_2008 = J_9801,
feb_2008 = J_9802,
mar_2008 = J_9803,
apr_2008 = J_9804,
may_2008 = J_9805,
jun_2008 = J_9806,
jul_2008 = J_9807,
aug_2008 = J_9808,
sep_2008 = J_9809,
oct_2008 = J_9810,
nov_2008 = J_9811,
dec_2008 = J_9812,
jan_2009 = J_9901,
feb_2009 = J_9902,
mar_2009 = J_9903,
apr_2009 = J_9904,
may_2009 = J_9905,
jun_2009 = J_9906,
jul_2009 = J_9907,
aug_2009 = J_9908,
sep_2009 = J_9909
)
Much better :-). You can scroll the x-axis to see all features.
Gift and Self have missing data; let’s add missingness indicators and fill the NAs with zero.
colSums(is.na(training_data))
# Keep indicators for missingness before imputing zeros
training_data <- training_data %>%
  mutate(Missing_Self = is.na(Self), Missing_Gift = is.na(Gift)) %>%
  replace_na(list(Self = 0, Gift = 0))
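A quick check of our own confirms no NAs remain in these two columns:

stopifnot(!anyNA(training_data$Self), !anyNA(training_data$Gift))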
The holiday_2020_total_spend variable is not available in the decision set, so let’s remove it from the training data; holiday_2020_purchase will serve as the target variable.
training_data <- training_data %>%
select(-holiday_2020_total_spend)
How many customer rows are there in the data?
n <- nrow(training_data)
n
## [1] 30000
Check whether the target variable is balanced: the minority class (purchasers) is about 33% of the dataset, so it is mildly imbalanced.
# count() collapses to one row per class with the count in column n
training_data %>%
  count(holiday_2020_purchase) %>%
  ggplot(aes(x = factor(holiday_2020_purchase), y = n)) +
  geom_col(fill = "red")
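The exact class proportions (a quick check of our own) confirm the roughly one-third figure:

training_data %>%
  count(holiday_2020_purchase) %>%
  mutate(prop = n / sum(n))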
target <- factor(training_data$holiday_2020_purchase)
training_p <- 0.2
validation_p <- 0.35
validation_size <- validation_p * n
set.seed(909439224)
all_rows <- 1:n
# Stratified 20% training split; the rest is divided into validation and test.
training_index <- createDataPartition(target, p = training_p, list = TRUE)$Resample1
validation_index <- sample(setdiff(all_rows, training_index), size = validation_size)
test_index <- setdiff(setdiff(all_rows, training_index), validation_index)
# downSample()/upSample() treat the index vector as the lone "predictor", so the
# returned x column is a class-balanced resample of training row indices.
training_downsample <- downSample(training_index, target[training_index], list = FALSE)$x
training_upsample <- upSample(training_index, target[training_index], list = FALSE)$x
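A quick sanity check (our own addition) that the three index sets are disjoint and together cover all 30,000 rows:

length(training_index) + length(validation_index) + length(test_index) == n
length(intersect(training_index, validation_index))  # should be 0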
Within the upsampled training rows, the target variable is now balanced, with equal representation of buyers and non-buyers:
training_data %>%
  slice(training_upsample) %>%
  count(holiday_2020_purchase) %>%
  ggplot(aes(x = factor(holiday_2020_purchase), y = n)) +
  geom_col(fill = "red")
Next, standardize the continuous variables and convert the indicator and monthly-activity columns to factors (note the original code assigned to a misspelled trainig_data, so the transformations were silently dropped):

training_data <- training_data %>%
  mutate(
    # scale() returns a one-column matrix; as.numeric() keeps these as plain vectors
    across(c(holiday_2019_average_order_size, holiday_2019_total_spend,
             offseason_2020_average_order_size, offseason_2020_total_spend,
             recency, tenure),
           ~ as.numeric(scale(.x))),
    across(c(self_index, not_self_index,
             nov_2007, dec_2007, jan_2008, feb_2008, mar_2008, apr_2008,
             may_2008, jun_2008, jul_2008, aug_2008, sep_2008, oct_2008,
             nov_2008, dec_2008, jan_2009, feb_2009, mar_2009, apr_2009,
             may_2009, jun_2009, jul_2009, aug_2009, sep_2009),
           factor)
  )
At this stage we keep the test set untouched. We will train several candidate models on the downsampled training rows, compare them on the validation set, and only then evaluate the chosen model on the test set.
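Each candidate model goes through the same predict-and-evaluate step, so a small convenience wrapper (our own; the calls below spell the steps out instead) keeps the comparison uniform:

evaluate_on <- function(model, rows) {
  # Predict on the given rows of training_data and compare against the truth.
  preds <- predict(model, newdata = training_data[rows, ])
  confusionMatrix(preds, target[rows], positive = "1")
}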
model_1 <- train(
factor(holiday_2020_purchase) ~ .,
data = training_data[training_downsample, ],
method = 'rpart'
)
score_validation <- predict(model_1, newdata = training_data[validation_index, ])
confusionMatrix(score_validation, target[validation_index], positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5204 888
## 1 1780 2628
##
## Accuracy : 0.7459
## 95% CI : (0.7375, 0.7542)
## No Information Rate : 0.6651
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4634
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.7474
## Specificity : 0.7451
## Pos Pred Value : 0.5962
## Neg Pred Value : 0.8542
## Prevalence : 0.3349
## Detection Rate : 0.2503
## Detection Prevalence : 0.4198
## Balanced Accuracy : 0.7463
##
## 'Positive' Class : 1
##
model_2 <- train(
factor(holiday_2020_purchase) ~ .,
data = training_data[training_downsample, ],
method = 'rf'
)
score_validation <- predict(model_2, newdata = training_data[validation_index, ])
confusionMatrix(score_validation, target[validation_index], positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5224 899
## 1 1760 2617
##
## Accuracy : 0.7468
## 95% CI : (0.7383, 0.7551)
## No Information Rate : 0.6651
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4641
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.7443
## Specificity : 0.7480
## Pos Pred Value : 0.5979
## Neg Pred Value : 0.8532
## Prevalence : 0.3349
## Detection Rate : 0.2492
## Detection Prevalence : 0.4169
## Balanced Accuracy : 0.7462
##
## 'Positive' Class : 1
##
model_3 <- train(
factor(holiday_2020_purchase) ~ .,
data = training_data[training_downsample, ],
method = 'xgbTree',
verbosity = 0
)
score_validation <- predict(model_3, newdata = training_data[validation_index, ])
confusionMatrix(score_validation, target[validation_index], positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 5211 896
## 1 1773 2620
##
## Accuracy : 0.7458
## 95% CI : (0.7374, 0.7541)
## No Information Rate : 0.6651
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4626
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.7452
## Specificity : 0.7461
## Pos Pred Value : 0.5964
## Neg Pred Value : 0.8533
## Prevalence : 0.3349
## Detection Rate : 0.2495
## Detection Prevalence : 0.4184
## Balanced Accuracy : 0.7456
##
## 'Positive' Class : 1
##
All three models perform nearly identically on the validation set (accuracies within about 0.1 percentage point of one another), so let’s proceed with the gradient boosting model and evaluate it once on the held-out test set.
score_test <- predict(model_3, newdata = training_data[test_index, ])
confusionMatrix(score_test, target[test_index], positive="1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 7809 1405
## 1 2558 3806
##
## Accuracy : 0.7456
## 95% CI : (0.7387, 0.7524)
## No Information Rate : 0.6655
## P-Value [Acc > NIR] : < 0.00000000000000022
##
## Kappa : 0.4584
##
## Mcnemar's Test P-Value : < 0.00000000000000022
##
## Sensitivity : 0.7304
## Specificity : 0.7533
## Pos Pred Value : 0.5981
## Neg Pred Value : 0.8475
## Prevalence : 0.3345
## Detection Rate : 0.2443
## Detection Prevalence : 0.4085
## Balanced Accuracy : 0.7418
##
## 'Positive' Class : 1
##
If we assigned everyone in the training set to the VIP program we would make only $500: treating the 20,000 non-purchasers would cost 20,000 × $3.25 = $65,000, while each treated purchaser nets $9.80 − $3.25 = $6.55, so the 10,000 purchasers would bring in 10,000 × $6.55 = $65,500. The score would therefore be $65,500 − $65,000 = $500.
With perfect information we would designate only the 10,000 purchasers, for a score of 10,000 × $6.55 = $65,500. The value of perfect information, i.e., the gain over designating everyone, is therefore $65,500 − $500 = $65,000.
Our chosen gradient boosting model designated 3,806 actual purchasers in the test set for VIP treatment, a gain of 3,806 × $6.55 = $24,929.30. It also designated 2,558 non-purchasers, at a cost of 2,558 × $3.25 = $8,313.50. The test-set prediction score is therefore $24,929.30 − $8,313.50 = $16,615.80.
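The hand arithmetic above can be reproduced directly from the predictions with the score_decisions() helper defined at the top (a sketch of our own; it recomputes the counts rather than reading them off the confusion matrices):

hol <- as.numeric(as.character(target))

# Everyone in the training set designated for VIP treatment
# (≈ $500 if the purchaser count is exactly 10,000, as the text assumes)
score_decisions(vip = rep(TRUE, length(hol)), hol_rep = hol)

# The chosen model on the test set: matches the $16,615.80 computed above
score_decisions(vip = score_test == "1", hol_rep = hol[test_index])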