Define martketing strategies to increase the revenues generated by the customers of the website through a better understanding of the customer segments.
We followed a 3 steps process:
Part 1: Data set inspection and dimensionality reduction
Part 2: Data analysis and marketing insights
Part 3: Final recommendation and strategy
First step: we load the data to use:
The data refers to the users and their interaction with the website until that time. Overall the dataset contains information on almost 1 million customers grouped in 667 types of customers described by 19 number of variables.
Variables description:
- Type of User: Returning or New Visitor
- Source: Organic vs Paid Promotions (cpc, cpm, referral)
- Users: Number of User Visits
- Sessions: Number of User Sessions
- No of Pages Visited: Number of total pages visited
- No of Transactions: Number of monetary transactions
- Revenue: Revenue generated
- Dummy Variable: Created dummy variables categorical data such as 'user type', 'source' and 'device type'
factor_attributes_used = c(1:19)
factor_selectionciterion =
minimum_variance_explained = 65
manual_numb_factors_used = 5
rotation_used = "varimax"
max_data_report = 5
factor_attributes_used <- intersect(factor_attributes_used, 1:ncol(myData))
ProjectDataFactor <- myData[,factor_attributes_used]
ProjectDataFactor <- as.matrix(ProjectDataFactor)
We start by doing a basic visual exploration of the first customer types:
**Data Visualization**
Customer01 | Customer02 | Customer03 | Customer04 | Customer05 | |
---|---|---|---|---|---|
Type.of.User | Returning Visitor | Returning Visitor | New Visitor | Returning Visitor | Returning Visitor |
Source | google / organic | google / organic | (direct) / (none) | google / organic | google / cpc |
Device | desktop | desktop | desktop | desktop | desktop |
Returning.Visitor | 1 | 1 | 0 | 1 | 1 |
New.Visitor | 0 | 0 | 1 | 0 | 0 |
Google.CPC | 0 | 0 | 0 | 0 | 1 |
TripAdvisor | 0 | 0 | 0 | 0 | 0 |
Direct | 0 | 0 | 1 | 0 | 0 |
DFA | 0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 0 | 0 | |
Referral | 0 | 0 | 0 | 0 | 0 |
Desktop | 1 | 1 | 1 | 1 | 1 |
Mobile | 0 | 0 | 0 | 0 | 0 |
Tablet | 0 | 0 | 0 | 0 | 0 |
Users | 7240 | 21442 | 2989 | 2637 | 3538 |
Sessions | 10776 | 35015 | 2996 | 4112 | 5179 |
No.of.Pages.Visited | 45328 | 173192 | 7989 | 16936 | 21534 |
No.of.Transactions | 64 | 130 | 13 | 15 | 35 |
Revenue | 456710.54 | 362368.88 | 341553.90 | 278226.12 | 216213.00 |
We can then analyze the data in terms of correlation between different variables. The objetctive is to get a first idea of possible cross-relationships between variables that could lead to a dimensionality reduction as a first step for the analysis.
**Correlation Table**
Returning.Visitor | New.Visitor | Google.CPC | TripAdvisor | Direct | DFA | Referral | Desktop | Mobile | Tablet | Users | Sessions | No.of.Pages.Visited | No.of.Transactions | Revenue | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Returning.Visitor | 1.00 | -1.00 | -0.03 | -0.01 | -0.19 | 0.11 | 0.08 | 0.09 | 0.10 | -0.06 | -0.07 | -0.25 | -0.18 | -0.16 | 0.05 | 0.03 |
New.Visitor | -1.00 | 1.00 | 0.03 | 0.01 | 0.19 | -0.11 | -0.08 | -0.09 | -0.10 | 0.06 | 0.07 | 0.25 | 0.18 | 0.16 | -0.05 | -0.03 |
Google.CPC | -0.03 | 0.03 | 1.00 | -0.07 | -0.13 | -0.09 | -0.07 | -0.23 | 0.13 | -0.09 | -0.09 | 0.04 | 0.05 | 0.05 | 0.14 | 0.10 |
TripAdvisor | -0.01 | 0.01 | -0.07 | 1.00 | -0.10 | -0.07 | -0.05 | 0.10 | -0.09 | 0.01 | 0.10 | -0.06 | -0.07 | -0.06 | -0.06 | -0.06 |
Direct | -0.19 | 0.19 | -0.13 | -0.10 | 1.00 | -0.13 | -0.10 | -0.31 | -0.13 | 0.12 | 0.06 | 0.15 | 0.14 | 0.10 | 0.11 | 0.16 |
DFA | 0.11 | -0.11 | -0.09 | -0.07 | -0.13 | 1.00 | -0.07 | -0.22 | 0.01 | -0.02 | 0.01 | -0.05 | -0.03 | -0.04 | 0.01 | -0.01 |
0.08 | -0.08 | -0.07 | -0.05 | -0.10 | -0.07 | 1.00 | -0.15 | 0.12 | -0.07 | -0.09 | -0.07 | -0.07 | -0.07 | -0.06 | -0.06 | |
Referral | 0.09 | -0.09 | -0.23 | 0.10 | -0.31 | -0.22 | -0.15 | 1.00 | -0.05 | 0.04 | 0.03 | -0.22 | -0.24 | -0.23 | -0.19 | -0.20 |
Desktop | 0.10 | -0.10 | 0.13 | -0.09 | -0.13 | 0.01 | 0.12 | -0.05 | 1.00 | -0.60 | -0.74 | -0.07 | -0.10 | -0.13 | 0.13 | 0.11 |
Mobile | -0.06 | 0.06 | -0.09 | 0.01 | 0.12 | -0.02 | -0.07 | 0.04 | -0.60 | 1.00 | -0.10 | 0.12 | 0.13 | 0.18 | -0.07 | -0.06 |
Tablet | -0.07 | 0.07 | -0.09 | 0.10 | 0.06 | 0.01 | -0.09 | 0.03 | -0.74 | -0.10 | 1.00 | -0.01 | 0.01 | 0.00 | -0.10 | -0.09 |
Users | -0.25 | 0.25 | 0.04 | -0.06 | 0.15 | -0.05 | -0.07 | -0.22 | -0.07 | 0.12 | -0.01 | 1.00 | 0.98 | 0.92 | 0.46 | 0.39 |
Sessions | -0.18 | 0.18 | 0.05 | -0.07 | 0.14 | -0.03 | -0.07 | -0.24 | -0.10 | 0.13 | 0.01 | 0.98 | 1.00 | 0.95 | 0.53 | 0.45 |
No.of.Pages.Visited | -0.16 | 0.16 | 0.05 | -0.06 | 0.10 | -0.04 | -0.07 | -0.23 | -0.13 | 0.18 | 0.00 | 0.92 | 0.95 | 1.00 | 0.44 | 0.40 |
No.of.Transactions | 0.05 | -0.05 | 0.14 | -0.06 | 0.11 | 0.01 | -0.06 | -0.19 | 0.13 | -0.07 | -0.10 | 0.46 | 0.53 | 0.44 | 1.00 | 0.74 |
Revenue | 0.03 | -0.03 | 0.10 | -0.06 | 0.16 | -0.01 | -0.06 | -0.20 | 0.11 | -0.06 | -0.09 | 0.39 | 0.45 | 0.40 | 0.74 | 1.00 |
From correlation matrix we can see that there is a good positive correlation between number of Sessions, number of pages visited, number of transactions and revenues. We can also see that Desktop device is the one that generates most of the traffic and revenues in absolute value.
To use Factor Analysis, we first need to adjust data to have only numeric data. We can then remove first columns and keep only binary values.
num_data <- myData[,4:ncol(myData)]
scaled_data <- apply(num_data,2, function(r) {if (sd(r)!=0) res=(r-mean(r))/sd(r) else res=0*r; res})
Now that we have only numeric data, we can proceed with Factor Analysis.
Variance_Explained_Table_results<-PCA(scaled_data, graph=FALSE)
Variance_Explained_Table<-Variance_Explained_Table_results$eig
Variance_Explained_Table<-as.data.frame(Variance_Explained_Table)
colnames(Variance_Explained_Table)<-c("Eigenvalue", "Percentage_of_explained_variance", "Cumulative_percentage_of_explained_variance")
eigenvalues <- Variance_Explained_Table[,2]
Let's look at the variance explained as well as the eigenvalues
**Variance Explained**
Eigenvalue | Percentage_of_explained_variance | Cumulative_percentage_of_explained_variance | |
---|---|---|---|
comp 1 | 3.84 | 24.01 | 24.01 |
comp 2 | 2.34 | 14.61 | 38.61 |
comp 3 | 1.81 | 11.29 | 49.91 |
comp 4 | 1.29 | 8.07 | 57.98 |
comp 5 | 1.17 | 7.33 | 65.31 |
comp 6 | 1.10 | 6.88 | 72.19 |
comp 7 | 1.05 | 6.55 | 78.74 |
comp 8 | 1.02 | 6.39 | 85.13 |
comp 9 | 0.92 | 5.74 | 90.87 |
comp 10 | 0.78 | 4.89 | 95.76 |
comp 11 | 0.34 | 2.15 | 97.91 |
comp 12 | 0.25 | 1.56 | 99.47 |
comp 13 | 0.07 | 0.45 | 99.93 |
comp 14 | 0.01 | 0.07 | 100.00 |
comp 15 | 0.00 | 0.00 | 100.00 |
comp 16 | 0.00 | 0.00 | 100.00 |
We can see from the chart above that we could use only 8 components. This can be easily explained looking at the variables. Some variables are binary values conveying the same meaning as the text variables. We can identify the following groups:
- Device
- Returning/New Visitor
- Source
- Revenues
- Number of Sessions
- Number of pages visited
- Number of Users
- Number of Transactions
In this case we suggest to keep all the original variables since they are just a way to provide the same information in a binary form.
To analyze different scenarios and identify possible strategies, we can focus on different behaviors across devices.
averageTransaction <- aggregate(myData$No.of.Transactions,list(myData$Device), FUN=sum)
colnames(averageTransaction) <- c("Device","No of Transactions")
knitr::kable(averageTransaction)
Device | No of Transactions |
---|---|
desktop | 3553 |
mobile | 119 |
tablet | 120 |
We can clearly see that desktop is the main channel used by customers to access the website, generating about 95% of total transactions. We can now focus on revenues to check if we find the same trend or a different behavios depending on the device.
averageRevenue <- aggregate(myData$Revenue,list(myData$Device), FUN=sum)
colnames(averageRevenue) <- c("Device","Revenues")
knitr::kable(averageRevenue)
Device | Revenues |
---|---|
desktop | 8459779.1 |
mobile | 291844.2 |
tablet | 310228.0 |
Even an analysis on revenues shows that desktop is the most used device, as expected from previous result. | |
We can then analyze revenues per transaction to check profitability for each device. |
RevTransaction <- averageRevenue
RevTransaction[,2] <- averageRevenue[,2]/averageTransaction[,2]
colnames(RevTransaction) <- c("Device","Revenue per Transaction")
knitr::kable(RevTransaction)
Device | Revenue per Transaction |
---|---|
desktop | 2381.024 |
mobile | 2452.472 |
tablet | 2585.233 |
We can see that revenues per transaction are quite stable cross-device, even though tablet is more profitable if compared to desktop and mobile devices.
Moving from devices to sources, we can then compare organic sources to payed sources (e.g. "cpc") and see if there is a strong relationship between transaction and revenues and different sources.
averageRevenueSource <- aggregate(myData$Revenue,list(myData$Source), FUN=sum)
colnames(averageRevenueSource) <- c("Source","Revenues")
averageRevenueSource <- averageRevenueSource[order(averageRevenueSource[,2],averageRevenueSource[,1],decreasing=TRUE),]
rownames(averageRevenueSource) <- c(1:nrow(averageRevenueSource))
knitr::kable(averageRevenueSource[1:10,])
Source | Revenues |
---|---|
(direct) / (none) | 2951215.28 |
google / organic | 2793075.46 |
google / cpc | 1466490.42 |
dfa / cpm | 430017.88 |
dfa / cpc | 226526.63 |
bing / organic | 186316.63 |
email / email | 109285.60 |
yahoo / organic | 92137.69 |
olehotels.com / referral | 69812.64 |
adquiramexico.com.mx / referral | 65988.00 |
Focusing on top 10 sources, we can see that direct access and organic searches on google or bing provide more than 4 times the revenues of payed sources like "cpc" and "cpm".
From previous analysis, we can suggest the following strategy:
- reduce spending in payed sources;
- provide app/mobile portal to facilitate the access from tablet;