Data Analysis
Introduction
This report presents a data analysis carried out with the R statistical package. The data set contains five variables: sales, calls, years, time, and type. The analysis summarizes the data and examines each variable on its own using graphical and numerical techniques. The graphical tools considered include the stem-and-leaf diagram, frequency table, histogram, boxplot, dot plot, pie chart, and bar graph, among others. For the quantitative variables, the appropriate measures of central tendency, measures of dispersion, and the shape of the distribution are reported, together with the five-number summary (Min, Q1, Median, Q3, Max) where appropriate.
Relationships between the variables are also analyzed. Variables are paired to check whether any relationship exists, using both numerical summary measures and graphs, so that the variables that are related can be distinguished from those that are not. In the analysis the variables represent the following:
Sales: represents the number of sales this week
Calls: represents the number of sales calls made this week
Time: represents the average time per call this week
Years: represents years of experience in the call center
Type: represents the type of training the employee received.
Presentation of the Results and Inferences
Numerical summary for Sales
Sales | |
Mean | 43.35 |
Min | 21 |
Q1 | 41.75 |
Median | 44 |
Q3 | 47 |
Max | 54 |
Standard Deviation | 6.2 |
Variance | 38.45 |
Numerical summary for Calls
Calls | |
Mean | 158.9 |
Min | 116 |
Q1 | 146 |
Median | 157.5 |
Q3 | 173.2 |
Max | 198 |
Standard Deviation | 18.5 |
Variance | 342.14 |
Numerical summary for Time
Time | |
Mean | 15.53 |
Min | 10 |
Q1 | 13.2 |
Median | 15.35 |
Q3 | 17.55 |
Max | 22 |
Standard Deviation | 2.72 |
Variance | 7.31 |
Numerical summary for Years
Years | |
Mean | 2.22 |
Min | 0 |
Q1 | 1 |
Median | 2 |
Q3 | 3 |
Max | 5 |
Standard Deviation | 1.26 |
Variance | 1.59 |
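The measures reported in the summary tables could be collected with a small helper along the following lines. This is a minimal sketch: the function name `num_summary` and the example vector are illustrative assumptions, not part of the original analysis, and `quantile()` is used with R's default method, which can differ slightly from other quartile conventions.

```r
# Hypothetical helper: returns the measures listed in the summary tables
num_summary <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)
  c(Mean = mean(x), Min = min(x), Q1 = q[1], Median = q[2], Q3 = q[3],
    Max = max(x), SD = sd(x), Variance = var(x))
}

# Illustrative values only (not the project data)
example <- c(2, 4, 4, 5, 7, 9, 11)
res <- num_summary(example)
print(round(res, 2))
```

Applied to an attached column such as `Sales (Y)`, the same call would return all eight measures at once instead of requiring separate calls to `summary()`, `sd()`, and `var()`.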
Type | |
Mode | Online |
Frequency | |
online | 46 |
group | 41 |
none | 13 |
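The mode and the relative frequencies follow directly from the counts in the table. The sketch below re-enters the reported counts by hand; the variable names and the axis label are assumptions made for illustration.

```r
# Counts as reported in the frequency table for Type
type_counts <- c(online = 46, group = 41, none = 13)

# Relative frequencies (share of all observations)
type_props <- prop.table(type_counts)
print(round(type_props, 2))

# Mode: the category with the largest count
type_mode <- names(which.max(type_counts))
print(type_mode)

# Bar graph of the counts (axis label is an assumption)
barplot(type_counts, main = "Type of training", ylab = "Frequency", col = "steelblue")
```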
Interpretation
The boxplots above show the following:
All three variables (sales, calls, and time) are roughly symmetric, and in that sense close to normal: the median line sits near the center of each box and the whiskers are of similar length on both sides.
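The calls and time data contain no outliers: every observation lies within the whiskers, with none unusually high or low. For sales, the reported minimum of 21 lies well below the lower fence implied by the reported quartiles, 41.75 - 1.5 x (47 - 41.75) = 33.875, so it would be flagged as a low outlier under the usual 1.5 x IQR rule and should be checked against the plot.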
The numerical summaries report the summary statistics of the quantitative variables: the mean, median, upper quartile, lower quartile, variance, standard deviation, maximum, and minimum. These help the reader gain a clear understanding of the data.
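The usual 1.5 x IQR rule behind boxplot outlier flags can be checked against the reported quartiles. The sketch below re-enters the values from the summary tables by hand. It assumes the reported quartiles are close to the hinges that R's `boxplot()` actually uses, so it is a rough check rather than a reproduction of the plots.

```r
# Quartiles and extremes as reported in the summary tables
reported <- data.frame(
  variable = c("Sales", "Calls", "Time", "Years"),
  min = c(21, 116, 10, 0),
  q1  = c(41.75, 146, 13.2, 1),
  q3  = c(47, 173.2, 17.55, 3),
  max = c(54, 198, 22, 5)
)

# Fences at 1.5 x IQR beyond the quartiles
iqr <- reported$q3 - reported$q1
reported$lower_fence <- reported$q1 - 1.5 * iqr
reported$upper_fence <- reported$q3 + 1.5 * iqr

# TRUE where a reported extreme falls outside its fence
reported$min_flagged <- reported$min < reported$lower_fence
reported$max_flagged <- reported$max > reported$upper_fence
print(reported)
```

With the reported values, only the sales minimum of 21 falls outside its fence (33.875); the extremes of calls, time, and years all lie inside theirs.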
Scatter Plot for Sales against Calls.
The scatter diagram of sales against calls shows a positive linear relationship, with the points spread evenly on either side of the fitted line. The correlation is positive and close to one, which indicates that each variable is a good linear predictor of the other. The fitted OLS line runs through the middle of the evenly spread points, which supports this relationship. Therefore, as the number of calls increases, sales also increase, and as the number of calls decreases, sales decrease.
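The link between the correlation coefficient and the OLS line can be illustrated in base R. The data below are synthetic and generated only for this sketch (they are not the project data), and the variable names are assumptions.

```r
# Synthetic data for illustration only (not the project data)
set.seed(1)
calls <- round(runif(50, 120, 200))
sales <- 10 + 0.2 * calls + rnorm(50, sd = 3)

# Pearson correlation coefficient and its significance test
r <- cor(calls, sales)
test <- cor.test(calls, sales, method = "pearson")

# Ordinary least squares fit of sales on calls
fit <- lm(sales ~ calls)
slope <- coef(fit)[["calls"]]

# In simple linear regression, R-squared equals the squared correlation
r_squared <- summary(fit)$r.squared

# Scatter plot with the fitted line
plot(calls, sales, xlab = "Calls", ylab = "Sales", main = "Sales against Calls")
abline(fit, col = "blue")
```

The sign of the slope always matches the sign of r, and the p-value in `test` indicates whether the observed correlation is distinguishable from zero.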
Boxplot by Type is as follows
Interpretation
The boxplots above show no outliers in the variables, with all data points falling within the whiskers. They also indicate the degree of skewness of each variable and allow sales, calls, and time to be compared across the types. The online type accounts for the largest number of observations, followed by the group type and then the none type, which has the fewest in each panel. For time, the none type had the longest whiskers, followed by the group type and finally the online type. For sales, the online type had the longest whiskers. On this basis, the variables appear to be related to one another.
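A comparison by type can also be backed by per-group numerical summaries. The sketch below uses synthetic data generated only for illustration (the group sizes mirror the frequency table, but the sales values are not the project data), and the object names are assumptions.

```r
# Synthetic data for illustration only (not the project data)
set.seed(2)
demo <- data.frame(
  Type  = factor(rep(c("online", "group", "none"), times = c(46, 41, 13)),
                 levels = c("online", "group", "none")),
  Sales = round(rnorm(100, mean = 43, sd = 6))
)

# Five-number summary of Sales within each type
by_type <- tapply(demo$Sales, demo$Type, fivenum)
print(by_type)

# Group medians, the center line of each box
medians <- tapply(demo$Sales, demo$Type, median)
print(medians)

# Side-by-side boxplots in base R
boxplot(Sales ~ Type, data = demo, ylab = "Sales", main = "Sales by type")
```

Reading the group medians and whisker ranges from such a table makes statements about which type has the higher values or the wider spread easier to verify than reading them off the plot alone.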
Conclusion
The modal type is online, which has the highest count in the frequency table. The boxplots for calls and time show no outliers, so extreme values are unlikely to distort their summary statistics; the reported sales minimum of 21 sits far below the lower quartile of 41.75 and is worth checking. The scatter plot of sales against calls shows a positive linear relationship. The fitted OLS line fits well, with roughly equal numbers of points on either side, and the r-value is close to one, indicating a strong positive correlation. This suggests that the number of sales and the number of calls increase together and decrease together. The boxplots by type show no outliers, suggesting that no observations are unusually high or low within each type.
The appendix presents the R code used.
library(readxl)
Project_Data_SALESCALL_2_1_ <- read_excel("C:/Users/brayo-onyas/Desktop/New folder (11)/Project_Data_SALESCALL_2 (1).xlsx")
View(Project_Data_SALESCALL_2_1_)
# Attach the data frame so that its columns can be referenced by name
attach(Project_Data_SALESCALL_2_1_)
# Sales: horizontal boxplot (values run along the x-axis) and numerical summary
boxplot(`Sales (Y)`, xlab = "Values", main = "Boxplot of the Sales Data", col = "yellow", horizontal = TRUE)
summary(`Sales (Y)`)
sd(`Sales (Y)`)
var(`Sales (Y)`)
# Calls: horizontal boxplot and numerical summary
boxplot(`Calls (X1)`, xlab = "Values", main = "Boxplot of the Calls Data", col = "purple", horizontal = TRUE)
summary(`Calls (X1)`)
sd(`Calls (X1)`)
var(`Calls (X1)`)
# Time: horizontal boxplot and numerical summary
boxplot(`Time (X2)`, xlab = "Values", main = "Boxplot of the Time Data", col = "blue", horizontal = TRUE)
summary(`Time (X2)`)
sd(`Time (X2)`)
var(`Time (X2)`)
library("ggpubr")
# Scatter plot of Sales against Calls; column names are passed as quoted strings
ggscatter(Project_Data_SALESCALL_2_1_, x = "Calls (X1)", y = "Sales (Y)",
          add = "reg.line",    # Add regression line
          conf.int = TRUE,     # Add confidence interval
          add.params = list(color = "blue", fill = "lightgray")
) +
  stat_cor(method = "pearson") # Add correlation coefficient
# Pearson correlation between Sales and Calls
cor(`Sales (Y)`, `Calls (X1)`)
# Boxplots of Sales, Calls, and Time grouped by Type, with jittered points
ggboxplot(Project_Data_SALESCALL_2_1_, x = "Type",
          y = c("Sales (Y)", "Calls (X1)", "Time (X2)"),
          combine = TRUE, color = "Type", palette = "jco", ylab = "Values",
          add = "jitter", add.params = list(size = 0.1, jitter = 0.2),
          label = "Years (X3)", label.select = list(top.up = 1, top.down = 1),
          font.label = list(size = 9, face = "italic"), repel = TRUE)
# Years: numerical summary
summary(`Years (X3)`)
sd(`Years (X3)`)
var(`Years (X3)`)
# Type: frequency table
table(Type)