By Salerno | December 23, 2019
Random Forest
In this post, we will explore some ideas around the Random Forest model.
Objective
We are working with the dataset called Boston Housing. The main idea here is a regression task: we are concerned with modeling the price of houses, in thousands of dollars, in the suburbs of Boston.
So, we are getting our hands dirty with a regression predictive modeling problem.
The main goal here is to fit a regression model that best explains the variation in the medv variable.
Data
In terms of data, we are using a file from UCI whose content concerns housing values in the suburbs of Boston.
# to get the data
BHData <- read.table(url("https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data"), sep = "")
For our study we are working with 506 rows (observations) and 14 columns. One of them, called medv, is our target value (y), or response variable.
# knowing the dimension of the data
dim(BHData)
## [1] 506 14
Set names of the dataset
# changing the variable's names
names(BHData)<- c("crim","zn","indus","chas","nox","rm",
"age","dis","rad","tax","ptratio","black","lstat","medv")
EDA (Exploratory Data Analysis)
As a first task, we usually perform some Exploratory Data Analysis to understand how the data is distributed and to extract preliminary knowledge.
# structure
str(BHData)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
As you can see from the summary() function below, the variables have very different ranges. Such large differences in scale between the predictor variables can negatively affect the model.
Since we want to improve the predictive accuracy of our model, we should not allow this large difference in the ranges of the variables to harm the task of predicting the medv variable.
You will see the adequate treatment in the Pre-processing topic.
summary(BHData)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Pre-processing
Remove outliers
#cut-off values are given by the formal definition of an outlier:
#Q3 + 1.5*IQR
upper_cut_off1 <- (quantile(BHData$crim, 0.75)) + (IQR(BHData$crim))*1.5
upper_cut_off1
## 75%
## 9.069639
upper_cut_off2 <- (quantile(BHData$zn, 0.75)) + (IQR(BHData$zn))*1.5
upper_cut_off2
## 75%
## 31.25
upper_cut_off3 <- (quantile(BHData$indus, 0.75)) + (IQR(BHData$indus))*1.5
upper_cut_off3
## 75%
## 37.465
upper_cut_off4 <- (quantile(BHData$chas, 0.75)) + (IQR(BHData$chas))*1.5
upper_cut_off4
## 75%
## 0
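The four cut-offs above all follow the same Q3 + 1.5*IQR pattern, so they could be computed for every column in one call. A small sketch with a hypothetical helper name (`upper_cutoffs` is not part of the post) on toy data:

```r
# Hypothetical helper: Q3 + 1.5*IQR upper cut-off for every column at once
upper_cutoffs <- function(df) {
  sapply(df, function(x) quantile(x, 0.75, names = FALSE) + 1.5 * IQR(x))
}

# toy data frame just to illustrate the call
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 20, 30, 40, 50))
upper_cutoffs(toy)  # a: 7, b: 70
```

On BHData you would call it the same way and get one named cut-off per column.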
Feature Scaling
An important step, called feature scaling, puts all the data on the same scale, in the range [0,1]. The method chosen here is also known as normalization (min-max scaling).
# calculating the maximun in each column
max_data <- apply(BHData, 2, max)
# calculating the minimun in each column
min_data <- apply(BHData, 2, min)
# applying the normalization
BHDataScaled <- as.data.frame(scale(BHData,center = min_data,
scale = max_data - min_data))
To confirm the normalization process:
summary(BHDataScaled)
## crim zn indus chas
## Min. :0.0000000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.0008511 1st Qu.:0.0000 1st Qu.:0.1734 1st Qu.:0.00000
## Median :0.0028121 Median :0.0000 Median :0.3383 Median :0.00000
## Mean :0.0405441 Mean :0.1136 Mean :0.3914 Mean :0.06917
## 3rd Qu.:0.0412585 3rd Qu.:0.1250 3rd Qu.:0.6466 3rd Qu.:0.00000
## Max. :1.0000000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## nox rm age dis
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.00000
## 1st Qu.:0.1317 1st Qu.:0.4454 1st Qu.:0.4338 1st Qu.:0.08826
## Median :0.3148 Median :0.5073 Median :0.7683 Median :0.18895
## Mean :0.3492 Mean :0.5219 Mean :0.6764 Mean :0.24238
## 3rd Qu.:0.4918 3rd Qu.:0.5868 3rd Qu.:0.9390 3rd Qu.:0.36909
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.00000
## rad tax ptratio black
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1304 1st Qu.:0.1756 1st Qu.:0.5106 1st Qu.:0.9457
## Median :0.1739 Median :0.2729 Median :0.6862 Median :0.9862
## Mean :0.3717 Mean :0.4222 Mean :0.6229 Mean :0.8986
## 3rd Qu.:1.0000 3rd Qu.:0.9141 3rd Qu.:0.8085 3rd Qu.:0.9983
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
## lstat medv
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.1440 1st Qu.:0.2672
## Median :0.2657 Median :0.3600
## Mean :0.3014 Mean :0.3896
## 3rd Qu.:0.4201 3rd Qu.:0.4444
## Max. :1.0000 Max. :1.0000
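Instead of reading the whole summary, the normalization can also be checked programmatically. A minimal sketch on toy data, using the same min-max idea as the center/scale call above:

```r
# min-max normalization: (x - min) / (max - min)
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

toy <- data.frame(u = c(2, 4, 6), v = c(-1, 0, 3))
toyScaled <- as.data.frame(lapply(toy, minmax))

# after normalization every column should span exactly [0, 1]
all(sapply(toyScaled, min) == 0) && all(sapply(toyScaled, max) == 1)  # TRUE
```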
boxplot(BHDataScaled)
According to the graph above, several variables have outliers, but the crim predictor variable has the largest number of them.
CorBHData<-cor(BHDataScaled)
library(corrplot)
## corrplot 0.84 loaded
corrplot(CorBHData, method = "pie",type="lower")
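Besides the visual corrplot, the same correlation matrix can be queried numerically, for example to rank predictors by their absolute correlation with the response. A sketch on toy data (hypothetical variable names x1, x2, y):

```r
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 2 * toy$x1 + rnorm(50, sd = 0.1)  # y depends strongly on x1 only

# correlation of every predictor with the response, strongest first
cors <- cor(toy)[, "y"]
sort(abs(cors[names(cors) != "y"]), decreasing = TRUE)
```

Applied to CorBHData, the column for medv would give the same kind of ranking.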
Multiple Linear Model Fitting
LModel1<-lm(medv~.,data=BHDataScaled)
summary(LModel1)
##
## Call:
## lm(formula = medv ~ ., data = BHDataScaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34654 -0.06066 -0.01151 0.03949 0.58221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.480450 0.052843 9.092 < 2e-16 ***
## crim -0.213550 0.064978 -3.287 0.001087 **
## zn 0.103157 0.030505 3.382 0.000778 ***
## indus 0.012463 0.037280 0.334 0.738288
## chas 0.059705 0.019146 3.118 0.001925 **
## nox -0.191879 0.041253 -4.651 4.25e-06 ***
## rm 0.441860 0.048470 9.116 < 2e-16 ***
## age 0.001494 0.028504 0.052 0.958229
## dis -0.360592 0.048742 -7.398 6.01e-13 ***
## rad 0.156425 0.033910 4.613 5.07e-06 ***
## tax -0.143629 0.043789 -3.280 0.001112 **
## ptratio -0.199018 0.027328 -7.283 1.31e-12 ***
## black 0.082063 0.023671 3.467 0.000573 ***
## lstat -0.422605 0.040843 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1055 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
Pred1 <- predict(LModel1)
mse1 <- mean((BHDataScaled$medv - Pred1)^2)
mse1
## [1] 0.01081226
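Note that mse1 is an in-sample error, computed on the same rows used to fit the model, so it tends to be optimistic. A hedged sketch of a hold-out estimate on toy data (the same row-splitting idea applies to BHDataScaled):

```r
set.seed(42)
n <- 100
toy <- data.frame(x = runif(n))
toy$y <- 3 * toy$x + rnorm(n, sd = 0.2)

idx  <- sample(n, size = 0.8 * n)          # 80% of rows for training
fit  <- lm(y ~ x, data = toy[idx, ])       # fit on the training split only
pred <- predict(fit, newdata = toy[-idx, ])

holdout_mse <- mean((toy$y[-idx] - pred)^2)  # error on unseen rows
holdout_mse
```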
plot(BHDataScaled[,14],Pred1,
xlab="Actual",ylab="Predicted")
abline(a=0,b=1)
par(mfrow=c(2,2))
plot(LModel1)
Random Forest Regression Model
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
RFModel=randomForest(medv ~ . , data = BHDataScaled)
RFModel
##
## Call:
## randomForest(formula = medv ~ ., data = BHDataScaled)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.00476609
## % Var explained: 88.57
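The "Mean of squared residuals" printed above is an out-of-bag (OOB) error: each observation is predicted only by the trees that did not see it during bootstrap sampling. A small self-contained sketch on toy data of how that number can be reproduced:

```r
library(randomForest)

set.seed(7)
toy <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- sin(2 * pi * toy$x1) + rnorm(200, sd = 0.1)

rf <- randomForest(y ~ ., data = toy, ntree = 300)

# predict() with no newdata returns the out-of-bag predictions,
# so this mirrors the "Mean of squared residuals" in the printout
oob_mse <- mean((toy$y - predict(rf))^2)
oob_mse
```

This is why the forest's reported error is usually a more honest estimate than the in-sample MSE of the linear model.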
summary(RFModel)
## Length Class Mode
## call 3 -none- call
## type 1 -none- character
## predicted 506 -none- numeric
## mse 500 -none- numeric
## rsq 500 -none- numeric
## oob.times 506 -none- numeric
## importance 13 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 11 -none- list
## coefs 0 -none- NULL
## y 506 -none- numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## terms 3 terms call
plot(RFModel)
VarImp<-importance(RFModel)
VarImp<-as.matrix(VarImp[order(VarImp[,1], decreasing = TRUE),])
VarImp
## [,1]
## lstat 6.0574021
## rm 5.7940364
## indus 1.4530514
## nox 1.3531538
## ptratio 1.3171112
## dis 1.2957303
## crim 1.2250947
## tax 0.7204445
## age 0.5578364
## black 0.3748653
## zn 0.1662593
## rad 0.1450710
## chas 0.1365377
varImpPlot(RFModel)
Pred2 <- predict(RFModel)
plot(BHDataScaled[,14],Pred2,
xlab="Actual",ylab="Predicted")
abline(a=0,b=1)
Robust Linear Regression Model
library(MASS)
LModel2 <- rlm(BHDataScaled$medv ~ ., data = BHDataScaled, psi = psi.hampel, init = "lts")
LModel2
## Call:
## rlm(formula = BHDataScaled$medv ~ ., data = BHDataScaled, psi = psi.hampel,
## init = "lts")
## Converged in 10 iterations
##
## Coefficients:
## (Intercept) crim zn indus chas nox
## 0.285874828 -0.192440257 0.069292252 0.003634696 0.029497415 -0.109020365
## rm age dis rad tax ptratio
## 0.680775604 -0.069407410 -0.275014795 0.093236680 -0.145930494 -0.174170551
## black lstat
## 0.095984860 -0.216404818
##
## Degrees of freedom: 506 total; 492 residual
## Scale estimate: 0.0708
summary(LModel2)
##
## Call: rlm(formula = BHDataScaled$medv ~ ., data = BHDataScaled, psi = psi.hampel,
## init = "lts")
## Residuals:
## Min 1Q Median 3Q Max
## -0.347670 -0.046918 -0.009419 0.048950 0.767535
##
## Coefficients:
## Value Std. Error t value
## (Intercept) 0.2859 0.0408 7.0063
## crim -0.1924 0.0502 -3.8356
## zn 0.0693 0.0236 2.9418
## indus 0.0036 0.0288 0.1263
## chas 0.0295 0.0148 1.9953
## nox -0.1090 0.0319 -3.4226
## rm 0.6808 0.0374 18.1900
## age -0.0694 0.0220 -3.1536
## dis -0.2750 0.0376 -7.3073
## rad 0.0932 0.0262 3.5609
## tax -0.1459 0.0338 -4.3160
## ptratio -0.1742 0.0211 -8.2540
## black 0.0960 0.0183 5.2515
## lstat -0.2164 0.0315 -6.8621
##
## Residual standard error: 0.07084 on 492 degrees of freedom
LM1Coef <- coef(LModel1)
LM2Coef <- coef(LModel2)
plot(BHDataScaled$medv, BHDataScaled$lstat)
abline(coef=LM1Coef)
plot(BHDataScaled$medv, BHDataScaled$lstat)
abline(coef=LM2Coef)
boxplot(BHDataScaled$crim)$out
## [1] 0.1519152 0.1036978 0.1247813 0.2078443 0.2203305 0.1717624 0.1103426
## [8] 0.2657290 0.2007464 1.0000000 0.1783534 0.1031889 0.2256784 0.1888895
## [15] 0.2741094 0.2539149 0.1610363 0.1300618 0.1500899 0.4309939 0.1113886
## [22] 0.2814411 0.1599404 0.1077824 0.2786941 0.4667072 0.7633424 0.2327741
## [29] 0.1342564 0.1622120 0.5746830 0.1578554 0.2113601 0.3220132 0.5141041
## [36] 0.2031955 0.1217028 0.2914951 0.8264345 0.1326964 0.1245487 0.1353478
## [43] 0.1781949 0.1375845 0.4232396 0.1048958 0.1130268 0.1563122 0.1253692
## [50] 0.1620153 0.1705170 0.1536675 0.1054774 0.2477780 0.1092264 0.1119505
## [57] 0.1438237 0.1198774 0.1114819 0.1047857 0.1068599 0.1749961 0.1468899
## [64] 0.1687884 0.1149454 0.1610363
outliers <- boxplot(BHDataScaled$crim, plot=FALSE)$out
print(outliers)
## [1] 0.1519152 0.1036978 0.1247813 0.2078443 0.2203305 0.1717624 0.1103426
## [8] 0.2657290 0.2007464 1.0000000 0.1783534 0.1031889 0.2256784 0.1888895
## [15] 0.2741094 0.2539149 0.1610363 0.1300618 0.1500899 0.4309939 0.1113886
## [22] 0.2814411 0.1599404 0.1077824 0.2786941 0.4667072 0.7633424 0.2327741
## [29] 0.1342564 0.1622120 0.5746830 0.1578554 0.2113601 0.3220132 0.5141041
## [36] 0.2031955 0.1217028 0.2914951 0.8264345 0.1326964 0.1245487 0.1353478
## [43] 0.1781949 0.1375845 0.4232396 0.1048958 0.1130268 0.1563122 0.1253692
## [50] 0.1620153 0.1705170 0.1536675 0.1054774 0.2477780 0.1092264 0.1119505
## [57] 0.1438237 0.1198774 0.1114819 0.1047857 0.1068599 0.1749961 0.1468899
## [64] 0.1687884 0.1149454 0.1610363
BHDataScaled[which(BHDataScaled$crim %in% outliers),]
## crim zn indus chas nox rm age dis rad
## 368 0.1519152 0 0.6466276 0 0.5061728 0.05786549 1.0000000 0.0346461275 1
## 372 0.1036978 0 0.6466276 0 0.5061728 0.50871815 1.0000000 0.0035919214 1
## 374 0.1247813 0 0.6466276 0 0.5823045 0.25771221 1.0000000 0.0040556884 1
## 375 0.2078443 0 0.6466276 0 0.5823045 0.11055758 1.0000000 0.0006729169 1
## 376 0.2203305 0 0.6466276 0 0.5884774 0.71891167 0.9783728 0.0169775118 1
## 377 0.1717624 0 0.6466276 0 0.5884774 0.59168423 0.9309990 0.0195782448 1
## 378 0.1103426 0 0.6466276 0 0.5884774 0.61946733 0.9876416 0.0207694896 1
## 379 0.2657290 0 0.6466276 0 0.5884774 0.54014179 0.9608651 0.0233247552 1
## 380 0.2007464 0 0.6466276 0 0.5884774 0.51005940 1.0000000 0.0233247552 1
## 381 1.0000000 0 0.6466276 0 0.5884774 0.65280705 0.9165808 0.0260891706 1
## 382 0.1783534 0 0.6466276 0 0.5884774 0.57175704 0.9907312 0.0354281661 1
## 383 0.1031889 0 0.6466276 0 0.6481481 0.37842499 1.0000000 0.0409933709 1
## 385 0.2256784 0 0.6466276 0 0.6481481 0.15462732 0.9093718 0.0281806691 1
## 386 0.1888895 0 0.6466276 0 0.6481481 0.32879862 0.9804325 0.0269621439 1
## 387 0.2741094 0 0.6466276 0 0.6481481 0.20904388 1.0000000 0.0306995608 1
## 388 0.2539149 0 0.6466276 0 0.6481481 0.27572332 0.8918641 0.0353554183 1
## 389 0.1610363 0 0.6466276 0 0.6481481 0.25273041 1.0000000 0.0418208768 1
## 393 0.1300618 0 0.6466276 0 0.6481481 0.28262119 0.9691040 0.0582345934 1
## 395 0.1500899 0 0.6466276 0 0.6337449 0.44567925 0.9454171 0.0593349035 1
## 399 0.4309939 0 0.6466276 0 0.6337449 0.36252156 1.0000000 0.0327364985 1
## 400 0.1113886 0 0.6466276 0 0.6337449 0.43897298 0.7713697 0.0337185934 1
## 401 0.2814411 0 0.6466276 0 0.6337449 0.46484001 1.0000000 0.0417572225 1
## 402 0.1599404 0 0.6466276 0 0.6337449 0.53305231 1.0000000 0.0404204821 1
## 403 0.1077824 0 0.6466276 0 0.6337449 0.54474037 1.0000000 0.0463221453 1
## 404 0.2786941 0 0.6466276 0 0.6337449 0.34259437 0.9588054 0.0521237803 1
## 405 0.4667072 0 0.6466276 0 0.6337449 0.37746695 0.8496395 0.0434486082 1
## 406 0.7633424 0 0.6466276 0 0.6337449 0.40659130 1.0000000 0.0268984896 1
## 407 0.2327741 0 0.6466276 0 0.5637860 0.11055758 1.0000000 0.0044103338 1
## 408 0.1342564 0 0.6466276 0 0.5637860 0.39222073 1.0000000 0.0141494421 1
## 410 0.1622120 0 0.6466276 0 0.4362140 0.63058057 1.0000000 0.0305449718 1
## 411 0.5746830 0 0.6466276 0 0.4362140 0.42077026 1.0000000 0.0257708991 1
## 412 0.1578554 0 0.6466276 0 0.4362140 0.59321709 1.0000000 0.0361829243 1
## 413 0.2113601 0 0.6466276 0 0.4362140 0.20444530 1.0000000 0.0385836008 1
## 414 0.3220132 0 0.6466276 0 0.4362140 0.30542249 1.0000000 0.0418117833 1
## 415 0.5141041 0 0.6466276 0 0.6337449 0.18356007 1.0000000 0.0480680919 1
## 416 0.2031955 0 0.6466276 0 0.6049383 0.55048860 1.0000000 0.0641180696 1
## 417 0.1217028 0 0.6466276 0 0.6049383 0.61716804 0.9052523 0.0627358619 1
## 418 0.2914951 0 0.6466276 0 0.6049383 0.33397203 0.8877446 0.0470950904 1
## 419 0.8264345 0 0.6466276 0 0.6049383 0.45909178 1.0000000 0.0611990652 1
## 420 0.1326964 0 0.6466276 0 0.6851852 0.62521556 0.7579815 0.0604170266 1
## 421 0.1245487 0 0.6466276 0 0.6851852 0.54608162 1.0000000 0.0663186898 1
## 423 0.1353478 0 0.6466276 0 0.4711934 0.39988504 0.8722966 0.0747119643 1
## 426 0.1781949 0 0.6466276 0 0.6049383 0.44740372 0.9526262 0.0709290800 1
## 427 0.1375845 0 0.6466276 0 0.4094650 0.43609887 0.5849640 0.0789313352 1
## 428 0.4232396 0 0.6466276 0 0.6049383 0.50603564 0.7806385 0.0666824287 1
## 430 0.1048958 0 0.6466276 0 0.6049383 0.54014179 0.9546859 0.0762578545 1
## 432 0.1130268 0 0.6466276 0 0.4094650 0.62694003 0.9412976 0.0871700206 1
## 435 0.1563122 0 0.6466276 0 0.6748971 0.50718528 0.9485067 0.0993552728 1
## 436 0.1253692 0 0.6466276 0 0.7304527 0.58785208 0.9443872 0.0904891378 1
## 437 0.1620153 0 0.6466276 0 0.7304527 0.55566200 0.9309990 0.0793860088 1
## 438 0.1705170 0 0.6466276 0 0.7304527 0.49645526 1.0000000 0.0713473797 1
## 439 0.1536675 0 0.6466276 0 0.7304527 0.45487641 0.8753862 0.0628358901 1
## 440 0.1054774 0 0.6466276 0 0.7304527 0.39586128 0.9371782 0.0625267121 1
## 441 0.2477780 0 0.6466276 0 0.7304527 0.43245833 0.9217302 0.0669825133 1
## 442 0.1092264 0 0.6466276 0 0.7304527 0.54512359 0.9711637 0.0850694287 1
## 444 0.1119505 0 0.6466276 0 0.7304527 0.56026059 1.0000000 0.0771853886 1
## 445 0.1438237 0 0.6466276 0 0.7304527 0.43935620 0.9649846 0.0696559940 1
## 446 0.1198774 0 0.6466276 0 0.7304527 0.55527879 0.9464470 0.0780492684 1
## 448 0.1114819 0 0.6466276 0 0.7304527 0.51542441 0.9649846 0.0971546527 1
## 449 0.1047857 0 0.6466276 0 0.6748971 0.50277831 0.9866117 0.1029381007 1
## 455 0.1068599 0 0.6466276 0 0.6748971 0.60682123 0.9392379 0.1242622921 1
## 469 0.1749961 0 0.6466276 0 0.4012346 0.45315194 0.7013388 0.1617546763 1
## 470 0.1468899 0 0.6466276 0 0.4012346 0.41233953 0.5540680 0.1540525057 1
## 478 0.1687884 0 0.6466276 0 0.4711934 0.33397203 0.9721936 0.0883067046 1
## 479 0.1149454 0 0.6466276 0 0.4711934 0.50277831 0.9660144 0.0946539479 1
## 480 0.1610363 0 0.6466276 0 0.4711934 0.51120904 0.8764161 0.0747119643 1
## tax ptratio black lstat medv
## 368 0.9141221 0.8085106 0.330576428 0.3200883 0.40222222
## 372 0.9141221 0.8085106 0.922462051 0.2152318 1.00000000
## 374 0.9141221 0.8085106 1.000000000 0.9116998 0.19555556
## 375 0.9141221 0.8085106 1.000000000 1.0000000 0.19555556
## 376 0.9141221 0.8085106 1.000000000 0.3231236 0.22222222
## 377 0.9141221 0.8085106 0.914569570 0.5935430 0.19777778
## 378 0.9141221 0.8085106 1.000000000 0.5383554 0.18444444
## 379 0.9141221 0.8085106 1.000000000 0.6059603 0.18000000
## 380 0.9141221 0.8085106 0.992031873 0.5532561 0.11555556
## 381 0.9141221 0.8085106 1.000000000 0.4271523 0.12000000
## 382 0.9141221 0.8085106 1.000000000 0.5339404 0.13111111
## 383 0.9141221 0.8085106 1.000000000 0.6034768 0.14000000
## 385 0.9141221 0.8085106 0.719930405 0.7974614 0.08444444
## 386 0.9141221 0.8085106 1.000000000 0.8024283 0.04888889
## 387 0.9141221 0.8085106 1.000000000 0.7326159 0.12222222
## 388 0.9141221 0.8085106 1.000000000 0.8349890 0.05333333
## 389 0.9141221 0.8085106 0.939533007 0.7971854 0.11555556
## 393 0.9141221 0.8085106 1.000000000 0.6608720 0.10444444
## 395 0.9141221 0.8085106 1.000000000 0.4034216 0.17111111
## 399 0.9141221 0.8085106 1.000000000 0.7963576 0.00000000
## 400 0.9141221 0.8085106 0.851883605 0.7792494 0.02888889
## 401 0.9141221 0.8085106 1.000000000 0.6909492 0.01333333
## 402 0.9141221 0.8085106 1.000000000 0.5129691 0.04888889
## 403 0.9141221 0.8085106 0.947576781 0.5126932 0.15777778
## 404 0.9141221 0.8085106 1.000000000 0.4977925 0.07333333
## 405 0.9141221 0.8085106 0.829946039 0.7077815 0.07777778
## 406 0.9141221 0.8085106 0.969917797 0.5863687 0.00000000
## 407 0.9141221 0.8085106 0.932724797 0.5963024 0.15333333
## 408 0.9141221 0.8085106 0.836577740 0.2869757 0.50888889
## 410 0.9141221 0.8085106 0.451459983 0.4980684 0.50000000
## 411 0.9141221 0.8085106 0.005749155 0.2312362 0.22222222
## 412 0.9141221 0.8085106 0.087573756 0.5378035 0.27111111
## 413 0.9141221 0.8085106 0.071788794 0.9006623 0.28666667
## 414 0.9141221 0.8085106 0.531166473 0.5063466 0.25111111
## 415 0.9141221 0.8085106 0.221771143 0.9726821 0.04444444
## 416 0.9141221 0.8085106 0.067905593 0.7538631 0.04888889
## 417 0.9141221 0.8085106 0.053583136 0.6639073 0.05555556
## 418 0.9141221 0.8085106 0.320338898 0.6873620 0.12000000
## 419 0.9141221 0.8085106 0.040672752 0.5212472 0.08444444
## 420 0.9141221 0.8085106 0.121362651 0.5797461 0.07555556
## 421 0.9141221 0.8085106 0.802940138 0.3667219 0.26000000
## 423 0.9141221 0.8085106 0.734353724 0.3413355 0.35111111
## 426 0.9141221 0.8085106 0.018558677 0.6252759 0.07333333
## 427 0.9141221 0.8085106 0.061349539 0.3852097 0.11555556
## 428 0.9141221 0.8085106 0.046648848 0.3529249 0.13111111
## 430 0.9141221 0.8085106 0.152302184 0.6167219 0.10000000
## 432 0.9141221 0.8085106 0.204271522 0.4955850 0.20222222
## 435 0.9141221 0.8085106 0.252937617 0.3708609 0.14888889
## 436 0.9141221 0.8085106 0.276186394 0.5943709 0.18666667
## 437 0.9141221 0.8085106 0.068510767 0.4503311 0.10222222
## 438 0.9141221 0.8085106 0.022694034 0.6821192 0.08222222
## 439 0.9141221 0.8085106 0.173054617 0.8910044 0.07555556
## 440 0.9141221 0.8085106 1.000000000 0.5836093 0.17333333
## 441 0.9141221 0.8085106 0.986257502 0.5623620 0.12222222
## 442 0.9141221 0.8085106 0.972414141 0.4908940 0.26888889
## 444 0.9141221 0.8085106 0.974355742 0.4724062 0.23111111
## 445 0.9141221 0.8085106 0.605678552 0.6087196 0.12888889
## 446 0.9141221 0.8085106 0.107771446 0.6139625 0.15111111
## 448 0.9141221 0.8085106 0.978869333 0.4059051 0.16888889
## 449 0.9141221 0.8085106 1.000000000 0.4525386 0.20222222
## 455 0.9141221 0.8085106 0.016037117 0.4685430 0.22000000
## 469 0.9141221 0.8085106 0.928992889 0.4525386 0.31333333
## 470 0.9141221 0.8085106 1.000000000 0.3595475 0.33555556
## 478 0.9141221 0.8085106 0.880427656 0.6396247 0.15555556
## 479 0.9141221 0.8085106 0.956629179 0.4497792 0.21333333
## 480 0.9141221 0.8085106 0.965757224 0.3140177 0.36444444
boxplot(BHDataScaled$crim)
remove_outliers <- function(x, na.rm = TRUE, ...) {
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
remove_outliers(BHDataScaled$crim)
## [1] 0.000000e+00 2.359225e-04 2.356977e-04 2.927957e-04 7.050701e-04
## [6] 2.644715e-04 9.213230e-04 1.553672e-03 2.303251e-03 1.840173e-03
## [11] 2.456674e-03 1.249299e-03 9.830293e-04 7.007315e-03 7.099481e-03
## [16] 6.980677e-03 1.177488e-02 8.743184e-03 8.951232e-03 8.086782e-03
## [21] 1.399878e-02 9.505689e-03 1.378163e-02 1.103868e-02 8.361706e-03
## [26] 9.376432e-03 7.481071e-03 1.067159e-02 8.617186e-03 1.119626e-02
## [31] 1.263900e-02 1.515569e-02 1.552964e-02 1.287402e-02 1.805667e-02
## [36] 6.502201e-04 1.024167e-03 8.297190e-04 1.896485e-03 2.395193e-04
## [41] 3.065082e-04 1.361360e-03 1.519391e-03 1.720133e-03 1.307971e-03
## [46] 1.855684e-03 2.046086e-03 2.505904e-03 2.782402e-03 2.399127e-03
## [51] 9.262685e-04 4.164331e-04 5.314158e-04 4.888171e-04 8.182544e-05
## [56] 7.631796e-05 1.599418e-04 8.991807e-05 1.664945e-03 1.089807e-03
## [61] 1.607286e-03 1.858944e-03 1.168373e-03 1.350794e-03 1.482524e-04
## [66] 3.317977e-04 4.211538e-04 5.796344e-04 1.452402e-03 1.369452e-03
## [71] 9.209858e-04 1.713389e-03 9.589762e-04 2.125101e-03 8.164561e-04
## [76] 9.980906e-04 1.070137e-03 9.076105e-04 5.635615e-04 8.716433e-04
## [81] 3.912560e-04 4.304828e-04 3.402275e-04 3.280886e-04 4.975841e-04
## [86] 5.735649e-04 5.120834e-04 7.327199e-04 5.651351e-04 5.248967e-04
## [91] 4.554350e-04 3.709120e-04 4.013718e-04 2.521078e-04 4.116000e-04
## [96] 1.300665e-03 1.221987e-03 1.287065e-03 8.491638e-04 7.000122e-04
## [101] 1.599867e-03 1.213894e-03 2.500172e-03 2.307410e-03 1.498035e-03
## [106] 1.419582e-03 1.853211e-03 1.403284e-03 1.367879e-03 2.892102e-03
## [111] 1.142072e-03 1.062382e-03 1.314715e-03 2.425540e-03 1.528495e-03
## [116] 1.854785e-03 1.407892e-03 1.625944e-03 1.396652e-03 1.556032e-03
## [121] 7.043957e-04 7.342934e-04 9.741499e-04 1.619200e-03 1.035969e-03
## [126] 1.828709e-03 4.282685e-03 2.841748e-03 3.586719e-03 9.834002e-03
## [131] 3.751157e-03 1.333732e-02 6.560984e-03 3.636062e-03 1.090088e-02
## [136] 6.198277e-03 3.555361e-03 3.889069e-03 2.736656e-03 6.049238e-03
## [141] 3.198611e-03 1.823449e-02 3.725677e-02 4.598275e-02 3.117257e-02
## [146] 2.667217e-02 2.415121e-02 2.655168e-02 2.612873e-02 3.065813e-02
## [151] 1.854875e-02 1.674724e-02 1.259145e-02 2.408523e-02 1.582030e-02
## [156] 3.966162e-02 2.742906e-02 1.368171e-02 1.502216e-02 1.594585e-02
## [161] 1.424235e-02 1.637678e-02 2.054010e-02 1.700238e-02 2.513255e-02
## [166] 3.279402e-02 2.252302e-02 2.016368e-02 2.578491e-02 2.746109e-02
## [171] 1.350007e-02 2.593664e-02 1.492865e-03 9.605498e-04 8.783872e-04
## [176] 6.779823e-04 7.182206e-04 5.387216e-04 6.755095e-04 5.786228e-04
## [181] 6.694400e-04 7.031593e-04 9.521200e-04 1.053840e-03 8.627639e-04
## [186] 6.086329e-04 5.586160e-04 8.140957e-04 1.342814e-03 8.697325e-04
## [191] 9.481861e-04 7.057445e-04 9.027774e-04 1.747783e-04 9.070485e-05
## [196] 8.418579e-05 3.797915e-04 4.534119e-04 3.524788e-04 2.830171e-04
## [201] 1.288076e-04 3.161744e-04 1.736543e-04 3.234803e-04 1.547715e-04
## [206] 1.462293e-03 2.510625e-03 2.761272e-03 1.456111e-03 4.826240e-03
## [211] 1.889853e-03 4.152641e-03 2.370128e-03 1.508376e-03 3.183437e-03
## [216] 2.154662e-03 4.414977e-04 7.172090e-04 1.173094e-03 1.213107e-03
## [221] 3.953810e-03 4.511527e-03 6.937629e-03 6.838045e-03 3.473198e-03
## [226] 5.851531e-03 4.224126e-03 4.564016e-03 3.280548e-03 4.894465e-03
## [231] 5.964715e-03 5.132524e-03 6.395086e-03 3.654608e-03 4.963365e-03
## [236] 3.643143e-03 5.780158e-03 5.681811e-03 8.555704e-04 9.688672e-04
## [241] 1.202317e-03 1.121728e-03 1.085536e-03 1.362821e-03 2.245254e-03
## [246] 2.079468e-03 3.748572e-03 2.138364e-03 1.776669e-03 2.072724e-03
## [251] 1.505903e-03 2.335285e-03 8.529853e-04 4.075761e-03 4.706087e-04
## [256] 3.277514e-04 1.018322e-04 6.802527e-03 7.386657e-03 7.309552e-03
## [261] 5.999671e-03 5.932345e-03 5.775213e-03 9.204688e-03 6.111619e-03
## [266] 8.489390e-03 8.760043e-03 6.429367e-03 6.004054e-03 9.478489e-04
## [271] 3.291451e-03 1.751042e-03 1.217041e-03 2.422842e-03 5.633367e-04
## [276] 1.008431e-03 1.105655e-03 6.176248e-04 8.256727e-04 2.293585e-03
## [281] 3.311233e-04 3.453978e-04 6.178495e-04 9.767350e-05 3.079694e-05
## [286] 5.215248e-05 1.498260e-04 3.640558e-04 4.448697e-04 4.119372e-04
## [291] 3.225811e-04 8.153321e-04 3.352820e-04 8.579308e-04 8.505126e-04
## [296] 1.382490e-03 5.327646e-04 1.514108e-03 6.557275e-04 5.540077e-04
## [301] 4.254249e-04 3.265150e-04 9.704408e-04 1.052941e-03 5.488374e-04
## [306] 5.447911e-04 7.722838e-04 4.833096e-04 5.469941e-03 3.856136e-03
## [311] 2.955112e-02 8.812983e-03 2.870297e-03 2.956731e-03 4.078684e-03
## [316] 2.778918e-03 3.506243e-03 2.685178e-03 4.447573e-03 5.273133e-03
## [321] 1.812748e-03 1.969993e-03 3.875694e-03 3.120157e-03 3.762734e-03
## [326] 2.085425e-03 3.339894e-03 2.638084e-03 6.726996e-04 6.847261e-04
## [331] 4.396994e-04 4.935378e-04 3.185348e-04 5.002817e-04 3.491069e-04
## [336] 3.741716e-04 3.141513e-04 2.707658e-04 3.005512e-04 5.468143e-04
## [341] 6.203223e-04 7.519399e-05 2.097339e-04 2.147918e-04 2.716650e-04
## [346] 2.788584e-04 6.215587e-04 1.391482e-04 9.767350e-05 2.548053e-04
## [351] 6.270661e-04 8.225256e-04 7.431729e-04 1.210522e-04 4.123868e-04
## [356] 1.127011e-03 1.008953e-01 4.319866e-02 5.839561e-02 4.782506e-02
## [361] 5.097905e-02 4.305412e-02 4.127127e-02 4.738761e-02 3.897903e-02
## [366] 5.113585e-02 4.148179e-02 NA 5.498378e-02 6.365817e-02
## [371] 7.342305e-02 NA 9.285086e-02 NA NA
## [376] NA NA NA NA NA
## [381] NA NA NA 8.976251e-02 NA
## [386] NA NA NA NA 9.155256e-02
## [391] 7.818185e-02 5.942157e-02 NA 9.709398e-02 NA
## [396] 9.790313e-02 6.592939e-02 8.616062e-02 NA NA
## [401] NA NA NA NA NA
## [406] NA NA NA 8.314690e-02 NA
## [411] NA NA NA NA NA
## [416] NA NA NA NA NA
## [421] NA 7.886118e-02 NA 7.917399e-02 9.875027e-02
## [426] NA NA NA 8.273350e-02 NA
## [431] 9.537846e-02 NA 7.235853e-02 6.265885e-02 NA
## [436] NA NA NA NA NA
## [441] NA NA 6.361760e-02 NA NA
## [446] NA 7.060536e-02 NA NA 8.451950e-02
## [451] 7.543452e-02 6.108607e-02 5.714125e-02 9.263551e-02 NA
## [456] 5.334446e-02 5.240549e-02 9.210151e-02 8.706216e-02 7.637248e-02
## [461] 5.401615e-02 4.143863e-02 7.472866e-02 6.535729e-02 8.804103e-02
## [466] 3.548707e-02 4.235883e-02 4.963433e-02 NA NA
## [471] 4.880832e-02 4.531972e-02 4.004007e-02 5.215889e-02 9.047410e-02
## [476] 7.178609e-02 5.468244e-02 NA NA NA
## [481] 6.538943e-02 6.408753e-02 6.434582e-02 3.160688e-02 2.666352e-02
## [486] 4.122013e-02 6.390286e-02 5.428073e-02 1.624595e-03 1.989999e-03
## [491] 2.260765e-03 1.117457e-03 1.180175e-03 1.876927e-03 3.071264e-03
## [496] 1.940769e-03 3.183999e-03 2.945491e-03 2.616616e-03 1.927731e-03
## [501] 2.450942e-03 6.329108e-04 4.377886e-04 6.118925e-04 1.160730e-03
## [506] 4.618417e-04
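Since remove_outliers() only flags outliers as NA, actually dropping them is a second step, e.g. with na.omit(). A minimal self-contained sketch, restating the function on a toy vector:

```r
# same rule as above: values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] become NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  x[x < (qnt[1] - H) | x > (qnt[2] + H)] <- NA
  x
}

v <- c(1, 2, 3, 4, 100)
na.omit(remove_outliers(v))  # 1 2 3 4 (the outlier 100 is dropped)
```

On the data frame, one could subset rows where the cleaned column is not NA.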
BHDataScaled$crim[!BHDataScaled$crim %in% boxplot.stats(BHDataScaled$crim)$out]
## [1] 0.000000e+00 2.359225e-04 2.356977e-04 2.927957e-04 7.050701e-04
## [6] 2.644715e-04 9.213230e-04 1.553672e-03 2.303251e-03 1.840173e-03
## [11] 2.456674e-03 1.249299e-03 9.830293e-04 7.007315e-03 7.099481e-03
## [16] 6.980677e-03 1.177488e-02 8.743184e-03 8.951232e-03 8.086782e-03
## [21] 1.399878e-02 9.505689e-03 1.378163e-02 1.103868e-02 8.361706e-03
## [26] 9.376432e-03 7.481071e-03 1.067159e-02 8.617186e-03 1.119626e-02
## [31] 1.263900e-02 1.515569e-02 1.552964e-02 1.287402e-02 1.805667e-02
## [36] 6.502201e-04 1.024167e-03 8.297190e-04 1.896485e-03 2.395193e-04
## [41] 3.065082e-04 1.361360e-03 1.519391e-03 1.720133e-03 1.307971e-03
## [46] 1.855684e-03 2.046086e-03 2.505904e-03 2.782402e-03 2.399127e-03
## [51] 9.262685e-04 4.164331e-04 5.314158e-04 4.888171e-04 8.182544e-05
## [56] 7.631796e-05 1.599418e-04 8.991807e-05 1.664945e-03 1.089807e-03
## [61] 1.607286e-03 1.858944e-03 1.168373e-03 1.350794e-03 1.482524e-04
## [66] 3.317977e-04 4.211538e-04 5.796344e-04 1.452402e-03 1.369452e-03
## [71] 9.209858e-04 1.713389e-03 9.589762e-04 2.125101e-03 8.164561e-04
## [76] 9.980906e-04 1.070137e-03 9.076105e-04 5.635615e-04 8.716433e-04
## ... (remaining per-observation values omitted for brevity)
Car Dataset

Now we move from regression to a classification task: the UCI Car Evaluation dataset, where the goal is to predict a car's Condition from six categorical attributes.

# Data Source: https://archive.ics.uci.edu/ml/machine-learning-databases/car/
library(randomForest)
# Load the dataset and explore
data1 <- read.table(url("https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"), sep = ",")
names(data1) <- c("BuyingPrice", "Maintenance", "NumDoors", "NUmPersons", "BootSpace", "Safety", "Condition")
head(data1)
## BuyingPrice Maintenance NumDoors NUmPersons BootSpace Safety Condition
## 1 vhigh vhigh 2 2 small low unacc
## 2 vhigh vhigh 2 2 small med unacc
## 3 vhigh vhigh 2 2 small high unacc
## 4 vhigh vhigh 2 2 med low unacc
## 5 vhigh vhigh 2 2 med med unacc
## 6 vhigh vhigh 2 2 med high unacc
str(data1)
## 'data.frame': 1728 obs. of 7 variables:
## $ BuyingPrice: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ Maintenance: Factor w/ 4 levels "high","low","med",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ NumDoors : Factor w/ 4 levels "2","3","4","5more": 1 1 1 1 1 1 1 1 1 1 ...
## $ NUmPersons : Factor w/ 3 levels "2","4","more": 1 1 1 1 1 1 1 1 1 2 ...
## $ BootSpace : Factor w/ 3 levels "big","med","small": 3 3 3 2 2 2 1 1 1 3 ...
## $ Safety : Factor w/ 3 levels "high","low","med": 2 3 1 2 3 1 2 3 1 2 ...
## $ Condition : Factor w/ 4 levels "acc","good","unacc",..: 3 3 3 3 3 3 3 3 3 3 ...
summary(data1)
## BuyingPrice Maintenance NumDoors NUmPersons BootSpace Safety
## high :432 high :432 2 :432 2 :576 big :576 high:576
## low :432 low :432 3 :432 4 :576 med :576 low :576
## med :432 med :432 4 :432 more:576 small:576 med :576
## vhigh:432 vhigh:432 5more:432
## Condition
## acc : 384
## good : 69
## unacc:1210
## vgood: 65
# Split into Train and Validation sets
# Training Set : Validation Set = 70 : 30 (random)
set.seed(100)
train <- sample(nrow(data1), 0.7*nrow(data1), replace = FALSE)
TrainSet <- data1[train,]
ValidSet <- data1[-train,]
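A plain random split like the one above does not guarantee that the class balance of Condition is preserved in both sets. When that matters, a stratified split can be drawn with caret's createDataPartition; this is only a sketch of an alternative, not what the rest of the post uses:

```r
# Alternative: stratified 70/30 split that preserves the Condition
# class proportions in both sets (sketch; not used below)
library(caret)

set.seed(100)
idx <- createDataPartition(data1$Condition, p = 0.7, list = FALSE)
TrainSetStrat <- data1[idx, ]
ValidSetStrat <- data1[-idx, ]

# Class proportions should now be nearly identical across the two sets
round(prop.table(table(TrainSetStrat$Condition)), 3)
round(prop.table(table(ValidSetStrat$Condition)), 3)
```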
summary(TrainSet)
## BuyingPrice Maintenance NumDoors NUmPersons BootSpace Safety
## high :298 high :303 2 :312 2 :407 big :406 high:396
## low :300 low :302 3 :298 4 :409 med :393 low :412
## med :306 med :312 4 :299 more:393 small:410 med :401
## vhigh:305 vhigh:292 5more:300
## Condition
## acc :260
## good : 46
## unacc:856
## vgood: 47
summary(ValidSet)
## BuyingPrice Maintenance NumDoors NUmPersons BootSpace Safety
## high :134 high :129 2 :120 2 :169 big :170 high:180
## low :132 low :130 3 :134 4 :167 med :183 low :164
## med :126 med :120 4 :133 more:183 small:166 med :175
## vhigh:127 vhigh:140 5more:132
## Condition
## acc :124
## good : 23
## unacc:354
## vgood: 18
# Create a Random Forest model with default parameters
model1 <- randomForest(Condition ~ ., data = TrainSet, importance = TRUE)
model1
##
## Call:
## randomForest(formula = Condition ~ ., data = TrainSet, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 3.64%
## Confusion matrix:
## acc good unacc vgood class.error
## acc 255 2 2 1 0.01923077
## good 7 35 0 4 0.23913043
## unacc 20 2 834 0 0.02570093
## vgood 6 0 0 41 0.12765957
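model1 used the package default for mtry. For classification, randomForest defaults to floor(sqrt(p)), where p is the number of predictors, which is why the output above reports "No. of variables tried at each split: 2". A quick check (assuming Condition is the only non-predictor column):

```r
# Default mtry for classification: floor(sqrt(p)),
# with p = number of predictor columns (6 here, excluding Condition)
p <- ncol(TrainSet) - 1
floor(sqrt(p))
## [1] 2
```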
# Fine tuning parameters of Random Forest model
model2 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
model2
##
## Call:
## randomForest(formula = Condition ~ ., data = TrainSet, ntree = 500, mtry = 6, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 6
##
## OOB estimate of error rate: 2.32%
## Confusion matrix:
## acc good unacc vgood class.error
## acc 248 6 5 1 0.04615385
## good 4 42 0 0 0.08695652
## unacc 8 2 846 0 0.01168224
## vgood 2 0 0 45 0.04255319
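The reported OOB error rate can be recovered directly from the confusion matrix above: the off-diagonal (misclassified) counts divided by the total number of training observations. A quick sanity check against model2's stored confusion matrix:

```r
# Recompute the OOB error rate from model2's confusion matrix:
# off-diagonal counts over the total (drop the class.error column)
cm <- model2$confusion[, 1:4]
round(100 * (sum(cm) - sum(diag(cm))) / sum(cm), 2)
## [1] 2.32
```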
# Predicting on train set
predTrain <- predict(model2, TrainSet, type = "class")
# Checking classification accuracy
table(predTrain, TrainSet$Condition)
##
## predTrain acc good unacc vgood
## acc 260 0 0 0
## good 0 46 0 0
## unacc 0 0 856 0
## vgood 0 0 0 47
# Predicting on Validation set
predValid <- predict(model2, ValidSet, type = "class")
# Checking classification accuracy
mean(predValid == ValidSet$Condition)
## [1] 0.9845857
table(predValid,ValidSet$Condition)
##
## predValid acc good unacc vgood
## acc 120 1 2 1
## good 1 22 0 0
## unacc 3 0 352 0
## vgood 0 0 0 17
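Overall accuracy hides per-class behavior. Since the table's rows are predictions and its columns are the true labels, per-class recall is the diagonal divided by the column sums:

```r
# Per-class recall on the validation set:
# correct predictions (diagonal) / true class counts (column sums)
cm <- table(predValid, ValidSet$Condition)
round(diag(cm) / colSums(cm), 3)
```

From the table above this works out to roughly 0.968 (acc), 0.957 (good), 0.994 (unacc) and 0.944 (vgood): the rare good and vgood classes are, unsurprisingly, the hardest.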
# To check important variables
importance(model2)
## acc good unacc vgood MeanDecreaseAccuracy
## BuyingPrice 144.13929 75.96633 111.10092 80.70126 200.4809
## Maintenance 134.88327 69.62648 104.31162 50.07345 182.8435
## NumDoors 32.35052 17.55486 47.57988 20.17438 57.3249
## NUmPersons 150.37837 50.89904 186.53684 57.04931 237.0746
## BootSpace 85.05941 55.85293 83.13938 63.39719 144.7800
## Safety 176.85992 82.14649 201.91053 110.32306 277.8490
## MeanDecreaseGini
## BuyingPrice 68.49384
## Maintenance 91.02632
## NumDoors 33.93850
## NUmPersons 122.51556
## BootSpace 75.31990
## Safety 151.59471
varImpPlot(model2)
# Using a for loop to identify the right mtry for the model
# Note: with only 6 predictors, mtry = 7 and 8 are invalid, so
# randomForest resets them to the valid range (hence the warnings)
a = c()
for (i in 3:8) {
model3 <- randomForest(Condition ~ ., data = TrainSet, ntree = 500, mtry = i, importance = TRUE)
predValid <- predict(model3, ValidSet, type = "class")
a[i-2] = mean(predValid == ValidSet$Condition)
}
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range
## Warning in randomForest.default(m, y, ...): invalid mtry: reset to within valid
## range
a
## [1] 0.9730250 0.9788054 0.9749518 0.9826590 0.9826590 0.9807322
plot(3:8,a)
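The manual loop works, but the randomForest package also ships tuneRF(), which searches over mtry using the OOB error instead of a held-out validation set. A sketch (the stepFactor and improve values here are illustrative, not tuned):

```r
# Built-in alternative to the manual loop: tuneRF() grows/shrinks
# mtry by stepFactor as long as the OOB error improves by at least
# `improve`; Condition is column 7 of TrainSet
set.seed(100)
tuned <- tuneRF(x = TrainSet[, -7], y = TrainSet$Condition,
                mtryStart = 2, ntreeTry = 500,
                stepFactor = 1.5, improve = 0.01, trace = TRUE)
tuned  # matrix of mtry values and their OOB errors
```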
library(rpart)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
##
## margin
library(e1071)
# We will compare the Random Forest model with a single Decision Tree model
model_dt = train(Condition ~ ., data = TrainSet, method = "rpart")
model_dt_1 = predict(model_dt, newdata = TrainSet)
table(model_dt_1, TrainSet$Condition)
##
## model_dt_1 acc good unacc vgood
## acc 202 38 75 47
## good 0 0 0 0
## unacc 58 8 781 0
## vgood 0 0 0 0
mean(model_dt_1 == TrainSet$Condition)
## [1] 0.8130687
# Running on Validation Set
model_dt_vs = predict(model_dt, newdata = ValidSet)
table(model_dt_vs, ValidSet$Condition)
##
## model_dt_vs acc good unacc vgood
## acc 101 22 41 18
## good 0 0 0 0
## unacc 23 1 313 0
## vgood 0 0 0 0
mean(model_dt_vs == ValidSet$Condition)
## [1] 0.7976879
So, on the validation set, the tuned Random Forest (about 98.5% accuracy) clearly outperforms the single decision tree (about 79.8%), which never predicts the rare good and vgood classes at all.