By Salerno | January 13, 2020
1 - Objective
The objective of this example is to identify each of a number of benign or malignant classes.
2 - Data
Let’s getting the data.
BCData <- read.table(url("https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data"), sep = ",")
# setting column names
names(BCData)<- c('Id', 'ClumpThickness', 'CellSize','CellShape',
'MarginalAdhesion','SECellSize', 'BareNuclei',
'BlandChromatin','NormalNucleoli',
'Mitoses','Class')
3 - EDA - Exploratory Data Analysis
It’s important to extract prelimionary knowledge from the dataset.
dim(BCData)
## [1] 699 11
str(BCData)
## 'data.frame': 699 obs. of 11 variables:
## $ Id : int 1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
## $ ClumpThickness : int 5 5 3 6 4 8 1 2 2 4 ...
## $ CellSize : int 1 4 1 8 1 10 1 1 1 2 ...
## $ CellShape : int 1 4 1 8 1 10 1 2 1 1 ...
## $ MarginalAdhesion: int 1 5 1 1 3 8 1 1 1 1 ...
## $ SECellSize : int 2 7 2 3 2 7 2 2 2 2 ...
## $ BareNuclei : Factor w/ 11 levels "?","1","10","2",..: 2 3 4 6 2 3 3 2 2 2 ...
## $ BlandChromatin : int 3 3 3 3 3 9 3 3 1 2 ...
## $ NormalNucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : int 2 2 2 2 2 4 2 2 2 2 ...
# Remove rows with missing values
BCData<-BCData[!(BCData$BareNuclei=="?"),]
# convert variable to an integer
BCData$BareNuclei<-as.integer(BCData$BareNuclei)
str(BCData)
## 'data.frame': 683 obs. of 11 variables:
## $ Id : int 1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
## $ ClumpThickness : int 5 5 3 6 4 8 1 2 2 4 ...
## $ CellSize : int 1 4 1 8 1 10 1 1 1 2 ...
## $ CellShape : int 1 4 1 8 1 10 1 2 1 1 ...
## $ MarginalAdhesion: int 1 5 1 1 3 8 1 1 1 1 ...
## $ SECellSize : int 2 7 2 3 2 7 2 2 2 2 ...
## $ BareNuclei : int 2 3 4 6 2 3 3 2 2 2 ...
## $ BlandChromatin : int 3 3 3 3 3 9 3 3 1 2 ...
## $ NormalNucleoli : int 1 2 1 7 1 7 1 1 1 1 ...
## $ Mitoses : int 1 1 1 1 1 1 1 1 5 1 ...
## $ Class : int 2 2 2 2 2 4 2 2 2 2 ...
summary(BCData)
## Id ClumpThickness CellSize CellShape
## Min. : 63375 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 877617 1st Qu.: 2.000 1st Qu.: 1.000 1st Qu.: 1.000
## Median : 1171795 Median : 4.000 Median : 1.000 Median : 1.000
## Mean : 1076720 Mean : 4.442 Mean : 3.151 Mean : 3.215
## 3rd Qu.: 1238705 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.: 5.000
## Max. :13454352 Max. :10.000 Max. :10.000 Max. :10.000
## MarginalAdhesion SECellSize BareNuclei BlandChromatin
## Min. : 1.00 Min. : 1.000 Min. : 2.000 Min. : 1.000
## 1st Qu.: 1.00 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000
## Median : 1.00 Median : 2.000 Median : 2.000 Median : 3.000
## Mean : 2.83 Mean : 3.234 Mean : 3.217 Mean : 3.445
## 3rd Qu.: 4.00 3rd Qu.: 4.000 3rd Qu.: 3.000 3rd Qu.: 5.000
## Max. :10.00 Max. :10.000 Max. :11.000 Max. :10.000
## NormalNucleoli Mitoses Class
## Min. : 1.00 Min. : 1.000 Min. :2.0
## 1st Qu.: 1.00 1st Qu.: 1.000 1st Qu.:2.0
## Median : 1.00 Median : 1.000 Median :2.0
## Mean : 2.87 Mean : 1.603 Mean :2.7
## 3rd Qu.: 4.00 3rd Qu.: 1.000 3rd Qu.:4.0
## Max. :10.00 Max. :10.000 Max. :4.0
table(BCData$Class)
##
## 2 4
## 444 239
The number 2 means “benign” and number 4 means “malignant”.
boxplot(BCData[,2:10])
To better identify the presence of outliers, we can plot histograms of the variables in the database
par(mfrow=c(3, 3))
hist(BCData$ClumpThickness)
hist(BCData$CellSize)
hist(BCData$CellShape)
hist(BCData$MarginalAdhesion)
hist(BCData$SECellSize)
hist(as.numeric(BCData$BareNuclei))
hist(BCData$BlandChromatin)
hist(BCData$NormalNucleoli)
hist(BCData$Mitoses)
4 - Model Fitting
BCData$Class<-replace(BCData$Class,BCData$Class==2,0)
BCData$Class<-replace(BCData$Class,BCData$Class==4,1)
table(BCData$Class)
##
## 0 1
## 444 239
It is time to build the logistic regression model to obtain a tool capable of predicting the class of new observations.
LoGModel <- glm(Class ~.-Id,
family=binomial(link='logit'),data=BCData)
summary(LoGModel)
##
## Call:
## glm(formula = Class ~ . - Id, family = binomial(link = "logit"),
## data = BCData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6922 -0.1128 -0.0588 0.0206 2.6233
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.05697 1.24336 -8.893 < 2e-16 ***
## ClumpThickness 0.58181 0.12937 4.497 6.88e-06 ***
## CellSize -0.05776 0.20026 -0.288 0.773024
## CellShape 0.52133 0.21391 2.437 0.014802 *
## MarginalAdhesion 0.38661 0.10949 3.531 0.000414 ***
## SECellSize 0.18570 0.14787 1.256 0.209179
## BareNuclei 0.21915 0.11224 1.952 0.050887 .
## BlandChromatin 0.61953 0.16219 3.820 0.000134 ***
## NormalNucleoli 0.14997 0.10936 1.371 0.170251
## Mitoses 0.58935 0.35733 1.649 0.099087 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 884.35 on 682 degrees of freedom
## Residual deviance: 117.91 on 673 degrees of freedom
## AIC: 137.91
##
## Number of Fisher Scoring iterations: 8
anova(LoGModel, test="Chisq")
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Class
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 682 884.35
## ClumpThickness 1 425.87 681 458.48 < 2.2e-16 ***
## CellSize 1 261.91 680 196.58 < 2.2e-16 ***
## CellShape 1 20.08 679 176.50 7.427e-06 ***
## MarginalAdhesion 1 21.39 678 155.11 3.750e-06 ***
## SECellSize 1 6.45 677 148.66 0.011112 *
## BareNuclei 1 6.98 676 141.68 0.008248 **
## BlandChromatin 1 18.42 675 123.26 1.767e-05 ***
## NormalNucleoli 1 2.09 674 121.17 0.148710
## Mitoses 1 3.26 673 117.91 0.070868 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
LGModelPred <- round(predict(LoGModel, type="response"))
table(LGModelPred)
## LGModelPred
## 0 1
## 440 243
table(BCData$Class, LGModelPred)
## LGModelPred
## 0 1
## 0 432 12
## 1 8 231
library(caret)
## Carregando pacotes exigidos: lattice
## Carregando pacotes exigidos: ggplot2
confusionMatrix(table(LGModelPred, BCData$Class), positive="1")
## Confusion Matrix and Statistics
##
##
## LGModelPred 0 1
## 0 432 8
## 1 12 231
##
## Accuracy : 0.9707
## 95% CI : (0.9551, 0.982)
## No Information Rate : 0.6501
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9359
##
## Mcnemar's Test P-Value : 0.5023
##
## Sensitivity : 0.9665
## Specificity : 0.9730
## Pos Pred Value : 0.9506
## Neg Pred Value : 0.9818
## Prevalence : 0.3499
## Detection Rate : 0.3382
## Detection Prevalence : 0.3558
## Balanced Accuracy : 0.9698
##
## 'Positive' Class : 1
##
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
RocObj <- pROC::roc(BCData$Class, LGModelPred, levels = c(0, 1), direction = "<")
print(RocObj)
##
## Call:
## roc.default(response = BCData$Class, predictor = LGModelPred, levels = c(0, 1), direction = "<")
##
## Data: LGModelPred in 444 controls (BCData$Class 0) < 239 cases (BCData$Class 1).
## Area under the curve: 0.9698
plot.roc(RocObj)
plot(RocObj, print.auc=TRUE, auc.polygon=TRUE, grid=c(0.1, 0.2),
grid.col=c("green", "red"), max.auc.polygon=TRUE,
auc.polygon.col="blue", print.thres=TRUE)
comments powered by Disqus