By Salerno | January 5, 2020
1 - Introduction
Could machine learning algorithms detect an abnormal cell process before it becomes clinically apparent?
We know this clinical battle is not easy, and many people are involved in the effort to identify a clear path to a cure.
As a complement to human decision-making, could technology reduce the subjective bias inherent in the process and improve our decisions?
Human judgment is limited when compared to the processing capacity of computers.
If we combine the experience of human beings with the growing capacity of CPUs, we can certainly reach new levels.
2 - Collecting the data
We will use the “Breast Cancer Wisconsin Diagnostic” dataset from the UCI Machine Learning Repository, available at http://archive.ics.uci.edu/ml.
path <- "C:/Users/andre/OneDrive/Área de Trabalho/salerno/blogdown/datasets/breast_cancer"
path <- paste0(path, "/wisc_bc_data.csv")
wbcd <- read.csv(path, stringsAsFactors = FALSE)
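The path above points to a local copy of the CSV. If you only have the raw file from the UCI repository, a rough sketch of reading it directly might look like the following; note that the URL and the header-free layout of wdbc.data are assumptions based on the repository's usual structure, so verify them before relying on this:
# sketch: read the raw UCI file instead of a local CSV (URL assumed, file has no header row)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
wbcd_raw <- read.csv(url, header = FALSE, stringsAsFactors = FALSE)
names(wbcd_raw)[1:2] <- c("id", "diagnosis")  # the remaining 30 columns are the numeric features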
The dataset contains 569 observations and 32 features. Let's confirm that below:
dim(wbcd)
## [1] 569 32
3 - Exploring the data
# drop the id column: a unique identifier has no predictive value
wbcd <- wbcd[-1]
str(wbcd)
## 'data.frame': 569 obs. of 31 variables:
## $ diagnosis : chr "B" "B" "B" "B" ...
## $ radius_mean : num 12.3 10.6 11 11.3 15.2 ...
## $ texture_mean : num 12.4 18.9 16.8 13.4 13.2 ...
## $ perimeter_mean : num 78.8 69.3 70.9 73 97.7 ...
## $ area_mean : num 464 346 373 385 712 ...
## $ smoothness_mean : num 0.1028 0.0969 0.1077 0.1164 0.0796 ...
## $ compactness_mean : num 0.0698 0.1147 0.078 0.1136 0.0693 ...
## $ concavity_mean : num 0.0399 0.0639 0.0305 0.0464 0.0339 ...
## $ points_mean : num 0.037 0.0264 0.0248 0.048 0.0266 ...
## $ symmetry_mean : num 0.196 0.192 0.171 0.177 0.172 ...
## $ dimension_mean : num 0.0595 0.0649 0.0634 0.0607 0.0554 ...
## $ radius_se : num 0.236 0.451 0.197 0.338 0.178 ...
## $ texture_se : num 0.666 1.197 1.387 1.343 0.412 ...
## $ perimeter_se : num 1.67 3.43 1.34 1.85 1.34 ...
## $ area_se : num 17.4 27.1 13.5 26.3 17.7 ...
## $ smoothness_se : num 0.00805 0.00747 0.00516 0.01127 0.00501 ...
## $ compactness_se : num 0.0118 0.03581 0.00936 0.03498 0.01485 ...
## $ concavity_se : num 0.0168 0.0335 0.0106 0.0219 0.0155 ...
## $ points_se : num 0.01241 0.01365 0.00748 0.01965 0.00915 ...
## $ symmetry_se : num 0.0192 0.035 0.0172 0.0158 0.0165 ...
## $ dimension_se : num 0.00225 0.00332 0.0022 0.00344 0.00177 ...
## $ radius_worst : num 13.5 11.9 12.4 11.9 16.2 ...
## $ texture_worst : num 15.6 22.9 26.4 15.8 15.7 ...
## $ perimeter_worst : num 87 78.3 79.9 76.5 104.5 ...
## $ area_worst : num 549 425 471 434 819 ...
## $ smoothness_worst : num 0.139 0.121 0.137 0.137 0.113 ...
## $ compactness_worst: num 0.127 0.252 0.148 0.182 0.174 ...
## $ concavity_worst : num 0.1242 0.1916 0.1067 0.0867 0.1362 ...
## $ points_worst : num 0.0939 0.0793 0.0743 0.0861 0.0818 ...
## $ symmetry_worst : num 0.283 0.294 0.3 0.21 0.249 ...
## $ dimension_worst : num 0.0677 0.0759 0.0788 0.0678 0.0677 ...
Since the diagnosis column is stored as a character vector, we recode it as a factor: 0 for benign and 1 for malignant.
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c(0, 1))
table(wbcd$diagnosis)
##
## 0 1
## 357 212
round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
##
## 0 1
## 62.7 37.3
4 - Pre-processing
Some R machine learning classifiers require that the target feature be coded as a factor, which is why we recoded diagnosis in the previous section.
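A quick sanity check confirms the recoding:
# confirm the target is now a factor with levels 0 (benign) and 1 (malignant)
is.factor(wbcd$diagnosis)
levels(wbcd$diagnosis)
Next, let's look at the scale of a few of the numeric features: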
summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
## radius_mean area_mean smoothness_mean
## Min. : 6.981 Min. : 143.5 Min. :0.05263
## 1st Qu.:11.700 1st Qu.: 420.3 1st Qu.:0.08637
## Median :13.370 Median : 551.1 Median :0.09587
## Mean :14.127 Mean : 654.9 Mean :0.09636
## 3rd Qu.:15.780 3rd Qu.: 782.7 3rd Qu.:0.10530
## Max. :28.110 Max. :2501.0 Max. :0.16340
The summary shows that the features are on very different scales: area_mean runs up to 2501 while smoothness_mean never exceeds 0.17. Because kNN's distance calculation would be dominated by the larger-scale features, we need to rescale them.
5 - Transformation – normalizing numeric data
# min-max normalization: rescale a numeric vector to the [0, 1] range
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
Let’s check that the normalize function works as expected:
normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00
normalize(c(10, 20, 30, 40, 50))
## [1] 0.00 0.25 0.50 0.75 1.00
# apply normalize() to every numeric feature (columns 2 to 31) and rebuild a data frame
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1174 0.1729 0.2169 0.2711 1.0000
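One caveat: we normalized the full dataset before splitting it, so the test records influence each feature's min and max. A stricter variant, sketched below under the 469/100 split used in the next section, would compute the scaling from the training rows only; this is a refinement, not what the rest of this post does:
# sketch: derive min-max scaling from the training rows only, then apply it to the test rows
train_raw <- wbcd[1:469, 2:31]
test_raw  <- wbcd[470:569, 2:31]
mins <- sapply(train_raw, min)
maxs <- sapply(train_raw, max)
wbcd_train_n <- as.data.frame(scale(train_raw, center = mins, scale = maxs - mins))
wbcd_test_n  <- as.data.frame(scale(test_raw,  center = mins, scale = maxs - mins))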
6 - Data preparation – creating training and test datasets
We take the first 469 records for training and the remaining 100 for testing, keeping the corresponding labels from the original data frame:
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
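This sequential split works here because the rows of this file are not ordered by diagnosis; when that cannot be assumed, a randomized split is safer. A minimal sketch (the seed value is arbitrary):
# sketch: a randomized train/test split (safer when row order is unknown)
set.seed(123)                                   # arbitrary seed, for reproducibility
train_idx <- sample(nrow(wbcd_n), size = 469)   # draw 469 row indices at random
wbcd_train <- wbcd_n[train_idx, ]
wbcd_test  <- wbcd_n[-train_idx, ]
wbcd_train_labels <- wbcd[train_idx, 1]
wbcd_test_labels  <- wbcd[-train_idx, 1]
The rest of the post keeps the sequential split so that the outputs shown below stay reproducible.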
7 - Training a model on the data
The knn() function from the class package performs the classification in a single step. We set k = 21: roughly the square root of the 469 training examples, and an odd number, which avoids tied votes between the two classes.
library(class)
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
class(wbcd_test_pred)
## [1] "factor"
8 - Evaluating Model Performance
The CrossTable() function from the gmodels package compares the predicted labels with the true labels of the 100 test cases:
library(gmodels)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | 0 | 1 | Row Total |
## -----------------|-----------|-----------|-----------|
## 0 | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.968 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## 1 | 2 | 37 | 39 |
## | 0.051 | 0.949 | 0.390 |
## | 0.032 | 1.000 | |
## | 0.020 | 0.370 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 63 | 37 | 100 |
## | 0.630 | 0.370 | |
## -----------------|-----------|-----------|-----------|
##
##
The model labeled 98 of the 100 test cases correctly; the only errors are 2 false negatives, malignant masses predicted as benign.
9 - Improving Model Performance
As an alternative to min-max normalization, we can standardize the features to z-scores, which scale() computes by default:
# z-score standardization: center each column at its mean, divide by its standard deviation
wbcd_z <- as.data.frame(scale(wbcd[-1]))
summary(wbcd_z$area_mean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.4532 -0.6666 -0.2949 0.0000 0.3632 5.2459
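scale() with default arguments subtracts each column's mean and divides by its standard deviation; a quick check on one feature confirms it:
# z-score by hand for one column; should report TRUE against the scale() result
z_by_hand <- (wbcd$area_mean - mean(wbcd$area_mean)) / sd(wbcd$area_mean)
all.equal(z_by_hand, wbcd_z$area_mean)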
wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
           prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 100
##
##
## | wbcd_test_pred
## wbcd_test_labels | 0 | 1 | Row Total |
## -----------------|-----------|-----------|-----------|
## 0 | 61 | 0 | 61 |
## | 1.000 | 0.000 | 0.610 |
## | 0.924 | 0.000 | |
## | 0.610 | 0.000 | |
## -----------------|-----------|-----------|-----------|
## 1 | 5 | 34 | 39 |
## | 0.128 | 0.872 | 0.390 |
## | 0.076 | 1.000 | |
## | 0.050 | 0.340 | |
## -----------------|-----------|-----------|-----------|
## Column Total | 66 | 34 | 100 |
## | 0.660 | 0.340 | |
## -----------------|-----------|-----------|-----------|
##
##
With the z-score transformation the false negatives actually increase from 2 to 5, so standardization did not improve this model (95% accuracy versus 98%). The caret package summarizes the same table along with accuracy, kappa, sensitivity, and specificity:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
confusionMatrix(wbcd_test_labels, wbcd_test_pred, positive = "0")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 61 0
## 1 5 34
##
## Accuracy : 0.95
## 95% CI : (0.8872, 0.9836)
## No Information Rate : 0.66
## P-Value [Acc > NIR] : 2.729e-12
##
## Kappa : 0.8924
##
## Mcnemar's Test P-Value : 0.07364
##
## Sensitivity : 0.9242
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.8718
## Prevalence : 0.6600
## Detection Rate : 0.6100
## Detection Prevalence : 0.6100
## Balanced Accuracy : 0.9621
##
## 'Positive' Class : 0
##
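The headline accuracy is simply the proportion of matching labels, which we can verify directly (this should reproduce the 0.95 above):
# accuracy by hand: fraction of test cases where the prediction equals the true label
mean(wbcd_test_pred == wbcd_test_labels)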
The vcd package computes the kappa statistic directly from the contingency table:
library(vcd)
## Loading required package: grid
Kappa(table(wbcd_test_labels, wbcd_test_pred))
## value ASE z Pr(>|z|)
## Unweighted 0.8924 0.04662 19.14 1.098e-81
## Weighted 0.8924 0.04662 19.14 1.098e-81
caret also exposes these statistics individually. Note that here we take the malignant class (1) as the positive class, so the values differ from the confusionMatrix() output above, which treated 0 as positive:
library(caret)
sensitivity(wbcd_test_pred, wbcd_test_labels,
            positive = "1")
## [1] 0.8717949
specificity(wbcd_test_pred, wbcd_test_labels,
            negative = "0")
## [1] 1
posPredValue(wbcd_test_pred, wbcd_test_labels,
             positive = "1")
## [1] 1
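Another common way to try to improve performance is to vary k. Here is a minimal sketch that reuses the z-score objects above; the accuracies it prints depend on the split, so treat the idea, not any particular k, as the takeaway:
# try several candidate values of k and report the test-set accuracy of each
for (k in c(1, 5, 11, 15, 21, 27)) {
  pred <- knn(train = wbcd_train, test = wbcd_test,
              cl = wbcd_train_labels, k = k)
  cat("k =", k, "accuracy =", mean(pred == wbcd_test_labels), "\n")
}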
10 - Conclusion
In this post we learned about classification using k-nearest neighbors. Unlike many classification algorithms, kNN performs no real learning.
The algorithm simply stores the training data verbatim. Unlabeled test examples are then matched to the most similar records in the training set using a distance function, and each unlabeled example is assigned the majority label of its neighbors.
Even though kNN is considered a simple algorithm, it is capable of tackling complex tasks.
It makes no mathematical assumptions about the underlying data, and it does not demand cutting-edge hardware.
Its one important assumption is that records close to each other in feature space can be considered similar.
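To make that mechanism concrete, here is a toy sketch of what a single kNN prediction does internally. It is illustrative only, not the class package's implementation, and knn_one is a hypothetical helper name:
# toy kNN for one record: Euclidean distance to every training row, then a majority vote
knn_one <- function(train, labels, new_point, k = 21) {
  new_vec <- as.numeric(unlist(new_point))                     # coerce the record to a numeric vector
  dists <- sqrt(rowSums(sweep(as.matrix(train), 2, new_vec)^2))
  nearest <- labels[order(dists)[1:k]]                         # labels of the k closest training rows
  names(which.max(table(nearest)))                             # the most common label wins
}
knn_one(wbcd_train, wbcd_train_labels, wbcd_test[1, ])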