Diagnosing breast cancer with the kNN algorithm

By Salerno | January 5, 2020

1 - Introduction

Could machine learning algorithms detect an abnormal cell process ahead of time?

We know that this clinical battle is not an easy one, and that many people are involved in the effort to find a clear path to a cure.

As a complement to the human decision process, could technology reduce the subjective bias inherent in that process and improve our decisions?

We know that human processing capacity is limited when compared with that of computers.

If we combine the experience of human beings with the ever-increasing capacity of CPUs, we can certainly reach new levels.

2 - Collecting the data

We will utilize the “Breast Cancer Wisconsin Diagnostic” dataset from the UCI Machine Learning Repository, which is available at http://archive.ics.uci.edu/ml.


path <- "C:/Users/andre/OneDrive/Área de Trabalho/salerno/blogdown/datasets/breast_cancer"

path <- paste0(path, "/wisc_bc_data.csv")

wbcd <- read.csv(path, stringsAsFactors = FALSE)
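
If the book's CSV is not available locally, the raw data can also be read directly from the UCI repository. This is a sketch, assuming the wdbc.data layout (id, diagnosis, then 30 unnamed numeric features, with no header row); the remaining feature names would still need to be assigned to match the CSV:

# read the raw data straight from UCI (assumed layout: no header,
# column 1 = id, column 2 = diagnosis, columns 3:32 = features)
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data"
wbcd_raw <- read.csv(url, header = FALSE, stringsAsFactors = FALSE)
names(wbcd_raw)[1:2] <- c("id", "diagnosis")  # columns 3:32 still need feature names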

This dataset contains 569 observations and 32 features. Let’s check it out below:


dim(wbcd)
## [1] 569  32

3 - Exploring the data


# drop the id column: it is a unique identifier with no predictive value
wbcd <- wbcd[-1]

str(wbcd)
## 'data.frame':    569 obs. of  31 variables:
##  $ diagnosis        : chr  "B" "B" "B" "B" ...
##  $ radius_mean      : num  12.3 10.6 11 11.3 15.2 ...
##  $ texture_mean     : num  12.4 18.9 16.8 13.4 13.2 ...
##  $ perimeter_mean   : num  78.8 69.3 70.9 73 97.7 ...
##  $ area_mean        : num  464 346 373 385 712 ...
##  $ smoothness_mean  : num  0.1028 0.0969 0.1077 0.1164 0.0796 ...
##  $ compactness_mean : num  0.0698 0.1147 0.078 0.1136 0.0693 ...
##  $ concavity_mean   : num  0.0399 0.0639 0.0305 0.0464 0.0339 ...
##  $ points_mean      : num  0.037 0.0264 0.0248 0.048 0.0266 ...
##  $ symmetry_mean    : num  0.196 0.192 0.171 0.177 0.172 ...
##  $ dimension_mean   : num  0.0595 0.0649 0.0634 0.0607 0.0554 ...
##  $ radius_se        : num  0.236 0.451 0.197 0.338 0.178 ...
##  $ texture_se       : num  0.666 1.197 1.387 1.343 0.412 ...
##  $ perimeter_se     : num  1.67 3.43 1.34 1.85 1.34 ...
##  $ area_se          : num  17.4 27.1 13.5 26.3 17.7 ...
##  $ smoothness_se    : num  0.00805 0.00747 0.00516 0.01127 0.00501 ...
##  $ compactness_se   : num  0.0118 0.03581 0.00936 0.03498 0.01485 ...
##  $ concavity_se     : num  0.0168 0.0335 0.0106 0.0219 0.0155 ...
##  $ points_se        : num  0.01241 0.01365 0.00748 0.01965 0.00915 ...
##  $ symmetry_se      : num  0.0192 0.035 0.0172 0.0158 0.0165 ...
##  $ dimension_se     : num  0.00225 0.00332 0.0022 0.00344 0.00177 ...
##  $ radius_worst     : num  13.5 11.9 12.4 11.9 16.2 ...
##  $ texture_worst    : num  15.6 22.9 26.4 15.8 15.7 ...
##  $ perimeter_worst  : num  87 78.3 79.9 76.5 104.5 ...
##  $ area_worst       : num  549 425 471 434 819 ...
##  $ smoothness_worst : num  0.139 0.121 0.137 0.137 0.113 ...
##  $ compactness_worst: num  0.127 0.252 0.148 0.182 0.174 ...
##  $ concavity_worst  : num  0.1242 0.1916 0.1067 0.0867 0.1362 ...
##  $ points_worst     : num  0.0939 0.0793 0.0743 0.0861 0.0818 ...
##  $ symmetry_worst   : num  0.283 0.294 0.3 0.21 0.249 ...
##  $ dimension_worst  : num  0.0677 0.0759 0.0788 0.0678 0.0677 ...

# recode the diagnosis as a factor: 0 = benign, 1 = malignant
wbcd$diagnosis <- factor(wbcd$diagnosis, levels = c("B", "M"),
                         labels = c(0, 1))

table(wbcd$diagnosis)
## 
##   0   1 
## 357 212

round(prop.table(table(wbcd$diagnosis)) * 100, digits = 1)
## 
##    0    1 
## 62.7 37.3

4 - Pre-processing

Many R machine learning classifiers require the target feature to be coded as a factor, which is why diagnosis was recoded above. The numeric features, in turn, sit on very different scales:


summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
##   radius_mean       area_mean      smoothness_mean  
##  Min.   : 6.981   Min.   : 143.5   Min.   :0.05263  
##  1st Qu.:11.700   1st Qu.: 420.3   1st Qu.:0.08637  
##  Median :13.370   Median : 551.1   Median :0.09587  
##  Mean   :14.127   Mean   : 654.9   Mean   :0.09636  
##  3rd Qu.:15.780   3rd Qu.: 782.7   3rd Qu.:0.10530  
##  Max.   :28.110   Max.   :2501.0   Max.   :0.16340
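
Note how different the scales are: area_mean ranges from about 143 to 2501, while smoothness_mean never exceeds 0.17. Since kNN is distance-based, the feature with the largest scale would dominate the calculation. A toy illustration, using the radius and area values of two observations from the str() output above:

# two observations described by radius_mean and area_mean only
a <- c(radius = 12.3, area = 464)
b <- c(radius = 15.2, area = 712)

# the Euclidean distance (~248) is driven almost entirely by 'area';
# the radius difference of 2.9 barely registers
sqrt(sum((a - b)^2))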

5 - Transformation – normalizing numeric data


# min-max normalization: rescales x to the [0, 1] range
normalize <- function(x) {
    return ((x - min(x)) / (max(x) - min(x)))
}

Let’s check that the normalize function works as expected:


normalize(c(1, 2, 3, 4, 5))
## [1] 0.00 0.25 0.50 0.75 1.00

normalize(c(10, 20, 30, 40, 50))
## [1] 0.00 0.25 0.50 0.75 1.00

# apply the normalization to the 30 numeric features (columns 2 to 31)
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))

summary(wbcd_n$area_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1174  0.1729  0.2169  0.2711  1.0000

6 - Data preparation – creating training and test datasets


# first 469 rows for training, the remaining 100 for testing
wbcd_train <- wbcd_n[1:469, ]
wbcd_test <- wbcd_n[470:569, ]

# the labels come from the diagnosis column of the original data frame
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
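
This index-based split is only safe because the rows of this CSV already arrive in random order; if they did not, a random split would be needed. A minimal sketch using sample() (the seed value is arbitrary):

set.seed(123)  # arbitrary seed, for reproducibility
idx <- sample(nrow(wbcd_n), size = 469)  # random row numbers for training
wbcd_train <- wbcd_n[idx, ]
wbcd_test  <- wbcd_n[-idx, ]
wbcd_train_labels <- wbcd$diagnosis[idx]
wbcd_test_labels  <- wbcd$diagnosis[-idx]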

7 - Training a model on the data


library(class)

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                        cl = wbcd_train_labels, k=21)

class(wbcd_test_pred)
## [1] "factor"

8 - Evaluating model performance


library(gmodels)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
             prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |         0 |         1 | Row Total | 
## -----------------|-----------|-----------|-----------|
##                0 |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.968 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##                1 |         2 |        37 |        39 | 
##                  |     0.051 |     0.949 |     0.390 | 
##                  |     0.032 |     1.000 |           | 
##                  |     0.020 |     0.370 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        63 |        37 |       100 | 
##                  |     0.630 |     0.370 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

9 - Improving model performance


# z-score standardization: an alternative to min-max normalization
wbcd_z <- as.data.frame(scale(wbcd[-1]))

summary(wbcd_z$area_mean)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.4532 -0.6666 -0.2949  0.0000  0.3632  5.2459
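
scale() applies the z-score transformation (x - mean) / sd to each column. A quick sketch to confirm this against a manual computation for one feature:

# manual z-score for one feature; should match the scale() result
z_manual <- (wbcd$area_mean - mean(wbcd$area_mean)) / sd(wbcd$area_mean)
all.equal(z_manual, wbcd_z$area_mean)  # expected: TRUE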

wbcd_train <- wbcd_z[1:469, ]
wbcd_test <- wbcd_z[470:569, ]
wbcd_train_labels <- wbcd[1:469, 1]
wbcd_test_labels <- wbcd[470:569, 1]
wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                        cl = wbcd_train_labels, k=21)
CrossTable(x = wbcd_test_labels, y = wbcd_test_pred,
             prop.chisq=FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  100 
## 
##  
##                  | wbcd_test_pred 
## wbcd_test_labels |         0 |         1 | Row Total | 
## -----------------|-----------|-----------|-----------|
##                0 |        61 |         0 |        61 | 
##                  |     1.000 |     0.000 |     0.610 | 
##                  |     0.924 |     0.000 |           | 
##                  |     0.610 |     0.000 |           | 
## -----------------|-----------|-----------|-----------|
##                1 |         5 |        34 |        39 | 
##                  |     0.128 |     0.872 |     0.390 | 
##                  |     0.076 |     1.000 |           | 
##                  |     0.050 |     0.340 |           | 
## -----------------|-----------|-----------|-----------|
##     Column Total |        66 |        34 |       100 | 
##                  |     0.660 |     0.340 |           | 
## -----------------|-----------|-----------|-----------|
## 
## 

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2

# note: caret's convention is confusionMatrix(data = predictions, reference = labels);
# the labels are passed first here, so the prediction/reference roles are swapped below
confusionMatrix(wbcd_test_labels, wbcd_test_pred, positive = "0")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 61  0
##          1  5 34
##                                           
##                Accuracy : 0.95            
##                  95% CI : (0.8872, 0.9836)
##     No Information Rate : 0.66            
##     P-Value [Acc > NIR] : 2.729e-12       
##                                           
##                   Kappa : 0.8924          
##                                           
##  Mcnemar's Test P-Value : 0.07364         
##                                           
##             Sensitivity : 0.9242          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.8718          
##              Prevalence : 0.6600          
##          Detection Rate : 0.6100          
##    Detection Prevalence : 0.6100          
##       Balanced Accuracy : 0.9621          
##                                           
##        'Positive' Class : 0               
## 

library(vcd)
## Loading required package: grid

Kappa(table(wbcd_test_labels, wbcd_test_pred))
##             value     ASE     z  Pr(>|z|)
## Unweighted 0.8924 0.04662 19.14 1.098e-81
## Weighted   0.8924 0.04662 19.14 1.098e-81

library(caret)

sensitivity(wbcd_test_pred, wbcd_test_labels,
              positive = "1")
## [1] 0.8717949

specificity(wbcd_test_pred, wbcd_test_labels,
              negative = "0")
## [1] 1

posPredValue(wbcd_test_pred, wbcd_test_labels,
               positive = "1")
## [1] 1
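
These helpers agree with the counts read directly off the cross table. A quick manual check of the sensitivity for the malignant class:

# 34 true positives and 5 false negatives for class "1" in the table above
tp <- sum(wbcd_test_pred == "1" & wbcd_test_labels == "1")
fn <- sum(wbcd_test_pred == "0" & wbcd_test_labels == "1")
tp / (tp + fn)  # 34 / 39 = 0.8717949, matching sensitivity() above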

10 - Conclusion

In this post we learned about classification using k-nearest neighbors. Unlike many classification algorithms, kNN does not do any learning in the usual sense.

The algorithm simply stores the training data verbatim. Unlabeled test examples are then matched to the most similar records in the training set using a distance function, and each unlabeled example is assigned the majority label of its k nearest neighbors.
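
To make this concrete, here is a minimal from-scratch sketch of a single kNN prediction (illustrative only; the model above uses class::knn):

knn_one <- function(train, labels, query, k) {
  # Euclidean distance from the query point to every training row
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  # majority vote among the k nearest neighbors
  votes <- labels[order(d)[1:k]]
  names(which.max(table(votes)))
}

# predict the first test example, as class::knn did above
knn_one(as.matrix(wbcd_train), wbcd_train_labels, as.matrix(wbcd_test)[1, ], k = 21)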

Even though kNN is considered a simple algorithm, it is capable of tackling complex tasks.

It makes no mathematical assumptions about the underlying data, and it does not demand the latest or most powerful hardware.

Its most important assumption is that examples located near one another can be treated as similar.
