Modeling with tidymodels in R

By Salerno | May 8, 2022

1) Machine Learning with tidymodels

In this chapter, you’ll explore the rich ecosystem of R packages that power tidymodels and learn how they can streamline your machine learning workflows. You’ll then put your tidymodels skills to the test by predicting house sale prices in Seattle, Washington.

1.1) Tidymodels packages

tidymodels is a collection of machine learning packages designed to simplify the machine learning workflow in R.

In this exercise, you will assign each package within the tidymodels ecosystem to its corresponding process within the machine learning workflow.

  • Data resampling and feature engineering

rsample and recipes

  • Model fitting and tuning

tune, dials and parsnip

  • Model evaluation

yardstick
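
A quick way to confirm this grouping in your own session (a minimal sketch, assuming the tidymodels meta-package is installed):

# Attach the meta-package and list the packages it bundles
library(tidymodels)
tidymodels_packages()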

1.2) Creating training and test datasets

The rsample package is designed to create training and test datasets. Creating a test dataset is important for estimating how a trained model will likely perform on new data. It also guards against overfitting, where a model memorizes patterns that exist only in the training data and performs poorly on new data.

In this exercise, you will create training and test datasets from the home_sales data. This data contains information on homes sold in the Seattle, Washington area between 2015 and 2016.

The outcome variable in this data is selling_price.

The tidymodels package will be pre-loaded in every exercise in the course. The home_sales tibble has also been loaded for you.


home_sales <- readRDS("../../../datasets/tidymodels/home_sales.rds")

dim(home_sales)
## [1] 1492    8

colnames(home_sales)
## [1] "selling_price" "home_age"      "bedrooms"      "bathrooms"    
## [5] "sqft_living"   "sqft_lot"      "sqft_basement" "floors"

#Create a data split object

home_split <- rsample::initial_split(home_sales, prop = 0.7, strata = selling_price)

We have already created an rsample object called home_split that contains the instructions for randomly splitting the home_sales data into training and test datasets.

The split allocates 70% of the data to training and stratifies the sampling by selling_price.
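
Note (an addition, not part of the original exercise): initial_split() samples at random, so setting a seed beforehand makes the split reproducible.

# Make the random split reproducible
set.seed(2022)
home_split <- rsample::initial_split(home_sales, prop = 0.7, strata = selling_price)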


# Create the training data

home_training <- home_split %>%
  rsample::training()

We have created a training dataset from home_split called home_training.


home_test <- home_split %>%
  rsample::testing()

# Check number of rows in each dataset
nrow(home_training)
## [1] 1042
nrow(home_test)
## [1] 450

1.3) Distribution of outcome variable values

Stratifying by the outcome variable when generating training and test datasets ensures that the outcome variable values have a similar range in both datasets.

Since the original data is split at random, stratification avoids placing all the expensive homes in home_sales into the test dataset, for example. In this case, your model would most likely perform poorly because it was trained on less expensive homes.
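
As a point of comparison, here is a minimal sketch (assuming the same home_sales data is loaded) of a split without stratification; the selling_price summaries of the two sets can drift apart:

# Non-stratified split for comparison
set.seed(42)
home_split_unstrat <- rsample::initial_split(home_sales, prop = 0.7)

# Compare the selling_price distributions in each set
summary(rsample::training(home_split_unstrat)$selling_price)
summary(rsample::testing(home_split_unstrat)$selling_price)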

In this exercise, you will calculate summary statistics for the selling_price variable in the training and test datasets. The home_training and home_test tibbles have been loaded from the previous exercise.


# Distribution of selling_price in training data

home_training %>%
  dplyr::summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
## # A tibble: 1 x 4
##   min_sell_price max_sell_price mean_sell_price sd_sell_price
##            <dbl>          <dbl>           <dbl>         <dbl>
## 1         350000         650000         478908.        80294.

# Distribution of selling_price in test data
home_test %>% 
  dplyr::summarize(min_sell_price = min(selling_price),
            max_sell_price = max(selling_price),
            mean_sell_price = mean(selling_price),
            sd_sell_price = sd(selling_price))
## # A tibble: 1 x 4
##   min_sell_price max_sell_price mean_sell_price sd_sell_price
##            <dbl>          <dbl>           <dbl>         <dbl>
## 1         350000         650000         479493.        82631.

Excellent work! The minimum and maximum selling prices in both datasets are the same. The mean and standard deviation are also similar. Stratifying by the outcome variable ensures the model fitting process is performed on a representative sample of the original data.

1.4) Fitting a linear regression model

The parsnip package provides a unified syntax for the model fitting process in R.

With parsnip, it is easy to define models using the various packages, or engines, that exist in the R ecosystem.
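
For example, the same linear_reg() specification can target different engines (a sketch; the 'glmnet' engine also requires the glmnet package to be installed):

# Ordinary least squares via the stats::lm engine
parsnip::linear_reg() %>% 
  parsnip::set_engine('lm')

# Regularized regression via the glmnet engine (penalty is required by glmnet)
parsnip::linear_reg(penalty = 0.01) %>% 
  parsnip::set_engine('glmnet')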

In this exercise, you will define a parsnip linear regression object and train your model to predict selling_price using home_age and sqft_living as predictor variables from the home_sales data.

The home_training and home_test tibbles that you created in the previous lesson have been loaded into this session.


# Initialize a linear regression object, linear_model
linear_model <- parsnip::linear_reg() %>% 
  # Set the model engine
  parsnip::set_engine('lm') %>% 
  # Set the model mode
  parsnip::set_mode('regression')

# Train the model with the training data
lm_fit <- linear_model %>% 
  parsnip::fit(selling_price ~ home_age + sqft_living,
      data = home_training)

# Print lm_fit to view model information
lm_fit
## parsnip model object
## 
## 
## Call:
## stats::lm(formula = selling_price ~ home_age + sqft_living, data = data)
## 
## Coefficients:
## (Intercept)     home_age  sqft_living  
##    291234.9      -1422.1        102.9

class(lm_fit)
## [1] "_lm"       "model_fit"

Excellent work! You have defined your model with linear_reg() and trained it to predict selling_price using home_age and sqft_living. Printing a parsnip model fit object displays useful model information, such as the model formula used during training and the estimated model parameters.

1.5) Exploring estimated model parameters

In the previous exercise, you trained a linear regression model to predict selling_price using home_age and sqft_living as predictor variables.

Pass your trained model object into the appropriate function to explore the estimated model parameters and select the true statement.

Your trained model, lm_fit, has been loaded into your session.

library(broom)
## Warning: package 'broom' was built under R version 4.1.3
tidy(lm_fit)
## # A tibble: 3 x 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)  291235.   7505.       38.8  2.51e-204
## 2 home_age      -1422.    172.       -8.25 4.73e- 16
## 3 sqft_living     103.      2.72     37.9  5.08e-198

Great job! The tidy() function automatically creates a tibble of estimated model parameters. Since sqft_living has a positive estimated parameter, the selling price of homes increases with the square footage. Conversely, since home_age has a negative estimated parameter, older homes tend to have lower selling prices.
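
For more detail, the extra arguments of tidy() are passed through to the underlying lm method, and broom::glance() works on the extracted lm object (a sketch, assuming the broom package is loaded as above):

# Confidence intervals for the estimated parameters
tidy(lm_fit, conf.int = TRUE)

# Overall fit statistics from the underlying lm object
glance(lm_fit$fit)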

1.6) Predicting home selling prices

After fitting a model using the training data, the next step is to use it to make predictions on the test dataset. The test dataset acts as a new source of data for the model and will allow you to evaluate how well it performs.

Before you can evaluate model performance, you must add your predictions to the test dataset.

In this exercise, you will use your trained model, lm_fit, to predict selling_price in the home_test dataset.

Your trained model, lm_fit, as well as the test dataset, home_test, have been loaded into your session.


# Predict selling_price
home_predictions <- predict(lm_fit,
                        new_data = home_test)

# View predicted selling prices
home_predictions
## # A tibble: 450 x 1
##      .pred
##      <dbl>
##  1 381944.
##  2 473211.
##  3 507232.
##  4 428599.
##  5 460506.
##  6 383744.
##  7 578196.
##  8 462470.
##  9 470367.
## 10 427326.
## # ... with 440 more rows

# Combine test data with predictions
home_test_results <- home_test %>% 
  dplyr::select(selling_price, home_age, sqft_living) %>% 
  dplyr::bind_cols(home_predictions)

# View results
home_test_results
## # A tibble: 450 x 4
##    selling_price home_age sqft_living   .pred
##            <dbl>    <dbl>       <dbl>   <dbl>
##  1        411000       18        1130 381944.
##  2        425000       11        1920 473211.
##  3        495000        3        2140 507232.
##  4        381000       25        1680 428599.
##  5        450000       25        1990 460506.
##  6        355000       37        1410 383744.
##  7        590000       11        2940 578196.
##  8        362500       20        1940 462470.
##  9        429000       13        1920 470367.
## 10        408000       23        1640 427326.
## # ... with 440 more rows

Congratulations! You have trained a linear regression model and used it to predict the selling prices of homes in the test dataset! The model only used two predictor variables, but the predicted values in the .pred column seem reasonable!

1.7) Model performance metrics

Evaluating model results is an important step in the modeling process. Model evaluation should be done on the test dataset in order to see how well a model will generalize to new datasets.

In the previous exercise, you trained a linear regression model to predict selling_price using home_age and sqft_living as predictor variables. You then created the home_test_results tibble using your trained model on the home_test data.

In this exercise, you will calculate the RMSE and R squared metrics using your results in home_test_results.

The home_test_results tibble has been loaded into your session.


# Print home_test_results
home_test_results
## # A tibble: 450 x 4
##    selling_price home_age sqft_living   .pred
##            <dbl>    <dbl>       <dbl>   <dbl>
##  1        411000       18        1130 381944.
##  2        425000       11        1920 473211.
##  3        495000        3        2140 507232.
##  4        381000       25        1680 428599.
##  5        450000       25        1990 460506.
##  6        355000       37        1410 383744.
##  7        590000       11        2940 578196.
##  8        362500       20        1940 462470.
##  9        429000       13        1920 470367.
## 10        408000       23        1640 427326.
## # ... with 440 more rows

# Calculate the RMSE metric
home_test_results %>% 
  yardstick::rmse(truth = selling_price, estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard      48603.

# Calculate the R squared metric
home_test_results %>% 
  yardstick::rsq(truth = selling_price, estimate = .pred)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rsq     standard       0.655

Great job! The RMSE metric indicates that the average prediction error for home selling prices is about $48,600. Not bad considering you only used home_age and sqft_living as predictor variables!
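
As a sanity check, RMSE can also be computed by hand from home_test_results; the value should match the yardstick::rmse() output above:

# RMSE = square root of the mean squared prediction error
home_test_results %>% 
  dplyr::summarize(rmse_manual = sqrt(mean((selling_price - .pred)^2)))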

1.8) R squared plot

In the previous exercise, you got an R squared value of 0.655. The R squared metric ranges from 0 to 1, where 0 is the worst and 1 is the best.

Calculating the R squared value is only the first step in studying your model’s predictions.

Making an R squared plot is extremely important because it will uncover potential problems with your model, such as non-linear patterns or regions where your model is either over- or under-predicting the outcome variable.

In this exercise, you will create an R squared plot of your model’s performance.

The home_test_results tibble has been loaded into your session.

library(ggplot2)

# Create an R squared plot of model performance
ggplot2::ggplot(home_test_results, ggplot2::aes(x = selling_price, y = .pred)) +
  ggplot2::geom_point(alpha = 0.5) + 
  ggplot2::geom_abline(color = 'blue', linetype = 2) +
  tune::coord_obs_pred() +
  ggplot2::labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')

Good work! From the plot, you can see that your model tends to over-predict selling prices for homes that sold for less than $400,000, and under-predict for homes that sold for $600,000 or more. This indicates that you will have to add more predictors to your model or that linear regression may not be able to model the relationship as well as more advanced modeling techniques!
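
An optional residual plot (a sketch, not part of the exercise) highlights the same over- and under-prediction regions:

# Residuals versus predictions; points above zero are under-predictions
ggplot2::ggplot(home_test_results, ggplot2::aes(x = .pred, y = selling_price - .pred)) +
  ggplot2::geom_point(alpha = 0.5) +
  ggplot2::geom_hline(yintercept = 0, color = 'blue', linetype = 2) +
  ggplot2::labs(x = 'Predicted Selling Price', y = 'Residual (Actual - Predicted)')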

1.9) Complete model fitting process with last_fit()

In this exercise, you will train and evaluate the performance of a linear regression model that predicts selling_price using all the predictors available in the home_sales tibble.

This exercise will give you a chance to perform the entire model fitting process with tidymodels, from defining your model object to evaluating its performance on the test data.

Earlier in the chapter, you created an rsample object called home_split by passing the home_sales tibble into initial_split(). The home_split object contains the instructions for randomly splitting home_sales into training and test sets.

The home_sales tibble and home_split object have been loaded into this session.

library(parsnip)
## Warning: package 'parsnip' was built under R version 4.1.3
library(tune)
## Warning: package 'tune' was built under R version 4.1.3
# Define a linear regression model
linear_model <- linear_reg() %>% 
  set_engine('lm') %>% 
  set_mode('regression')

# Train linear_model with last_fit()
linear_fit <- linear_model %>% 
  last_fit(selling_price ~ ., split = home_split)

# Collect predictions and view results
predictions_df <- linear_fit %>% collect_predictions()
predictions_df
## # A tibble: 450 x 5
##    id                 .pred  .row selling_price .config             
##    <chr>              <dbl> <int>         <dbl> <chr>               
##  1 train/test split 398406.     3        411000 Preprocessor1_Model1
##  2 train/test split 439917.     9        425000 Preprocessor1_Model1
##  3 train/test split 479403.    12        495000 Preprocessor1_Model1
##  4 train/test split 392640.    18        381000 Preprocessor1_Model1
##  5 train/test split 479367.    21        450000 Preprocessor1_Model1
##  6 train/test split 379795.    28        355000 Preprocessor1_Model1
##  7 train/test split 587186.    39        590000 Preprocessor1_Model1
##  8 train/test split 443547.    42        362500 Preprocessor1_Model1
##  9 train/test split 481819.    47        429000 Preprocessor1_Model1
## 10 train/test split 425624.    52        408000 Preprocessor1_Model1
## # ... with 440 more rows

# Make an R squared plot using predictions_df
ggplot(predictions_df, aes(x = selling_price, y = .pred)) + 
  geom_point(alpha = 0.5) + 
  geom_abline(color = 'blue', linetype = 2) +
  coord_obs_pred() +
  labs(x = 'Actual Home Selling Price', y = 'Predicted Selling Price')

Great work! You have created your first machine learning pipeline and visualized the performance of your model. From the R squared plot, the model still tends to over-predict selling prices for homes that sold for less than $400,000 and under-predict for homes at $600,000 or more, but it is a slight improvement over your previous model with only two predictor variables.

2) Classification Models

Learn how to predict categorical outcomes by training classification models. Using the skills you’ve gained so far, you’ll predict the likelihood of customers canceling their service with a telecommunications company.

2.1) Data resampling

The first step in a machine learning project is to create training and test datasets for model fitting and evaluation. The test dataset provides an estimate of how your model will perform on new data and helps to guard against overfitting.

You will be working with the telecom_df dataset which contains information on customers of a telecommunications company. The outcome variable is canceled_service and it records whether a customer canceled their contract with the company. The predictor variables contain information about customers’ cell phone and internet usage as well as their contract type and monthly charges.

The telecom_df tibble has been loaded into your session.


telecom_df <- readRDS("../../../datasets/tidymodels/telecom_df.rds")

dim(telecom_df)
## [1] 975   9

colnames(telecom_df)
## [1] "canceled_service"    "cellular_service"    "avg_data_gb"        
## [4] "avg_call_mins"       "avg_intl_mins"       "internet_service"   
## [7] "contract"            "months_with_company" "monthly_charges"

# Create data split object
telecom_split <- initial_split(telecom_df, prop = 0.75,
                     strata = canceled_service)

# Create the training data
telecom_training <- telecom_split %>% 
  training()

# Create the test data
telecom_test <- telecom_split %>% 
  testing()

# Check the number of rows
nrow(telecom_training)
## [1] 731
nrow(telecom_test)
## [1] 244

Good job! You have 731 rows in your training data and 244 rows in your test dataset. Now you can begin the model fitting process using telecom_training.

2.2) Fitting a logistic regression model

In addition to regression models, the parsnip package also provides a general interface to classification models in R.

In this exercise, you will define a parsnip logistic regression object and train your model to predict canceled_service using avg_call_mins, avg_intl_mins, and monthly_charges as predictor variables from the telecom_df data.

The telecom_training and telecom_test tibbles that you created in the previous lesson have been loaded into this session.

library(parsnip)
# Specify a logistic regression model
logistic_model <- logistic_reg() %>% 
  # Set the engine
  set_engine('glm') %>% 
  # Set the mode
  set_mode('classification')

# Fit to training data
logistic_fit <- logistic_model %>% 
  fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
      data = telecom_training)

# Print model fit object
logistic_fit
## parsnip model object
## 
## 
## Call:  stats::glm(formula = canceled_service ~ avg_call_mins + avg_intl_mins + 
##     monthly_charges, family = stats::binomial, data = data)
## 
## Coefficients:
##     (Intercept)    avg_call_mins    avg_intl_mins  monthly_charges  
##         2.06878         -0.01088          0.02185          0.00257  
## 
## Degrees of Freedom: 730 Total (i.e. Null);  727 Residual
## Null Deviance:	    932.4 
## Residual Deviance: 802.2 	AIC: 810.2

Great job! You have defined your model with logistic_reg() and trained it to predict canceled_service using avg_call_mins, avg_intl_mins, and monthly_charges. Printing a parsnip model specification object displays useful model information, such as the model type, computational engine, and mode. Printing a model fit object will display the estimated model coefficients.

2.3) Combining test dataset results

Evaluating your model’s performance on the test dataset gives insights into how well your model predicts on new data sources. These insights will help you communicate your model’s value in solving problems or improving decision making.

Before you can calculate classification metrics such as sensitivity or specificity, you must create a results tibble with the required columns for yardstick metric functions.

In this exercise, you will use your trained model to predict the outcome variable in the telecom_test dataset and combine it with the true outcome values in the canceled_service column.

Your trained model, logistic_fit, and test dataset, telecom_test, have been loaded from the previous exercise.

library(yardstick)
## Warning: package 'yardstick' was built under R version 4.1.3
## For binary classification, the first factor level is assumed to be the event.
## Use the argument `event_level = "second"` to alter this as needed.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Predict outcome categories
class_preds <- predict(logistic_fit, new_data = telecom_test,
                       type = 'class')

# Obtain estimated probabilities for each outcome value
prob_preds <- predict(logistic_fit, new_data = telecom_test, 
                      type = 'prob')

# Combine test set results
telecom_results <- telecom_test %>% 
  select(canceled_service) %>% 
  bind_cols(class_preds, prob_preds)

# View results tibble
telecom_results
## # A tibble: 244 x 4
##    canceled_service .pred_class .pred_yes .pred_no
##    <fct>            <fct>           <dbl>    <dbl>
##  1 yes              yes             0.591    0.409
##  2 no               no              0.205    0.795
##  3 yes              yes             0.788    0.212
##  4 no               no              0.166    0.834
##  5 yes              yes             0.577    0.423
##  6 yes              no              0.168    0.832
##  7 no               no              0.407    0.593
##  8 yes              no              0.363    0.637
##  9 no               no              0.151    0.849
## 10 no               no              0.101    0.899
## # ... with 234 more rows

Good job! You have created a tibble of model results using the test dataset. Your results tibble contains all the necessary columns for calculating classification metrics. Next, you'll use this tibble and the yardstick package to evaluate your model's performance.

2.4) Evaluating performance with yardstick

In the previous exercise, you created a results tibble from your model's predictions on the test data. The yardstick package is designed to automate the calculation of classification metrics from these results.

For classification models, yardstick functions require a tibble of model results as the first argument. This should include the actual outcome values, predicted outcome values, and estimated probabilities for each value of the outcome variable.

In this exercise, you will use the results from your logistic regression model, telecom_results, to calculate performance metrics.

The telecom_results tibble has been loaded into your session.


# Calculate the confusion matrix
conf_mat(telecom_results, truth = canceled_service,
    estimate = .pred_class)
##           Truth
## Prediction yes  no
##        yes  31  16
##        no   51 146

# Calculate the accuracy
accuracy(telecom_results, canceled_service,
    estimate = .pred_class)
## # A tibble: 1 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.725

# Calculate the sensitivity
sens(telecom_results, canceled_service,
    .pred_class)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 sens    binary         0.378

# Calculate the specificity
spec(telecom_results, canceled_service, .pred_class)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 spec    binary         0.901

Excellent work! The specificity of your logistic regression model is 0.901, which is more than double the sensitivity of 0.378. This indicates that your model is much better at detecting customers who will not cancel their telecommunications service than the ones who will.
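
Both values can be verified by hand from the confusion matrix above (TP = 31, FN = 51, TN = 146, FP = 16):

# Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)
31 / (31 + 51)    # sensitivity, approximately 0.378
146 / (146 + 16)  # specificity, approximately 0.901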

2.5) Creating custom metric sets

The yardstick package also provides the ability to create custom sets of model metrics. In cases where the cost of false negative errors differs from the cost of false positive errors, it may be important to examine a specific set of performance metrics.

Instead of calculating accuracy, sensitivity, and specificity separately, you can create your own metric function that calculates all three at the same time.

In this exercise, you will use the results from your logistic regression model, telecom_results, to calculate a custom set of performance metrics. You will also use a confusion matrix to calculate all available binary classification metrics in tidymodels at once.

The telecom_results tibble has been loaded into your session.


# Create a custom metric function
telecom_metrics <- metric_set(accuracy, sens, spec)

# Calculate metrics using model results tibble
telecom_metrics(telecom_results, truth = canceled_service,
                estimate = .pred_class)
## # A tibble: 3 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.725
## 2 sens     binary         0.378
## 3 spec     binary         0.901

# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Pass to the summary() function
  summary()
## # A tibble: 13 x 3
##    .metric              .estimator .estimate
##    <chr>                <chr>          <dbl>
##  1 accuracy             binary         0.725
##  2 kap                  binary         0.312
##  3 sens                 binary         0.378
##  4 spec                 binary         0.901
##  5 ppv                  binary         0.660
##  6 npv                  binary         0.741
##  7 mcc                  binary         0.335
##  8 j_index              binary         0.279
##  9 bal_accuracy         binary         0.640
## 10 detection_prevalence binary         0.193
## 11 precision            binary         0.660
## 12 recall               binary         0.378
## 13 f_meas               binary         0.481

Nice work! You created a custom metric function to calculate accuracy, sensitivity, and specificity. Oftentimes, you will be interested in tracking certain performance metrics for a given modeling problem, but passing a confusion matrix to the summary() function will calculate all available binary classification metrics in tidymodels at once!

2.6) Plotting the confusion matrix

Calculating performance metrics with the yardstick package provides insight into how well a classification model is performing on the test dataset. Most yardstick functions return a single number that summarizes classification performance.

Many times, it is helpful to create visualizations of the confusion matrix to more easily communicate your results.

In this exercise, you will make a heat map and mosaic plot of the confusion matrix from your logistic regression model on the telecom_df dataset.

Your model results tibble, telecom_results, has been loaded into your session.


# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Create a heat map
  autoplot(type = 'heatmap')

# Create a confusion matrix
conf_mat(telecom_results,
         truth = canceled_service,
         estimate = .pred_class) %>% 
  # Create a mosaic plot
  autoplot(type = 'mosaic')

Great job! The mosaic plot clearly shows that your logistic regression model performs much better in terms of specificity than sensitivity. You can see that in the yes column, a large proportion of outcomes were incorrectly predicted as no.

2.7) ROC curves and area under the ROC curve

ROC curves are used to visualize the performance of a classification model across a range of probability thresholds. An ROC curve with the majority of points near the upper left corner of the plot indicates that a classification model is able to correctly predict both the positive and negative outcomes across a wide range of probability thresholds.

The area under this curve (ROC AUC) summarizes model performance as a single number between 0 and 1, which is often interpreted like a letter grade.
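
A rough letter-grade mapping that is often used (an illustrative assumption, not a yardstick function) looks like this:

# A >= 0.9, B >= 0.8, C >= 0.7, D >= 0.6, F below 0.6
auc_grade <- function(auc) {
  cut(auc, breaks = c(0, 0.6, 0.7, 0.8, 0.9, 1),
      labels = c('F', 'D', 'C', 'B', 'A'), include.lowest = TRUE)
}
auc_grade(0.736)  # returns "C"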

In this exercise, you will create an ROC curve from your logistic regression model results and calculate the area under the ROC curve with yardstick.

Your model results tibble, telecom_results, has been loaded into your session.


# Calculate metrics across thresholds
threshold_df <- telecom_results %>% 
  roc_curve(truth = canceled_service, .pred_yes)

# View results
threshold_df
## # A tibble: 246 x 3
##    .threshold specificity sensitivity
##         <dbl>       <dbl>       <dbl>
##  1  -Inf          0                 1
##  2     0.0176     0                 1
##  3     0.0304     0.00617           1
##  4     0.0384     0.0123            1
##  5     0.0436     0.0185            1
##  6     0.0444     0.0247            1
##  7     0.0474     0.0309            1
##  8     0.0528     0.0370            1
##  9     0.0554     0.0432            1
## 10     0.0605     0.0494            1
## # ... with 236 more rows

# Plot ROC curve
threshold_df %>% 
  autoplot()

# Calculate ROC AUC
roc_auc(telecom_results,
    truth = canceled_service, 
    .pred_yes)
## # A tibble: 1 x 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.736

Nice work! The area under the ROC curve is 0.736. This indicates that your model gets a C in terms of overall performance, mainly due to its low sensitivity.

2.8) Streamlining the modeling process

The last_fit() function is designed to streamline the modeling workflow in tidymodels. Instead of training your model on the training data and building a results tibble using the test data, last_fit() accomplishes this with one function.

In this exercise, you will train the same logistic regression model as you fit in the previous exercises, except with the last_fit() function.

Your data split object, telecom_split, and model specification, logistic_model, have been loaded into your session.

library(tune)
# Train model with last_fit()
telecom_last_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges,
           split = telecom_split)

# View test set metrics
telecom_last_fit %>% 
  collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.725 Preprocessor1_Model1
## 2 roc_auc  binary         0.736 Preprocessor1_Model1

Excellent work! Notice that you got the same area under the ROC curve as before, just with a lot less effort!

2.9) Collecting predictions and creating custom metrics

Using the last_fit() modeling workflow also saves time in collecting model predictions. Instead of manually creating a tibble of model results, there are helper functions that extract this information automatically.

In this exercise, you will use your trained model, telecom_last_fit, to create a tibble of model results on the test dataset as well as calculate custom performance metrics.

Your trained model, telecom_last_fit, has been loaded into this session.


# Collect predictions
last_fit_results <- telecom_last_fit %>% 
  collect_predictions()

# View results
last_fit_results
## # A tibble: 244 x 7
##    id              .pred_yes .pred_no  .row .pred_class canceled_service .config
##    <chr>               <dbl>    <dbl> <int> <fct>       <fct>            <chr>  
##  1 train/test spl~     0.591    0.409     1 yes         yes              Prepro~
##  2 train/test spl~     0.205    0.795     5 no          no               Prepro~
##  3 train/test spl~     0.788    0.212     7 yes         yes              Prepro~
##  4 train/test spl~     0.166    0.834    10 no          no               Prepro~
##  5 train/test spl~     0.577    0.423    11 yes         yes              Prepro~
##  6 train/test spl~     0.168    0.832    14 no          yes              Prepro~
##  7 train/test spl~     0.407    0.593    15 no          no               Prepro~
##  8 train/test spl~     0.363    0.637    17 no          yes              Prepro~
##  9 train/test spl~     0.151    0.849    21 no          no               Prepro~
## 10 train/test spl~     0.101    0.899    28 no          no               Prepro~
## # ... with 234 more rows

# Custom metrics function
last_fit_metrics <- metric_set(accuracy, sens,
                               spec, roc_auc)

# Calculate metrics
last_fit_metrics(last_fit_results,
                 truth = canceled_service,
                 estimate = .pred_class,
                 .pred_yes)
## # A tibble: 4 x 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.725
## 2 sens     binary         0.378
## 3 spec     binary         0.901
## 4 roc_auc  binary         0.736

Great job! You were able to train and evaluate your logistic regression model in half the time! Notice that all performance metrics match the results you obtained in previous exercises.

2.10) Complete modeling workflow

In this exercise, you will use the last_fit() function to train a logistic regression model and evaluate its performance on the test data by assessing the ROC curve and the area under the ROC curve.

Similar to previous exercises, you will predict canceled_service in the telecom_df data, but with an additional predictor variable to see if you can improve model performance.

The telecom_df tibble, telecom_split, and logistic_model objects from the previous exercises have been loaded into your workspace. The telecom_split object contains the instructions for randomly splitting the telecom_df tibble into training and test sets. The logistic_model object is a parsnip specification of a logistic regression model.


# Train a logistic regression model
logistic_fit <- logistic_model %>% 
  last_fit(canceled_service ~ avg_call_mins + avg_intl_mins + monthly_charges + months_with_company, 
           split = telecom_split)

# Collect metrics
logistic_fit %>% 
  collect_metrics()
## # A tibble: 2 x 4
##   .metric  .estimator .estimate .config             
##   <chr>    <chr>          <dbl> <chr>               
## 1 accuracy binary         0.783 Preprocessor1_Model1
## 2 roc_auc  binary         0.852 Preprocessor1_Model1

# Collect model predictions
logistic_fit %>% 
  collect_predictions() %>% 
  # Plot ROC curve
  roc_curve(truth = canceled_service, .pred_yes) %>% 
  autoplot()

Excellent work! The ROC curve shows that the logistic regression model performs better than a model that guesses at random (the dashed line in the plot). Adding the months_with_company predictor variable increased your area under the ROC curve from 0.736 in your previous model to 0.852!

3) Feature Engineering

Find out how to bake feature engineering pipelines with the recipes package. You’ll prepare numeric and categorical data to help machine learning algorithms optimize your predictions.

3.1) Exploring recipe objects

The first step in feature engineering is to specify a recipe object with the recipe() function and add data preprocessing steps with one or more step_*() functions. Storing all of this information in a single recipe object makes it easier to manage complex feature engineering pipelines and transform new data sources.

Use the R console to explore a recipe object named telecom_rec, which was specified using the telecom_training data from the previous chapter and the code below.

library(recipes)
## Warning: package 'recipes' was built under R version 4.1.3
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
## 
##     step
telecom_rec <- recipe(canceled_service ~ .,
                      data = telecom_df) %>% 
  step_log(avg_call_mins, base = 10)

summary(telecom_rec)
## # A tibble: 9 x 4
##   variable            type    role      source  
##   <chr>               <chr>   <chr>     <chr>   
## 1 cellular_service    nominal predictor original
## 2 avg_data_gb         numeric predictor original
## 3 avg_call_mins       numeric predictor original
## 4 avg_intl_mins       numeric predictor original
## 5 internet_service    nominal predictor original
## 6 contract            nominal predictor original
## 7 months_with_company numeric predictor original
## 8 monthly_charges     numeric predictor original
## 9 canceled_service    nominal outcome   original

Both telecom_training and telecom_rec have been loaded into your session.

How many numeric and nominal predictor variables are encoded in the telecom_rec object?

You got it! Based on the results from passing telecom_rec to the summary() function, you can see that 5 predictor variables were labeled as numeric and 3 as nominal by the recipe() function.

3.2) Creating recipe objects

In the previous chapter, you fit a logistic regression model using a subset of the predictor variables from the telecom_df data. This dataset contains information on customers of a telecommunications company and the goal is to predict whether they will cancel their service.

In this exercise, you will use the recipes package to apply a log transformation to the avg_call_mins and avg_intl_mins variables in the telecommunications data. This will reduce the range of these variables and potentially make their distributions more symmetric, which may increase the accuracy of your logistic regression model.


# Specify feature engineering recipe
telecom_log_rec <- recipe(canceled_service ~ ., 
                          data = telecom_training) %>%
  # Add log transformation step
  step_log(avg_call_mins, avg_intl_mins, base = 10)

# View variable roles and data types
telecom_log_rec %>%
  summary()
## # A tibble: 9 x 4
##   variable            type    role      source  
##   <chr>               <chr>   <chr>     <chr>   
## 1 cellular_service    nominal predictor original
## 2 avg_data_gb         numeric predictor original
## 3 avg_call_mins       numeric predictor original
## 4 avg_intl_mins       numeric predictor original
## 5 internet_service    nominal predictor original
## 6 contract            nominal predictor original
## 7 months_with_company numeric predictor original
## 8 monthly_charges     numeric predictor original
## 9 canceled_service    nominal outcome   original

Great job! You have created a recipe object that assigned variable roles and data types to the outcome and predictor variables in the telecom_training dataset. You also added instructions for applying a log transformation to the avg_call_mins and avg_intl_mins variables. Now it’s time to train your recipe and apply it to new data!

3.3) Training a recipe object

In the previous exercise, you created a recipe object with instructions to apply a log transformation to the avg_call_mins and avg_intl_mins predictor variables in the telecommunications data.

The next step in the feature engineering process is to train your recipe object using the training data. Then you will be able to apply your trained recipe to both the training and test datasets in order to prepare them for use in model fitting and model evaluation.

Your recipe object, telecom_log_rec, and the telecom_training and telecom_test datasets have been loaded into your session.


# Train the telecom_log_rec object
telecom_log_rec_prep <- telecom_log_rec %>% 
  prep(training = telecom_training)

# View results
telecom_log_rec_prep
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          8
## 
## Training data contained 731 data points and no missing data.
## 
## Operations:
## 
## Log transformation on avg_call_mins, avg_intl_mins [trained]


# Apply to training data
telecom_log_rec_prep %>% 
  bake(new_data = NULL)
## # A tibble: 731 x 9
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line            10.3           2.42          1.74 fiber_optic     
##  2 single_line             9.3           2.51          2.06 fiber_optic     
##  3 multiple_lines          9.4           2.49          2.17 fiber_optic     
##  4 multiple_lines         10.2           2.60          2.06 fiber_optic     
##  5 single_line             9.37          2.58          1.94 fiber_optic     
##  6 multiple_lines          4.11          2.57          1.81 digital         
##  7 multiple_lines         10.6           2.45          2.17 fiber_optic     
##  8 multiple_lines          5.17          2.53          2.08 digital         
##  9 single_line             8.67          1.97          2.12 fiber_optic     
## 10 multiple_lines          9.24          2.59          2.15 fiber_optic     
## # ... with 721 more rows, and 4 more variables: contract <fct>,
## #   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>

# Apply to test data
telecom_log_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 9
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line             7.78          2.70          2.10 fiber_optic     
##  2 multiple_lines          8.05          2.52          2.09 digital         
##  3 multiple_lines          8.01          2.72          1.99 fiber_optic     
##  4 multiple_lines          9.96          2.53          2.13 fiber_optic     
##  5 single_line             5.87          2.61          1.94 digital         
##  6 single_line             7.07          2.40          1.97 fiber_optic     
##  7 single_line             6.69          2.55          1.96 digital         
##  8 single_line             8.67          2.58          2.04 fiber_optic     
##  9 multiple_lines          7.86          2.58          2.21 digital         
## 10 single_line             7.31          2.39          2.08 digital         
## # ... with 234 more rows, and 4 more variables: contract <fct>,
## #   months_with_company <dbl>, monthly_charges <dbl>, canceled_service <fct>

Great work! You successfully trained your recipe to be able to transform new data sources and applied it to the training and test datasets. Notice that the avg_call_mins and avg_intl_mins variables have been log transformed in the test dataset!

3.4) Discovering correlated predictors

Correlated predictor variables provide redundant information and can negatively impact the model fitting process. When two variables are highly correlated, their values change linearly with each other and hence provide the same information to your machine learning algorithms. This phenomenon is known as multicollinearity.

Before beginning the model fitting process, it’s important to explore your dataset to uncover these relationships and remove them in your feature engineering steps.

In this exercise, you will explore the telecom_training dataset by creating a correlation matrix of all the numeric predictor variables.

The telecom_training data has been loaded into your session.

library(ggplot2)
telecom_training %>% 
  # Select numeric columns
  select_if(is.numeric) %>% 
  # Calculate correlation matrix
  cor()
##                     avg_data_gb avg_call_mins avg_intl_mins months_with_company
## avg_data_gb           1.0000000    0.20358690    0.13455547          0.40894052
## avg_call_mins         0.2035869    1.00000000    0.07123755          0.01822411
## avg_intl_mins         0.1345555    0.07123755    1.00000000          0.22130563
## months_with_company   0.4089405    0.01822411    0.22130563          1.00000000
## monthly_charges       0.9558920    0.20063607    0.13244094          0.43557378
##                     monthly_charges
## avg_data_gb               0.9558920
## avg_call_mins             0.2006361
## avg_intl_mins             0.1324409
## months_with_company       0.4355738
## monthly_charges           1.0000000

# Plot correlated predictors
ggplot(telecom_training, aes(x = avg_data_gb, y = monthly_charges)) + 
  # Add points
  geom_point()  + 
  # Add title
  labs(title = "Monthly Charges vs. Average Data Usage",
       y = 'Monthly Charges ($)', x = 'Average Data Usage (GB)') 

Great job! You explored the telecom_training data and discovered that monthly_charges and avg_data_gb have a correlation of 0.96. From the scatter plot, you can see that the more data customers use, the more they are charged every month. You will have to remove this redundant information with your feature engineering steps.

3.5) Removing correlated predictors with recipes

Removing correlated predictor variables from your training and test datasets is an important feature engineering step to ensure your model fitting runs as smoothly as possible.

Now that you have discovered that monthly_charges and avg_data_gb are highly correlated, you must add a correlation filter with step_corr() to your feature engineering pipeline for the telecommunications data.

In this exercise, you will create a recipe object that removes correlated predictors from the telecommunications data.

The telecom_training and telecom_test datasets have been loaded into your session.


# Specify a recipe object
telecom_cor_rec <- recipe(canceled_service ~ .,
                          data = telecom_training) %>% 
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8)

# Train the recipe
telecom_cor_rec_prep <- telecom_cor_rec %>% 
  prep(training = telecom_training)

# Apply to training data
telecom_cor_rec_prep %>% 
  bake(new_data = NULL)
## # A tibble: 731 x 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line            10.3            262            55 fiber_optic     
##  2 single_line             9.3            326           114 fiber_optic     
##  3 multiple_lines          9.4            312           147 fiber_optic     
##  4 multiple_lines         10.2            402           116 fiber_optic     
##  5 single_line             9.37           382            87 fiber_optic     
##  6 multiple_lines          4.11           371            64 digital         
##  7 multiple_lines         10.6            281           147 fiber_optic     
##  8 multiple_lines          5.17           341           119 digital         
##  9 single_line             8.67            93           131 fiber_optic     
## 10 multiple_lines          9.24           392           142 fiber_optic     
## # ... with 721 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>

# Apply to test data
telecom_cor_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line             7.78           497           127 fiber_optic     
##  2 multiple_lines          8.05           328           122 digital         
##  3 multiple_lines          8.01           525            97 fiber_optic     
##  4 multiple_lines          9.96           340           136 fiber_optic     
##  5 single_line             5.87           408            88 digital         
##  6 single_line             7.07           249            94 fiber_optic     
##  7 single_line             6.69           352            91 digital         
##  8 single_line             8.67           378           109 fiber_optic     
##  9 multiple_lines          7.86           378           164 digital         
## 10 single_line             7.31           246           119 digital         
## # ... with 234 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>

Excellent! You have trained your recipe to remove all correlated predictors that exceed the 0.8 correlation threshold. Notice that your recipe found the high correlation between monthly_charges and avg_data_gb in the training data and when applied to the telecom_test data, it removed the monthly_charges column.

3.6) Multiple feature engineering steps

The power of the recipes package is that you can include multiple preprocessing steps in a single recipe object. These steps will be carried out in the order they are entered with the step_*() functions.

In this exercise, you will build upon your feature engineering from the last exercise. In addition to removing correlated predictors, you will create a recipe object that also normalizes all numeric predictors in the telecommunications data.

The telecom_training and telecom_test datasets have been loaded into your session.


# Specify a recipe object
telecom_norm_rec <- recipe(canceled_service ~ .,
                          data = telecom_training) %>% 
  # Remove correlated variables
  step_corr(all_numeric(), threshold = 0.8) %>% 
  # Normalize numeric predictors
  step_normalize(all_numeric())

# Train the recipe
telecom_norm_rec_prep <- telecom_norm_rec %>% 
  prep(training = telecom_training)

# Apply to test data
telecom_norm_rec_prep %>% 
  bake(new_data = telecom_test)
## # A tibble: 244 x 8
##    cellular_service avg_data_gb avg_call_mins avg_intl_mins internet_service
##    <fct>                  <dbl>         <dbl>         <dbl> <fct>           
##  1 single_line           -0.263       1.90         0.579    fiber_optic     
##  2 multiple_lines        -0.119      -0.309        0.418    digital         
##  3 multiple_lines        -0.140       2.26        -0.385    fiber_optic     
##  4 multiple_lines         0.896      -0.152        0.868    fiber_optic     
##  5 single_line           -1.28        0.735       -0.674    digital         
##  6 single_line           -0.640      -1.34        -0.481    fiber_optic     
##  7 single_line           -0.842       0.00419     -0.577    digital         
##  8 single_line            0.210       0.343        0.000703 fiber_optic     
##  9 multiple_lines        -0.220       0.343        1.77     digital         
## 10 single_line           -0.512      -1.38         0.322    digital         
## # ... with 234 more rows, and 3 more variables: contract <fct>,
## #   months_with_company <dbl>, canceled_service <fct>

Great job! When you applied your trained recipe to the telecom_test data, it removed the monthly_charges column, due to its large correlation with avg_data_gb, and normalized the numeric predictor variables!
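
As a closing sketch (assuming telecom_norm_rec_prep, logistic_model, and telecom_test are still in your session), the baked datasets can feed straight back into the model fitting workflow from Chapter 2:

# Apply the trained recipe to both datasets
telecom_training_prep <- telecom_norm_rec_prep %>% bake(new_data = NULL)
telecom_test_prep <- telecom_norm_rec_prep %>% bake(new_data = telecom_test)

# Fit the logistic regression specification on the preprocessed training data
logistic_fit_prep <- logistic_model %>% 
  fit(canceled_service ~ ., data = telecom_training_prep)

# Evaluate accuracy on the preprocessed test data
predict(logistic_fit_prep, new_data = telecom_test_prep) %>% 
  bind_cols(telecom_test_prep %>% select(canceled_service)) %>% 
  accuracy(truth = canceled_service, estimate = .pred_class)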
