
Feature selection with caret tutorial

In this post you'll learn how to perform feature selection as part of your modelling workflow using caret. We'll run through caret's selection by filter, recursive feature elimination (backwards selection), simulated annealing and genetic algorithms. We'll also see how we can use nested resampling schemes to get unbiased performance estimates of our final model.


Recap from previous posts


Welcome to the final post in this caret package miniseries. In the first post we saw how we can use caret to split and pre-process our data, create robust resampling schemes and build powerful models. In the second post we learnt how to tune our models' hyperparameters and stack models into super learners to squeeze out some extra performance.

 

Today's post is going to cover the different feature selection methods available in caret. Although feature selection is typically something you'd do before or during the model build process, I've left it until the end as it's important to have a solid understanding of how to build models and avoid data leakage before covering feature selection. Max Kuhn, the author of the caret package, also has a book on feature engineering and selection that is available to read here.

Let's go ahead and set up our Train and Test data and create our resampling scheme. As this post is mainly to demonstrate the different feature selection methods, I'll also be applying the same pre-processing (calculated using a separate sample) to all of Train to speed up the model training steps:
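In case you're following along without the earlier posts, here's a rough sketch of the kind of setup used. The exact splits, pre-processing and noise columns come from the previous posts, so treat the details below (the diamonds data from ggplot2, the 80/20 split, the size of the 'spend' sample and the four random noise columns) as assumptions:

library(caret)
library(ggplot2)

set.seed(42)
diamonds_df <- as.data.frame(diamonds)

# convert the ordered factors to plain factors so we get one dummy column per level
diamonds_df[] <- lapply(diamonds_df, function(col) if (is.ordered(col)) factor(col, ordered = FALSE) else col)

# add a few random noise columns so we can sanity-check the selection methods later
for (i in 1:4) diamonds_df[[paste0("random", i)]] <- rnorm(nrow(diamonds_df))

# dummy-encode the factor columns
dummies    <- dummyVars(price ~ ., data = diamonds_df)
diamonds_x <- as.data.frame(predict(dummies, newdata = diamonds_df))
names(diamonds_x) <- make.names(names(diamonds_x))   # make sure the dummy column names are syntactically valid
diamonds_x$price  <- diamonds_df$price

# hold back a small 'spend' sample to calculate pre-processing parameters on
spend_rows <- sample(nrow(diamonds_x), 5000)
spend      <- diamonds_x[spend_rows, ]
rest       <- diamonds_x[-spend_rows, ]

# Train/Test split on the remaining data
train_rows <- createDataPartition(rest$price, p = 0.8, list = FALSE)
train <- rest[train_rows, ]
test  <- rest[-train_rows, ]

# centre and scale the predictors using parameters estimated on the spend sample
pre_proc <- preProcess(spend[, setdiff(names(spend), "price")], method = c("center", "scale"))
train <- predict(pre_proc, train)
test  <- predict(pre_proc, test)

# resampling scheme: 10-fold cross-validation repeated 5 times
tr_control <- trainControl(method = "repeatedcv", number = 10, repeats = 5)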

What is feature selection?

Feature selection is the process of whittling down our set of input variables to a smaller set of (hopefully) the most useful features. Removing unnecessary features can speed up model training times and improve performance. An analogy might be to imagine revising for a test and only having to learn 10 topics rather than 100. Features might be removed because they have no or low value to the model – either because they're not relevant to the target or because their utility is already captured by other features in the data set (e.g. if they're highly correlated to other features). We might also want to remove features because it makes the upkeep and deployment of the model cheaper and faster, and we might be willing to trade off some predictive performance to achieve this.

We've actually already done some feature selection in the previous posts where we removed variables with zero or near zero variance on the basis that they were unlikely to be helpful to the final model. We also saw a way to remove highly correlated input variables as part of our pre-processing. Both of these techniques are a form of unsupervised feature selection as they don't use the target variable when deciding whether or not to keep features.
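As a quick reminder of what those unsupervised filters look like in caret, here's a small sketch that just inspects what they'd flag rather than re-running the steps from the earlier posts:

# flag any zero or near-zero variance columns
nearZeroVar(train, saveMetrics = TRUE)

# flag highly correlated predictors that could be dropped
cor_matrix <- cor(train[, setdiff(names(train), "price")])
findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)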

 

It’s also worth mentioning that a great form of unsupervised feature selection is domain knowledge and subject matter expertise. If you know from experience that some columns are likely to have data issues, be irrelevant or come from a source that won’t be available in the future then feel free to remove them.

"For a number of models, predictive performance is degraded as the number of uninformative predictors increases. Therefore, there is a genuine need to appropriately select predictors for modeling." - Feature Engineering and Selection: A Practical Approach for Predictive Models

Supervised feature selection in caret

 

The feature selection methods we'll be discussing today are all supervised methods as they all make use of the target column to assess which predictors or groups of predictors we want to keep or remove. Supervised feature selection methods are often more powerful as we only keep features that we know have a meaningful relationship with the target e.g. we know that keeping or removing them led to better model performance or we've performed statistical tests that confirm a relationship between the inputs and the target.

 

The danger with supervised methods is that if they're not done properly we risk adding significant bias to our modelling process. By being supervised, the different methods get to see the target and so we can introduce significant information leakage if we're not careful. The other major downside is that to perform supervised feature selection in the correct way can be very time consuming to the point where the increased processing time isn't worth the trade off in performance gains. Max raises this point in this video in a section reviewing the development requests for the tidymodels package.

There are three main types of feature selection available in caret:

  1. Intrinsic: these are models that perform feature selection automatically as part of their model build e.g. glmnet, anything tree based, cubist, etc. This is handy as it means we don’t need any extra steps in our workflow which can be quicker and features are selected precisely because they’re of benefit to our final model. The downside is we’re limited to only certain models and even then, models with intrinsic selection can still benefit from also being coupled with a process that removes useless or highly correlated features.

  2. Filter methods: these apply some sort of pre-screening test to variables, usually with reference to the target. For example, they might check the correlation between the feature and the target or perform some other statistical test. Filter methods are handy as they’re quick to run and easy to understand. The downside is they pick features based on criteria separate to the model build process i.e. just because the feature passes the filter test it doesn’t mean the model will actually benefit from having it included. The other challenge is often the features are considered individually meaning that important interactions between features aren’t picked up e.g. a feature might only be significant when it’s considered alongside another one. Looking at features individually can also mean high correlations between groups of features aren’t picked up on.

  3. Wrapper methods: these combine feature selection with the model build process and are probably the most powerful methods but also the most time consuming. Caret has 3 types of wrapper methods: recursive feature elimination (also known as backwards selection), simulated annealing and genetic algorithms.

We’ll run through each of these approaches in turn. For the filter and wrapper methods caret uses nested resampling to generate unbiased performance estimates of the process. This is where the model training, feature selection and hyperparameter tuning are all performed on an inner resample of data and then the outcome of this is assessed on an outer/external resample to give an accurate measure of performance. We'll go into this in more detail later on.

 

It’s also worth flagging that in this post I’m going to be making a lot more use of the Test set than I normally would. This is just for demonstration purposes to show how different feature selection processes can result in over-fitting/optimism bias. What we're not doing is using the Test set to compare the performance of different processes or select our final model. We’d use the resample performance estimates as normal if we were doing that.

How not to do feature selection

Feature selection done incorrectly can greatly skew the estimated performance of our model to make it appear like it's performing much better than it really is. This is particularly likely to happen when we're selecting features in a supervised manner i.e. with reference to the target. The most common way this happens is that feature selection is done prior to model training but using the same data that the subsequent model will be trained on. Such a process introduces data leakage as we've specifically picked variables that work with the target based on our data and then we're asking the model to test that relationship with the same data. Unsurprisingly it's going to think it can do a good job. As feature selection was also performed prior to resampling, we can't undo the optimism bias introduced after the fact.

 

To go back to our exam scenario, imagine that rather than having to learn general topics that would be useful for a range of exam questions, you got to see the actual test questions in advance. You could then learn the answers just to those questions and do very well in the exam but it wouldn’t be a fair measure of your mastery of the topic and your knowledge won’t generalise to anything beyond that one paper.

For example, the features picked by the filter method might actually be due to a spurious correlation rather than a meaningful relationship with the target. These are features where for our sample of data there is a relationship between the feature and the target but that relationship doesn't hold for future data and so including it could affect the generalisability of our model i.e. it'll do well on Train but poorly on Test. These features are often the hardest to identify and remove and are why we need a robust feature selection method either by running it on a separate sample or by incorporating it into our resampling scheme. There's a great website that plots spurious correlations and I now like to think of these variables as 'Nicolas Cage variables' in honour of how many of the examples make use of the number of films he's been in. There's also a good xkcd cartoon that shows the trap of coming up with rules that work perfectly for the data available and then immediately fail when they encounter new data.

The bias created with improper feature selection shows up most strongly in data where we have few rows but lots of columns, as the chance of getting spurious correlations between our features and the target is higher. Max has a real world example of this happening in his book. The introduction of bias can still happen in larger data too though. To demonstrate how not to do feature selection we can use a 100-row sample of our diamonds data set where we take the price column but replace all of the other features with randomly generated data:
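Something along these lines recreates the demo. The number of random columns and the use of glmnet here are assumptions for illustration; the results quoted below come from the original post's own setup:

set.seed(123)
n_rows  <- 100
n_feats <- 50

# 100 real price values paired with entirely random predictors
random_train <- as.data.frame(matrix(rnorm(n_rows * n_feats), nrow = n_rows))
names(random_train) <- paste0("random_", seq_len(n_feats))
random_train$price  <- sample(train$price, n_rows)

random_model <- train(price ~ .,
                      data = random_train,
                      method = "glmnet",
                      trControl = tr_control,
                      tuneLength = 5)
random_model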

If we look at our results we can see that the performance of the model (0.236 R Squared) isn't very good, which is what we'd hope to see given we're using randomly generated data. Let's pretend we don't know this though and assume it's because we've got too many redundant features.

 

From our varImp() we can see we've still got a ranking of which variables were the most important to our model. This should also serve as a warning to always consider variable importance alongside model performance i.e. there'll always be 'important' variables but that doesn't mean they're meaningfully important if the model itself is bad. As this is a demo of doing feature selection the wrong way, let's go ahead and use these scores to perform feature selection and only keep the top 20 most important variables. Let's try re-running our model with just the top 20 features to see if it improves:
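A sketch of that (deliberately wrong) step, reusing the hypothetical random_model from above:

# rank the variables by importance and keep the 'top' 20 - the wrong way!
importance <- varImp(random_model, scale = FALSE)$importance
top_20     <- rownames(importance)[order(importance$Overall, decreasing = TRUE)][1:20]

top_20_model <- train(price ~ .,
                      data = random_train[, c(top_20, "price")],
                      method = "glmnet",
                      trControl = tr_control,
                      tuneLength = 5)
top_20_model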

Our new model now claims to have an R Squared of 0.694 which is a big improvement. Before we congratulate ourselves on a job well done, however, let's see how our new model performs on the Test set:
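To mimic scoring on unseen data, the sketch below generates a fresh batch of random rows to stand in for the Test set (an assumption: the original post scores against its own Test split, but since the predictors are pure noise the effect is the same):

random_test <- as.data.frame(matrix(rnorm(n_rows * n_feats), nrow = n_rows))
names(random_test) <- paste0("random_", seq_len(n_feats))
random_test$price  <- sample(test$price, n_rows)

postResample(pred = predict(top_20_model, newdata = random_test),
             obs  = random_test$price)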

Ouch! That's a really bad R Squared. So what happened? Remember that variable importance in caret is calculated from the final model which learns using all the data in Train. So when we used the variable importance scores from the first model, what we did was pick 20 features that caret had identified as being related to the target after considering 100% of the Training data. It basically got to look at all the answers and then ranked the features that best helped it to score well on them.

 

When we ran our second model, using the same data that we got our variable importance scores from, it was therefore only using features that definitely had a strong relationship to the target (in Train at least). This relationship will also have been strong across every resample precisely because we picked features that were important for all of Train. This is why when the model came to mark its homework, it looked like it did well.

 

The problem is that we know the features were randomly generated so any correlation between them and the target is going to be spurious/down to bad luck rather than because it's found a genuine relationship that will generalise to new data. This meant that when we came to predict on our Test data, the variables previously found to work well on Train were indeed shown to be spurious by failing to work on Test and so our model performs poorly. This is also why the first model had a low R Squared. For each resample it would have also found spurious correlations that worked well on the Analysis set but failed to generalise on the Assessment set, which is why its final performance score was low. Unfortunately, when we took the important variables based on 100% of Train we removed the model's ability to perform really badly, as it could only learn from variables that, although spurious, worked reasonably well across all of the Training resamples.

Feature selection done incorrectly is one of the easiest ways to introduce significant bias into our model performance measurement. It also highlights the importance of having a separate Test set as if we'd just gone off the resampled performance estimates we'd think our model was doing well. Just as we saw in the first post when pre-processing our data, feature selection needs to be done either on a separate sample or as part of our resampling scheme to get unbiased performance estimates. Thankfully caret makes this easy to do.

Models with inbuilt feature selection

 

As discussed already, intrinsic methods of feature selection refer to any models that automatically perform feature selection as part of their training. Common examples of this are linear models with some form of regularisation (e.g. lasso, glmnet) and most tree-based models. You can see a list of models with inbuilt feature selection methods available in caret here.

A good tip from Max when considering feature selection is to start with a couple of intrinsic models to see what they yield. Max suggests using a linear model and a non-linear model as any differences in performance might suggest which family of model might be better for the problem at hand. We can then potentially try either the same model or another from the same family in a wrapper method. He also mentions that we shouldn't be surprised if different intrinsic methods pick different features as important to them e.g. a linear model might pick variables that have a linear relationship to the target whereas a tree based model might pick more features with non-linear trends and more interactions.

Let's run glmnet (linear model) and ranger (random forest) and compare which features get selected as important by them:
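Something like the following works (the tuneLength and the impurity-based importance for ranger are my assumptions):

glmnet_model <- train(price ~ ., data = train, method = "glmnet",
                      trControl = tr_control, tuneLength = 5)

ranger_model <- train(price ~ ., data = train, method = "ranger",
                      trControl = tr_control, tuneLength = 5,
                      importance = "impurity")  # needed for ranger to return importance scores

plot(varImp(glmnet_model), top = 20)
plot(varImp(ranger_model), top = 20)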

caret variable importance comparison.PNG

We can see that the top variables picked are the same for each model: carat, x, y and z. It looks like ranger ascribes most of the importance to just a few features whereas for glmnet it's a bit more evenly distributed. Glmnet possibly does a better job at weeding out the random noise features, whereas for ranger they appear above some genuine features, although without much importance assigned to them.

We've seen already that if we wanted to use these variable importances as a filter, e.g. run another ranger or glmnet model with a cut down list, we can't just re-run our model with the shorter list as this introduces bias to the process. One way round this is to 'spend some data' and run the inbuilt model we want to use as a filter on a separate sample like how we did to calculate our pre-processing parameters.

Feature selection using a separate sample

By taking a separate sample to run our feature selection process on, the hope is that we can identify important features from the sample that also work for our Training data but as the final model is built on the Train-minus-sample data we avoid introducing bias. For example, if we get unlucky and pick lots of spuriously correlated features from our sample they can still fail to work on our Train-minus-sample and so our model can still be identified as performing poorly.

 

Let's see how this works by running two glmnet models - one on a sample to identify important features and then a second to run on the rest of our Training data but with only the features identified as important by the first model:
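A sketch of that two-step approach, assuming the 'spend' sample from the setup code and a simple non-zero-importance cut-off for the filter:

# model 1: glmnet on the spend sample picks the features
filter_model <- train(price ~ ., data = spend, method = "glmnet",
                      trControl = tr_control, tuneLength = 5)

importance    <- varImp(filter_model, scale = FALSE)$importance
keep_features <- rownames(importance)[importance$Overall > 0]  # features glmnet kept

# model 2: glmnet on Train, using only the features picked by model 1
main_model <- train(price ~ ., data = train[, c(keep_features, "price")],
                    method = "glmnet", trControl = tr_control, tuneLength = 5)

postResample(predict(main_model, train), train$price)
postResample(predict(main_model, test), test$price)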

We can see that by performing feature selection on a separate sample we've avoided leaking any information about the target, as our model performs similarly on our Train and Test data. In this example we used two glmnet models but if we wanted to, we could try a different model with our filtered list. Let's try using a neural network for our second model:
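For example (linout = TRUE tells nnet to fit a regression rather than a classification output; the rest mirrors the previous model and is only a sketch):

nnet_model <- train(price ~ ., data = train[, c(keep_features, "price")],
                    method = "nnet",
                    trControl = tr_control,
                    tuneLength = 5,
                    linout = TRUE,
                    trace = FALSE)

postResample(predict(nnet_model, test), test$price)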

In theory there's nothing stopping us using a variety of different models with intrinsic feature selection methods to generate a smaller list of variables to try with subsequent models. One thing to note is that we probably want to pair similar families of models together e.g. use a linear model like glmnet to select features for subsequent linear models like lm or use a non-linear model like ranger to select features for subsequent non-linear models like nnet. For the best results Max recommends using the same model for feature selection and final training as this way we guarantee the features we pick will work best with our final model.

Running feature selection on a sample like this can be a good approach if you've got lots of data. The next set of approaches all require nested resampling schemes and so can take a long time to run.

Nested resampling schemes

We touched on nested resampling briefly in the second post on hyperparameter tuning and model stacking as it's sometimes used as a way to avoid optimisation bias when tuning hyperparameters. The worry is that by tuning parameters on the same resamples that we also measure the final performance of the model on, we might end up overstating its actual likely performance. As we've already seen, the chance of this happening when performing feature selection is much higher and so it's worth implementing a nested resampling scheme (or taking a separate sample) so we can still get unbiased estimates of our final model.

Below is a diagram that tries to explain how nested k-fold cross-validation works. I've included the percentages of the overall data set that are included in each step to give a better idea of how each of the splits occurs downstream of others. The below diagram uses a single held out data set for the outer/external Assessment set to test the final model performance on. For the model training and hyperparameter optimisation that happens in the internal resamples, 5-fold cross-validation is used. Usually for the external resamples we'd want more than just one Assessment set which is why nested resampling can take such a long time. The formula for calculating models trained is roughly: number of external resamples x internal resamples x number of hyperparameters. For our 10 external folds, 10 internal folds repeated 5 times and 25 hyperparameters we'd be building 12,500 models in total!

Nested cross fold validation.PNG

Nested k-fold cross-validation works by taking our Train data and splitting it into an Analysis and Assessment set. These form our outer/external resamples. The external Analysis set is then split into further internal Analysis and Assessment sets. The external Analysis set is also where we pick which features we want to pass to the model to try e.g. we filter them, we run a model and then drop the least important features, etc.

 

This shorter list of features is passed to the internal Analysis set where we then train our model and tune any hyperparameters. The model is assessed on the internal Assessment sets. Once the best model for the subset of features has been found on the internal Assessment sets, we measure its performance on the external Assessment set. This way we get an unbiased estimate of its performance because the external Assessment set was excluded from the feature selection and model training/tuning process.

 

You can have a look at how Max describes the steps of the process for the implementation of recursive feature elimination in caret here and Max has his own diagram of the steps from his book on feature selection here. The main takeaways are that it's needed to give us unbiased estimates of our model performance and it involves building lots of models, which is why sometimes it's quicker to just spend some data and run feature selection on a separate sample. Let's now see how we can perform feature selection in caret utilising nested cross-validation.


Selection by filter (SBF)

 

Selection by filter (sbf) pre-screens our input features before passing them to our model. You can read more about the theory in Max's book here. In caret's implementation, for classification problems it runs anova tests on the features and the target and for regression problems like ours it builds a generalized additive model (GAM). For both the anova test and the GAM model the resulting p-value of the test is used to filter for significance. We can specify whether we want all our features tested at once or whether to test them individually by changing the 'multivariate=F' option in sbfControl(). By default, features are tested one by one.

 

For each of the feature selection methods available, caret offers the option to combine them with a model of our choice or to use a pre-made model-plus-wrapper instead. We'll look at using a ready-made model initially but we'll focus on the more customisable option of running our own models combined with the selection by filter. The full list of ready-made models for the filter method can be found in the 'see also' section at the bottom of this page.

 

Running a model with a feature selection method in caret is very similar to what we've been doing with train() and trainControl() except now we call our wrapper method e.g. sbf() instead of train(). This tells caret we're using a feature selection method and we also have some extra control options to specify, which we do in our method-named control object e.g. sbfControl(). Let's go ahead and run a ready-made random forest with selection by filter to get an idea of how it all fits together:
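A sketch of what that looks like (the 5-fold, 5-repeat outer scheme matches the description below; everything else is kept deliberately minimal):

rf_sbf <- sbf(price ~ .,
              data = train,
              sbfControl = sbfControl(functions = rfSBF,   # ready-made random forest + filter
                                      method = "repeatedcv",
                                      number = 5,
                                      repeats = 5))
rf_sbf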

There are a few different bits to look at here. In the sbfControl() we tell caret we want to use one of its ready-made random forest model schemes which means we don't need to specify things like 'method=' or 'trControl=' when we come to build the model. The sbfControl() is also where we specify the outer/external resampling scheme. So in this case we told caret to use 5 folds and to repeat the process 5 times so we can get a good, robust measurement of how our filtered model is performing.

On the print out we can see that we're told 'Select By Filter' was used and our outer resampling scheme is confirmed for us. We then see the performance which is the average performance across the external resamples. We can see that 'On average, 14 variables were selected' with a minimum of 13 and a max of 15. This tells us that our filter was pretty consistent in whittling down our features. If this number bounced around a lot we might worry that our data is very noisy.

We can see which variables made it through our filter by calling our sbf object with 'optVariables'. The usual suspects of carat, x, y and z are in there which is good to see. We can also see which ones failed to make the cut which includes our 4 random variables which is also reassuring.

Let's try running our sbf with our own model now. For this we do need to specify options such as trControl, method and tuneLength. We can incorporate our sbfControl() into our sbf() function too. We change the 'rfSBF' (random forest selection by filter) to 'caretSBF' which means we can run any model available in caret:
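A sketch with glmnet as our own model (the inner trainControl and tuneLength values are assumptions):

glmnet_sbf <- sbf(price ~ .,
                  data = train,
                  method = "glmnet",
                  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5),
                  tuneLength = 5,
                  sbfControl = sbfControl(functions = caretSBF,   # lets us use any caret model
                                          method = "repeatedcv",
                                          number = 5,
                                          repeats = 5))
glmnet_sbf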

Let's score both our filtered models on Test just to confirm that our nested resampling scheme worked and stopped either model from overfitting:
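A quick check, using postResample as elsewhere in the series:

postResample(predict(rf_sbf, test), test$price)
postResample(predict(glmnet_sbf, test), test$price)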

Wrapper Methods

Whereas the filter method applied its own criteria, separate to the model build, to decide whether a variable goes into the model or not, a wrapper method typically ‘wraps’ the model building process into a broader approach that uses information directly from the trained model to try and optimise which features are selected. There are 3 wrapper methods available in caret: recursive feature elimination (also known as backwards selection), simulated annealing and genetic algorithms. All 3 wrapper methods employ the same nested cross validation scheme we used with the filter method to ensure we can still get unbiased performance estimates of our final model.

 

It’s also worth mentioning that rather than directly identifying the optimal subset of variables to pass to the final model, what we get from the wrapper methods are the optimal strategy of running feature selection for the final model. For example, recursive feature elimination tests different sized groups of features as inputs into the model to work out what the optimal number of features is. The resampling doesn’t pick which e.g. 10 features are the best and should be used by the final model but rather it says ‘use 10 features as selected by RFE for the final model’.

 

The same is true of simulated annealing and genetic algorithms. Simulated annealing uses the inner resampling to identify the optimum number of iterations to run simulated annealing for, and then the number that led to the best outer loop resampling performance is selected as the number to use for the final model. The features that got picked on the inner loops might be different to the ones picked by the final model so what our performance estimate tells us is more 'using 10 features via RFE on this data results in this performance' as opposed to 'using these 10 exact features leads to X performance'.

 

Due to the fact that they work with the model directly and are responsive to what features led to improved performance, wrapper methods are generally more powerful than filter methods at identifying optimal sets of features. However, this comes at a cost of time as a lot more models need to be built.

Recursive Feature Elimination (aka Backwards Selection)

Recursive feature elimination (RFE) works by building a model with all the features, measuring the model performance and calculating the variable importance for each feature. The least important features are then removed and the model is rebuilt with the smaller feature set and its performance is measured. This continues until only 1 feature remains or we hit the specified minimum number of features that must be included in the model. For each smaller subset of features we can see the model performance and then pick either the best performing model or our preferred trade off between performance and complexity (number of features). Max has a chapter on RFE in his book here.

 

It's worth noting that the default option in caret is that the variable importance scores/rankings are only calculated once at the start when all the features are present and not recalculated at each iteration after features are removed. This differs from how RFE in scikit-learn works, which recalculates variable importance/rankings at each iteration. If you want to recalculate them, you can change the "rerank=T" option in 'rfeControl' but Max mentions here that it generally performs slightly worse.

 

In terms of how we run this in caret, we call rfe() which tells caret we want to do recursive feature elimination. Just like for sbf() and train(), all our model build options are the same except that we specify 'rfeControl=' at the end, which determines our outer loop resampling scheme and the number of different feature set sizes we want to investigate. For example, we could tell caret to try removing 1 feature at a time and recomputing but this might take a while. Instead, we can tell it to explore different set values of feature sizes and then see which is best. Like with sbf(), caret also has some ready-made rfe() options which you can see here.

 

RFE can become quite slow to run as the entire inner/outer resampling has to be repeated for each different feature subset size we request. For example, in the below example we ask for 3 different sizes to be tested + 1 original model using all the features. We use 5 folds x 5 repeats for external resampling and 10 folds x 5 repeats for internal along with 25 hyperparameters. This means we're building somewhere in the region of 125,000 models in total, so we'll use parallel processing to try and speed things up:
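A sketch of the set-up. The glmnet model, the subset sizes of 10, 15 and 20 and the inner/outer schemes are assumptions chosen to line up with the numbers quoted above:

library(doParallel)
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

glmnet_rfe <- rfe(x = train[, setdiff(names(train), "price")],
                  y = train$price,
                  sizes = c(10, 15, 20),                          # feature subset sizes to test
                  rfeControl = rfeControl(functions = caretFuncs, # use train() inside RFE
                                          method = "repeatedcv",
                                          number = 5,
                                          repeats = 5),
                  method = "glmnet",
                  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5),
                  tuneLength = 5)

stopCluster(cl)
glmnet_rfe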

We can see that our model with 20 features looks like it performed the best. If we have a look at what features weren't included in this model we can see it's the randomly generated ones as well as clarity_VS1. It makes sense that removing the random noise features resulted in a better model.

A handy option for RFE is to tell caret to accept a slightly worse model than the optimum if it means we can make do with fewer features. For example, a minor drop in performance on significantly fewer features might be a worthwhile trade off as our model will be faster to score, easier to maintain and perhaps have less chance of overfitting. We can do this by overwriting caret’s ‘pickSizeTolerance’ function and saving it into a new object that we then pass to the model:
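A sketch of how that looks, assuming a 10% tolerance. pickSizeTolerance is caret's own helper; we just wrap it with our chosen tolerance inside a copy of caretFuncs:

tolerant_funcs <- caretFuncs
tolerant_funcs$selectSize <- function(x, metric, maximize) {
  pickSizeTolerance(x, metric = metric, tol = 10, maximize = maximize)  # accept within 10% of the best
}

glmnet_rfe_tol <- rfe(x = train[, setdiff(names(train), "price")],
                      y = train$price,
                      sizes = c(10, 15, 20),
                      rfeControl = rfeControl(functions = tolerant_funcs,
                                              method = "repeatedcv",
                                              number = 5,
                                              repeats = 5),
                      method = "glmnet",
                      trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5),
                      tuneLength = 5)
glmnet_rfe_tol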

Now caret selects the 10 feature model as the best as it's within the 10% tolerance we allowed compared to the best performing model. It’s worth noting that RFE only makes sense for models that calculate a good feature importance score which is why it tends to work really well with random forests. Remember from the first post for example how something like glmnet uses the absolute values of the model coefficients and so the importance scores are affected by the scale of the data. Remember as well how caret uses a standard filter method for a lot of models that don’t calculate their own feature importance scores e.g. something like a neural network.

 

The other thing to note about RFE is that it’s a ‘greedy search method’ so called as it only ever takes the best option available to it and never goes backwards to re-evaluate its choices. An example of RFE being greedy is that it always removes the least important features and never tries adding some back in once they've been removed. This can lead to suboptimal solutions as it's possible that something we took out early on might have become more significant later on once other features had been removed but RFE would never be able to test this. The next two feature selection methods are both non-greedy so in theory have a better chance of finding a better combination of features although this comes at the expense of much longer run times.

Simulated Annealing

Simulated annealing (SA) aims to mimic the process of heating and cooling metal to improve its strength. How this relates to feature selection is we pass our model a randomly generated subset of features and then try adding and removing a few features at random, computing the performance each time and hoping that eventually we converge on the optimal subset. For a full walk-through of the process check out the chapter on simulated annealing in Max's book.

 

The good news about simulated annealing is that as it works by randomly changing the feature subset it can work with models that have no inbuilt measure of feature importance. Another nice quality is that it's a non-greedy method. It’s happy to persist with a suboptimal subset of features or change a currently optimal subset in the hope that it leads to a better overall solution. It also avoids going down a rabbit hole of poor performance by stopping and restarting at the last known best point if it doesn’t see an improvement in performance after a certain number of iterations. There’s a nice video here demonstrating how tidymodels actually uses it to find optimal combinations of hyperparameters but we can use it for feature selection too.

Once again we use nested resampling to measure performance. The feature subset and any hyperparameter combos get evaluated in the inner resamples and then the result is scored on the held out external fold. The inner resamples are used to feed back to the SA process which feature subsets look like they'd make for a better model and to guide it in which features it keeps or removes. However the SA process can eventually start to overfit the internal resamples as over time it'll start to pick features that work well on the inner resamples but don't generalise well when assessed on the external resamples.

 

The external resamples allow us to get an unbiased estimate of the SA process's performance at each iteration and to identify the point at where overfitting starts to happen. Using these unbiased estimates of performance, the iteration that had the highest score on the external resamples is then picked as the best number of iterations to run SA for. The SA process is then run on all of the Training data (for that number of iterations only) and whatever feature set is present at that final iteration is used to create the final model.

 

Now for the bad news. Remember how RFE took a while to run as we needed to build models for each of the outer resamples x inner resamples x hyperparameters x number of feature subsets which made for a lot of models? Well now we have the same challenge but instead of 4 differently sized feature subsets we can easily have 50+ iterations of feature subsets to iterate over each with their own inner/outer/hyperparameters to train models for.

 

In terms of how SA is implemented in caret there are a few extra options to specify this time. Firstly, we need to change how we pass our data to the model. Up until now we've been using the formula interface of y ~ x but now we need to use the 'x=' and 'y=' form. We call our safs() function (simulated annealing feature selection) and we pass it all our normal train options and also the number of iterations we want our simulated annealing process to try. We then call 'safsControl=' to specify our outer loop resampling scheme, which assessment metrics caret should use for each resampling scheme and also how many iterations SA can run for without seeing an improvement before it restarts and tries another direction. Let's go ahead and run that with our glmnet model and see what happens:
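A sketch of the call (the 50 iterations, the improve value and the resampling numbers are assumptions chosen to match the results discussed below):

glmnet_sa <- safs(x = train[, setdiff(names(train), "price")],
                  y = train$price,
                  iters = 50,                                        # SA iterations to try
                  safsControl = safsControl(functions = caretSA,
                                            method = "cv",
                                            number = 5,              # outer resampling
                                            metric = c(internal = "RMSE", external = "RMSE"),
                                            maximize = c(internal = FALSE, external = FALSE),
                                            improve = 10),           # restart after 10 iterations with no improvement
                  method = "glmnet",
                  trControl = trainControl(method = "repeatedcv", number = 10, repeats = 5),
                  tuneLength = 5)
glmnet_sa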

It looks like 45 iterations were found to create the best performing feature subset which resulted in 15 features being selected for the final model. If we have a look at which variables were included at this stage we can see that we've actually got a randomly generated feature in there which is a bit surprising. We can also plot the Internal v External performance scores to see if/when they started to diverge:

caret simulated annealing.PNG

The scores actually look pretty similar which is good. It looks like our initial feature set wasn't very good but improved dramatically after 5 iterations and again around the 24th iteration. The fact it looks like it's continuing to improve without overfitting suggests we could try the process with even more iterations to hopefully get to an even better subset of features. Let's see how our Train and Test scores compare:

It looks like our estimated Train performance was actually a little pessimistic which could be due to only using 5 folds for the external resample. This is one of the challenges in trading off increased runtime vs confidence in our resampling performance estimates. This is particularly true for the next method which is probably the most powerful (and definitely the slowest) feature selection method that caret has to offer.

Genetic Algorithms

 

The genetic algorithm (GA) feature selection method in caret is probably the most powerful and also takes the longest to run. I’ve honestly never managed to use it in a real world setting as I’ll inevitably run out of time before it finishes. That said it contains a lot of cool ideas so it’s worth exploring. Like simulated annealing the ideas behind genetic algorithms can also be repurposed for hyperparameter selection. You can read about it in Max's book here.

Essentially it builds on the ideas of evolutionary biology where the strongest/best performing members from one generation get the chance to breed and pass their information down to the next generation. In relation to feature selection, we create and test a large number of different, random combinations of features and the ones that lead to the best models are then mixed and their 'offspring' form the next generation of feature subsets to be tested. The idea is that, over time, by mixing the best performing subsets from each generation we converge on the optimal feature subset for our model.

Again like SA, genetic algorithms don't rely on a measure of feature importance to rank variables but rather measure overall model performance and so they can be used with any modelling algorithm. It's also a non-greedy approach as although the best performing subsets are mixed each time there is an element of chance to which features are inherited and also the chance of a 'mutation' i.e. adding in a feature that might not actually have been in either parent feature subset.

 

To make sure our GA feature selection code actually finishes, we’ll run it on the sample we took earlier when testing the intrinsic methods. I'll also use a regular lm model so there aren't any hyperparameters to tune and use a much smaller population size than normal (usually 50+). I'll also cap our maximum number of generations much lower than normal (Max uses 200 in his example!). This example is more about demonstrating how to set the code up than trying to build a good model!

Like with SA, we use a nested resampling scheme. The inner resamples are used to assess the performance of our different feature subsets at each generation and the external Assessment set measures how the best model created at each generation performs on unseen data. The number of generations that led to the strongest performance on the external Assessment set is then picked as the number of generations to run GA for on the whole of Train. Let's go ahead and run our GA with lm:
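A sketch of the call, using the 'spend' sample and lm with deliberately small settings (the popSize and the number of generations here are assumptions to keep the run time manageable):

lm_ga <- gafs(x = spend[, setdiff(names(spend), "price")],
              y = spend$price,
              iters = 50,                                           # max generations (Max uses 200!)
              popSize = 10,                                         # much smaller population than the usual 50+
              gafsControl = gafsControl(functions = caretGA,
                                        method = "cv",
                                        number = 5,
                                        metric = c(internal = "RMSE", external = "RMSE"),
                                        maximize = c(internal = FALSE, external = FALSE)),
              method = "lm",
              trControl = trainControl(method = "cv", number = 10))
lm_ga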

We can see that our best performing model was picked at iteration 49 which included 17 of the 25 possible features. If we have a look at these features we can see again that some of the random noise features were included which is a bit surprising. Like with SA we can also plot the results of the internal v external resamples to see the change in performance between generations and also if any overfitting is occurring:

caret genetic algorithm.PNG

The scale makes it look like it's overfitting quite badly whereas the actual difference is still quite small. What it does show though is how powerful/aggressive the genetic algorithm is in finding an optimal feature subset that maximises performance on the internal resamples (the data it has access to during the training process). We can see an improvement in the internal score with nearly every new generation and although not quite the same, we can see a gradual trend of increasing accuracy on the external resamples over time too.

 

This shows why it is so important to have the nested cross validation scheme as it's easy to see how, if we left the genetic algorithm running even longer, we'd probably see an increased internal performance but at some point the external performance would plateau or start to get worse. Let's do one final check to make sure we aren't overfitting by trying our model on the Test set:
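A quick sketch of that final check:

postResample(predict(lm_ga, newdata = test), test$price)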

Conclusion

 

Congratulations! That covers off feature selection in caret and finishes this miniseries on the caret package. Hopefully you've found it useful and feel confident applying the different techniques and methods to your own work. Well done. If you'd like to learn other ways of building models in R check out this post on tidymodels.
