
Hyperparameter tuning and model stacking with caret and caretEnsemble tutorial

Welcome to the second post in this series on the caret package. In this post you'll learn how to tune the hyperparameters of your models to increase their performance and also combine all your models using caretEnsemble to create a super learner.

[Image from the original post: diagram of stacking models with caret and caretEnsemble]

Recap from the previous post


Welcome to the second post in this caret package miniseries. In the previous post we learnt how to split and pre-process our data and build and measure different regression and classification models. Today's post builds on these learnings and focuses on how we can try and improve our model performance even more by tuning hyperparameters and stacking or ensembling different models to form super learners.

 

If you've come from the previous post and have all the objects created already that's great. If not, you can run the code below to create everything you need. You can refer back to the first post to understand what each step is doing.

[Screenshot from the original post: the data splitting, resampling and model set-up code]
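As the original code is shown as a screenshot, here's a minimal sketch of the kind of setup it describes. The dataset (caret's built-in Sacramento data), the target column and the five method names are placeholders standing in for the objects built in the first post, not the originals.

```r
library(caret)

# Placeholder data so the sketch runs end-to-end - the original series
# builds its own Train and Test objects in part one
data(Sacramento)
Sacramento$price <- log10(Sacramento$price)

set.seed(42)
in_train   <- createDataPartition(Sacramento$price, p = 0.8, list = FALSE)
train_data <- Sacramento[in_train, ]
test_data  <- Sacramento[-in_train, ]

# Resampling scheme: 10 folds repeated 5 times
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5)

# Refit the five strongest models from the previous post
# (assumed here to be glmnet, svmRadial, ranger, cubist and xgbTree)
glmnet_fit <- train(price ~ ., data = train_data, method = "glmnet",
                    trControl = ctrl)
svm_fit    <- train(price ~ ., data = train_data, method = "svmRadial",
                    trControl = ctrl, preProcess = c("center", "scale"))
ranger_fit <- train(price ~ ., data = train_data, method = "ranger",
                    trControl = ctrl)
cubist_fit <- train(price ~ ., data = train_data, method = "cubist",
                    trControl = ctrl)
xgb_fit    <- train(price ~ ., data = train_data, method = "xgbTree",
                    trControl = ctrl)
```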

The above code sets up the Train and Test data, creates our resampling scheme (10 folds repeated 5 times) and then runs the 5 models that performed the best from the previous post. We'll now attempt to improve these even more with hyperparameter tuning and also create a 6th model which will be a neural network.

What are hyperparameters?

 

All 5 of the algorithms used have different hyperparameters that can be tuned to try and increase their performance. Hyperparameters are additional options (or parameters) that affect how the model trains; they aren't learnt from the data directly but are instead passed to the model by us. For example, if we wanted to run a k-nearest neighbour model we'd need to specify some value of k (the number of nearest neighbours to find and use) for the model. If we were building a random forest we'd need to tell the model how many different decision trees to include in the ensemble or how deep to build each tree.

Hyperparameters are often specified as a 'grid' of values where each column is the name of a different parameter and each row a value to try. This grid forms the hyperparameter space for the model i.e. it captures all the different possible values and combinations of hyperparameters that we might pass to the model. This grid is then 'searched' i.e. the different values are used to train models and then the best combination is picked, either by seeing which one led to the best model or on some other criteria e.g. a slightly worse performance but using fewer features.

There are a few different ways we can construct and search this grid. We can either specify our grid of hyperparameter values manually or have caret construct it for us. An 'exhaustive grid search' literally tries every single combination of values. This is very thorough but can also be very slow as we have to build a model (with 10 folds repeated 5 times) per combination. Another option is to search a random sample of values from the grid in the hope that a 'good enough' or close-to-optimal combination can be found much more quickly. More advanced techniques (not available in caret, but available in tidymodels) remember the model performance for each combination of hyperparameters and then use this knowledge to guide their search within the grid.

 

Hyperparameter tuning in caret - grid search

 

I'm going to use glmnet to demonstrate each of the different options as it trains very quickly and only requires 2 hyperparameters to be tuned. We'll see how we can have caret pick our hyperparameter values for us or specify them manually as well as conduct exhaustive and random searches. We’ll also see a clever technique caret has up its sleeve called ‘adaptive resampling’ that can speed up the search for the best hyperparameter combination. We'll see how we can use it to quickly train and improve a neural network and at the end we'll combine all our learnings by stacking our 6 models together to create a super learner!

The simplest way to tune hyperparameters in caret is to have it generate our grid for us and then perform an exhaustive search. We can do this easily by changing the value of the tuneLength option in train(). With tuneLength we tell caret how many values to generate per hyperparameter, e.g. tuneLength=5 creates 5 sensible values for each of the hyperparameters the model uses. These values are then cross-joined across each of the different hyperparameters to make our grid. The cross join guarantees that we test every possible combination of parameters e.g. if we have 5 values of A and 5 values of B, to get every possible combination of A x B we need a grid of 25 hyperparameter combinations.

 

For models that require a lot of hyperparameters (e.g. xgbTree) our grid can get very big, very quickly! Some modelling algorithms have some clever optimisation behind the scenes that allows them to test more hyperparameter combinations without having to train models for each of them which can speed things up. Generally though, to do an exhaustive search across lots of parameters can take a long time so it's worth knowing how many hyperparameters a model takes before setting tuneLength. We can see what hyperparameters our model requires by calling modelLookup() on it. Let's try it for glmnet:
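A quick sketch of that lookup:

```r
modelLookup("glmnet")
# returns glmnet's two tuneable parameters: alpha (the mixing percentage)
# and lambda (the regularisation parameter)
```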

We can see for glmnet we have parameters ‘alpha’ and ‘lambda’. You can read more about what these parameters do in the glmnet documentation here but, very roughly, lambda decides how much to penalise or shrink coefficient values by whilst alpha decides how much of a mix between ridge regression and lasso regression our model is. Ridge regression tries to shrink any large coefficients so that the contribution to the model is more evenly distributed across all variables rather than any one having a bigger impact than the others. Lasso takes a similar approach but unlike ridge regression it's possible for variables to get a coefficient of 0 which effectively removes that variable from the model completely. This mixing of lasso and ridge regression makes glmnet an elastic net model.

 

Let's go ahead and run a model with tuneLength=5. As glmnet has 2 parameters we'll end up with 5x5 combinations, so 25 total parameter combinations to try. Coupled with the fact we're also doing 10 folds repeated 5 times, that's 50 models x 25 hyperparameter combinations so 1,250 models to be trained!
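A sketch of that call, reusing the placeholder objects from the setup sketch above:

```r
set.seed(42)
glmnet_tune <- train(price ~ ., data = train_data, method = "glmnet",
                     trControl = ctrl,
                     tuneLength = 5)   # 5 alphas x 5 lambdas = 25 combinations
glmnet_tune
```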

We can see from the output that caret shows us the performance for each of our hyperparameter combinations and helpfully tells us which performed the best. We can extract these 'best' hyperparameter values by calling our object with $bestTune. Notice though that caret uses RMSE as its default performance metric for regression problems whereas we're more interested in MAE for this problem. We’ll see how we can change this shortly. We can see that there's quite a variation of MAE from ~0.12 up to ~0.4. We can pass our train object to ggplot to see how the different parameter combinations performed in a more visual way:
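For example:

```r
glmnet_tune$bestTune   # the winning alpha/lambda pair
ggplot(glmnet_tune)    # plot performance across the tuning grid
```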

[Screenshot from the original post: plots of model performance across the hyperparameter grid]

Looking at the charts it looks like smaller values of lambda (regularisation parameter) perform better whereas the mixing percentage (alpha) has less of an impact when lambda is small. The final or 'best model' that caret picks is the one that gets stored in our train object and becomes the one used for all subsequent operations e.g. if we were to call predict() to score some new data. We can call all of the helper functions we've used before such as getTrainPerf() or varImp() in exactly the same way too and it'll be the 'best' model that they use.

As mentioned, caret picked the model with the lowest RMSE as the best model, but we can specify a different measure of performance by passing 'metric =' to our train() function. Let's pass it MAE to use and see if that results in it picking a different final set of hyperparameters:
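Something like:

```r
set.seed(42)
glmnet_mae <- train(price ~ ., data = train_data, method = "glmnet",
                    trControl = ctrl, tuneLength = 5,
                    metric = "MAE")   # pick the best model on MAE instead of RMSE
glmnet_mae
```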

If we look at our new object (I've truncated the print out to save space), we can see that caret has indeed now used MAE to pick the best performing model. If we compare the chosen hyperparameters for the model using RMSE and the latest using MAE it looks like it’s the same set that performed best for each metric which is reassuring.

 

As well as letting caret choose the best model for us, we can also manually specify which hyperparameter combinations we want to use by calling the update() function. We do this by calling update() on our train object and passing it the values of the hyperparameters we want to use for the final model (we can specify any of the values that were tried as part of our search). For example, in our plots we saw that smaller values of lambda worked better but the mix had less of an impact. We might be happy to sacrifice some performance and use a larger alpha value as this should mean we remove more features from the model making it simpler.
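A sketch of that, with illustrative parameter values (pick a pair that actually appears in your own tuning results):

```r
# a larger alpha is closer to a pure lasso, so more coefficients shrink to zero
glmnet_manual <- update(glmnet_mae, param = list(alpha = 1, lambda = 0.01))
glmnet_manual
```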

If we call the object (again I've shortened the print out) we now see we get a message telling us that the best model was chosen manually.

Exhaustive grid searches with tuneLength are a great way to easily try lots of different combinations of hyperparameters but they can come at the cost of a longer training time. It can be very easy to accidentally train a very large number of models if the chosen algorithm has a lot of parameters. One way to speed up the process is to try a random search of a set number of combinations.

Random hyperparameter searches

A random search creates our grid of possible hyperparameters as before but then, rather than run through all combinations exhaustively, it just picks a set number at random to try. The idea is to get a similar performance to trying lots of hyperparameter combinations from across a large space but without having to try every single combination from our grid. Some models benefit from a full grid search due to other optimisations they perform at the same time (glmnet is actually one of these). You can see a list of these models here.

To perform a random grid search automatically in caret we need to change our trainControl() object to tell caret that our search is now to be 'random'. We then pass this to our train() function and now our tuneLength becomes the number of combinations we want to try rather than the number of values per parameter we want to combine into a grid:
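For example:

```r
random_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                            search = "random")

set.seed(42)
glmnet_random <- train(price ~ ., data = train_data, method = "glmnet",
                       trControl = random_ctrl, metric = "MAE",
                       tuneLength = 25)   # 25 random alpha/lambda combinations
glmnet_random
```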

We can see that the MAE of the best model for each search type is very close. We can also see that for our random search we got some combinations of hyperparameters that resulted in NaN for our R Squared value. This can happen when the model can't make a prediction with the specified parameters and so defaults to just using the average value of the target. This causes the R Squared calculation to fail as there is no variance in the predictions with which to calculate correlations.

Specifying hyperparameter grids manually

As well as caret picking hyperparameter values for us, we can also create our own grids for caret to use. As glmnet is a mixture of ridge regression and lasso regression we can take advantage of this fact to quickly run different families of linear regression models by tweaking the hyperparameters we pass to glmnet. For example, alpha=0 tells it to run a pure ridge regression model. If we couple this with lambda=0, i.e. no penalty, what we're actually running is a straightforward linear regression model. If we choose alpha=1 then we can also have a pure lasso model. We can then also try different values of alpha to run our elastic net whilst also trying different values of lambda.

 

To create our own hyperparameter grid we can use the expand.grid() function where we create a named vector of values for each parameter and then these are combined to create our grid. We then replace tuneLength in our train() function with the 'tuneGrid=' option. By default an exhaustive grid search will be performed on the grid. We saw that lower values of lambda seemed to perform better so we can factor this into the values we pass to caret:
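A sketch of a manual grid. The values are illustrative rather than the exact ones from the original screenshot, but the 11 x 11 layout gives the 121 combinations referred to below:

```r
glmnet_grid <- expand.grid(
  alpha  = c(0, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.85, 1),  # 0 = ridge, 1 = lasso
  lambda = c(0, 0.001, 0.0025, 0.005, 0.0075, 0.01,
             0.02, 0.03, 0.05, 0.075, 0.1))                         # 0 = no penalty

set.seed(42)
glmnet_gridsearch <- train(price ~ ., data = train_data, method = "glmnet",
                           trControl = ctrl, metric = "MAE",
                           tuneGrid = glmnet_grid)
ggplot(glmnet_gridsearch)
```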

[Screenshot from the original post: performance plot for the manually specified grid]

On the plot, our lambda value is called the 'regularisation parameter' and our alpha the 'mixing percentage'. We can see from the output that our lm model (regularisation parameter=0 and mixing percentage=0) performs pretty well but can be improved upon with a small lambda value of 0.01. Our pure lasso model (mixing percentage=1) just beats our pure ridge regression (mixing percentage=0) for that lambda value but the best model is a mix of the two (alpha=0.15).

We can also perform random searches when creating our own grids for caret in an attempt to speed up the process. If we wanted to perform a random grid search, rather than trying every value in our manually created grid, we can do this by using our random search trainControl and passing the number of random combinations to use via tuneLength:
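One simple way to sketch this idea is to sample 25 rows from our own grid and pass that sample as the tuneGrid (the original screenshot may have wired this up slightly differently):

```r
set.seed(42)
random_rows <- sample(nrow(glmnet_grid), 25)   # 25 of the 121 combinations

glmnet_randomgrid <- train(price ~ ., data = train_data, method = "glmnet",
                           trControl = random_ctrl, metric = "MAE",
                           tuneGrid = glmnet_grid[random_rows, ])
glmnet_randomgrid$bestTune
```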

It looks like we got lucky and the optimal hyperparameter combination was one of the ones selected in our random search. This meant we managed to get to the optimal combination from trying just 25 different combinations as opposed to 121 in our exhaustive grid search.

 

This is great for saving time but the worry with random searches is we might miss out on the best combination if we're unlucky. Ideally we want some sort of middle ground between the two approaches where we don't spend ages persevering with obviously bad combinations but in a way that still tests all the possible values to give us the best chance of finding that best combination. Thankfully caret has another trick up its sleeve to try and achieve just this!


Faster searching with adaptive resampling and 'racing' methods

 

The main downside of hyperparameter tuning is that it can take a long time. To find the best combination we have to build lots of models and assess their performance in a robust way which usually means lots of resampling. We also saw that some combinations of hyperparameters resulted in some pretty poor models. One way to speed up this process is to perform adaptive resampling also known as a ‘racing method’. You can see a video of Max explaining how it works in tidymodels here.

Adaptive resampling tries to save time by weeding out any obviously bad hyperparameter combinations early on in the resampling process rather than waiting until the end. In our case, we’re using 10 folds repeated 5 times so that’s 50 resamples. If we can see that a certain combination of hyperparameters results in a really bad model, compared to its peers, after say 10 resamples, it’s probably safe to assume it’s not going to magically improve and beat them all after the other 40 resamples. Even if it somehow did, the large spread in performance i.e. really bad in some resamples and really good in others would still be a bit of a worry so we might not want to use it anyway.

 

When we tell caret to use adaptive resampling it builds a model for each hyperparameter combination and then assesses them all on only a few resamples. It then performs an anova test on the results to see which combinations are performing significantly worse than the current best. These are then dropped from the process. The remaining models then get tested on more resamples before another anova test takes place with the significantly worse performers removed. This carries on until either there is only one, significantly better hyperparameter combination left or until all the resampling has been completed and the best model is selected as normal.

 

This way, rather than try every combination on every single resample, which can take a long time, we try every combination on only a few resamples and then only persist with the best ones. This can make the training process a lot faster! Let’s try it with glmnet by recreating our 25 random grid search combinations. To tell caret to perform adaptive resampling we have to specify that we want to use it in our 'trainControl':
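A sketch of the adaptive resampling control and the glmnet call. The adaptive settings shown are caret's documented options; the exact values used in the original post aren't reproduced here:

```r
adaptive_ctrl <- trainControl(
  method   = "adaptive_cv", number = 10, repeats = 5,
  adaptive = list(min = 5,           # resamples to run before the first cull
                  alpha = 0.05,      # significance level for dropping combinations
                  method = "gls",    # the linear-model based racing test
                  complete = TRUE),  # finish all resamples for the winner
  search   = "random")

set.seed(42)
glmnet_adaptive <- train(price ~ ., data = train_data, method = "glmnet",
                         trControl = adaptive_ctrl, metric = "MAE",
                         tuneLength = 25)
glmnet_adaptive$bestTune
```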

We can see that we got to our optimal hyperparameter combination in 81 seconds using adaptive resampling compared to 115 seconds when we ran it using an exhaustive grid search. That's roughly a 30% reduction in run time!

Case Study: Neural Networks

 

In the previous post we looked at 10 different modelling algorithms and how we can transform our data to better suit their needs. There was one notable exception in our list of models though: neural networks. Let's have a quick look at this now and put together all our learnings about data transformations and hyperparameter tuning to improve our model performance.

 

To start with, we need to add an extra line of code to our train() function. The ‘nnet’ package requires the extra option ‘linout = T’ in train() to tell caret that we want to use the neural network in a regression problem. Let's try running it on our data without any transformations or tunings to establish a baseline for performance:
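A sketch of the baseline call (trace = FALSE just suppresses nnet's iteration log):

```r
set.seed(42)
nnet_base <- train(price ~ ., data = train_data, method = "nnet",
                   trControl = ctrl, metric = "MAE",
                   linout = TRUE,    # linear output units for regression
                   trace  = FALSE)
getTrainPerf(nnet_base)
```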

Well that's pretty terrible! Neural networks really benefit from having their data normalised so let's try that. We can see from the tidymodels pre-processing guide that they can also benefit from having their data transformed to look more normal. We can impute the missing values and remove near-zero variance features here too. Let's try running it again but this time with transformed data:
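Something along these lines (the preProcess recipe is an assumption covering the transformations just mentioned):

```r
set.seed(42)
nnet_pp <- train(price ~ ., data = train_data, method = "nnet",
                 trControl = ctrl, metric = "MAE",
                 linout = TRUE, trace = FALSE,
                 # impute, drop near-zero variance columns, normalise and
                 # make skewed numeric features more symmetric
                 preProcess = c("medianImpute", "nzv", "center", "scale", "YeoJohnson"))
getTrainPerf(nnet_pp)
```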

Better but still not amazing. Remember we were getting a MAE of 0.12 with our regular linear regression. Neural networks can also benefit a lot from hyperparameter tuning. Let's reuse our adaptive resampling approach to speed up the search process and we can pass 'tuneLength=5' to have caret construct a grid of values and perform an exhaustive search for us:
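A sketch of that final run. Note the adaptive control here uses caret's default grid search rather than the random search from earlier, so tuneLength=5 gives a full 5 x 5 grid over nnet's size and decay parameters:

```r
adaptive_grid_ctrl <- trainControl(
  method   = "adaptive_cv", number = 10, repeats = 5,
  adaptive = list(min = 5, alpha = 0.05, method = "gls", complete = TRUE))

set.seed(42)
nnet_tuned <- train(price ~ ., data = train_data, method = "nnet",
                    trControl = adaptive_grid_ctrl, metric = "MAE",
                    linout = TRUE, trace = FALSE,
                    preProcess = c("medianImpute", "nzv", "center", "scale", "YeoJohnson"),
                    tuneLength = 5)   # 5 x 5 grid over size and decay
getTrainPerf(nnet_tuned)
```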

That's more like it! A MAE of ~0.09 compares favourably with the best performance of our other models so far. It also demonstrates the importance of correctly pre-processing our data and why it's often worth spending the time tuning model hyperparameters. In the next section we'll see how we can try and squeeze even more performance from our models by combining them into a stack.

Nested resampling when hyperparameter tuning

 

Before we move on to look at model stacking it's worth mentioning the idea of 'double' or 'nested' k-fold cross-validation. There's an example demonstrating the process from tidymodels here. We'll look at this in more detail in the next post on feature selection but you'll sometimes see people recommending it as an approach when tuning hyperparameters too. Essentially, the worry is that by tuning lots of different hyperparameter combinations on the same data that we are ultimately using to also measure our model performance (the resamples), we can end up with an overly optimistic view of our final model performance.

 

For example, say we try a model and run it with hundreds of different hyperparameter combinations. We test each of them on our resampled data and then pick the one that happens to perform the best. That ‘best’ performance is going to be a mix of the model genuinely having performed better and some (hopefully small) element of it having just got lucky that whatever parameters it tried happened to perform better on our resampled data. The chance of it ‘getting lucky’ reduces if we have a robust resampling scheme but also increases the more combinations we throw at it. Our Test set will still give us our final, unbiased estimate of the model performance but the worry is that potentially we could pick a ‘lucky’ best model from our hyperparameter tuning on our resampling and miss out on the ‘true’ best model.

 

One way round this is to perform nested cross-validation where the tuning takes place on some inner resamples before being tested on some outer resamples. This way, for the hyperparameter combination that is picked as the best one in the inner loop, we'll still have a true measure of its performance on the outer loop i.e. if it just got lucky on the inner loop this would show up as a drop in performance on the outer. The downside to this is that we have to build even more models! Most modelling packages don't include nested cross-validation for hyperparameter tuning by default and you can read a great answer on Stack Overflow from Max on why optimisation bias with hyperparameter tuning can be an issue, but often in reality it's not too big of one.

Model stacking with caretEnsemble

A popular way to try and squeeze an extra bit of performance out of a modelling process is to try combining or ‘ensembling’ different models into a sort of ‘super model’. We can do this using the caretEnsemble package and there’s a great tutorial from the author of the package here. Stacking models involves taking the predictions from already trained models and then passing those predictions to another model to use as its input features. The idea is that by combining the outputs of multiple models we can create something greater than the sum of its parts e.g. where one model might have done poorly another might do better (and vice versa) so by combining them the hope is that we get a better overall performance than either model could have managed individually.

The main thing caretEnsemble needs is for us to save our predictions on each of the resamples, as it is these estimates that are used to create the final ensemble. The other necessity is that the models are all trained on exactly the same resamples i.e. the indices of the held-out rows do not vary between runs. This way it ensures each model is trained and assessed on the same data. We can either specify our resampling indexes manually or caretEnsemble can helpfully perform the job for us. The package uses a caretList() object to define each of the candidate models we want to create and that will then feed into our final ensemble.

 

First let’s define the resampling method we want to use. We’ll use adaptive resampling to try and speed the process up with a random hyperparameter search. We’ll set savePredictions="final" to keep the resample predictions from the final model that caret chooses.
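A sketch of that control object. We'll add the shared resampling indices once the pre-processed training data exists a couple of steps below (alternatively, caretList() can create them for us):

```r
library(caretEnsemble)

list_ctrl <- trainControl(
  method   = "adaptive_cv", number = 10, repeats = 5,
  adaptive = list(min = 5, alpha = 0.05, method = "gls", complete = TRUE),
  search   = "random",
  savePredictions = "final")   # keep the resample predictions for stacking
```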

Annoyingly, caretEnsemble doesn't seem to support imputing missing values as part of the training process so we'll need to do that ourselves beforehand. Like in the first post we'll take a sample from Train to estimate our pre-processing parameters on and then apply these to train-minus-sample:
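A rough sketch of that two-step approach (the sample size and the preProcess recipe are illustrative):

```r
set.seed(42)
sample_rows <- sample(nrow(train_data), 200)             # illustrative sample size
predictors  <- setdiff(names(train_data), "price")

# estimate the pre-processing parameters on the sample only...
pp_estimates <- preProcess(train_data[sample_rows, predictors],
                           method = c("medianImpute", "nzv", "center", "scale", "YeoJohnson"))

# ...then apply them to the remaining Train rows and to Test
train_prepped       <- predict(pp_estimates, train_data[-sample_rows, predictors])
train_prepped$price <- train_data$price[-sample_rows]

test_prepped       <- predict(pp_estimates, test_data[, predictors])
test_prepped$price <- test_data$price
```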

Next up is our caretList() object. We have two options here. The first looks very similar to our train() specification where we simply pass a general set of instructions to run on all the modelling algorithms we want to try e.g. metric, preProcess, tuneLength, etc. and then the only change is that instead of 'method=' we specify a 'methodList=' which is where we tell it which models to run. Let's try calling this for our previous top 5 models (we'll leave out nnet for now due to its linout=T requirement):
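A sketch of the methodList approach, assuming the top five were glmnet, svmRadial, ranger, cubist and xgbTree. Here we also pin the resampling indices to the pre-processed training data so every model is guaranteed identical folds:

```r
set.seed(42)
list_ctrl$index <- createMultiFolds(train_prepped$price, k = 10, times = 5)

model_list <- caretList(
  price ~ ., data = train_prepped,
  trControl  = list_ctrl,
  metric     = "MAE",
  tuneLength = 10,   # 10 random hyperparameter combinations per model
  methodList = c("glmnet", "svmRadial", "ranger", "cubist", "xgbTree"))

model_list$glmnet
```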

If we call the object we get a print out for each model, just like if we'd run it in train(). I've only shown the output for glmnet as it gets rather long. What we've done is essentially build all our models exactly as if we'd created them all separately with their own train() calls, except by doing them as a caretList() we guarantee they were all built on the same resamples. The downside with caretList() is that we had to leave out nnet as it requires an extra option. We also have to use the same pre-processing and tuneLength options for all models.

 

If we want a bit more flexibility in how we call our models, we can specify them directly with 'tuneList='. This allows us to individually specify any options for each of our models e.g. different pre-processing schemes, any extra options and even how many hyperparameter combinations to consider. As we've already done our pre-processing in advance we'll just use it to include nnet, tell ranger to calculate variable importance and let some of the faster algorithms do more hyperparameter tuning.
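A sketch of the tuneList version using caretModelSpec(); the per-model tuneLength values are illustrative:

```r
set.seed(42)
model_list <- caretList(
  price ~ ., data = train_prepped,
  trControl = list_ctrl,
  metric    = "MAE",
  tuneList  = list(
    glmnet  = caretModelSpec(method = "glmnet",    tuneLength = 25),
    svm     = caretModelSpec(method = "svmRadial", tuneLength = 10),
    ranger  = caretModelSpec(method = "ranger",    tuneLength = 5,
                             importance = "permutation"),  # enables varImp()
    cubist  = caretModelSpec(method = "cubist",    tuneLength = 5),
    xgbTree = caretModelSpec(method = "xgbTree",   tuneLength = 5),
    nnet    = caretModelSpec(method = "nnet",      tuneLength = 10,
                             linout = TRUE, trace = FALSE)))
```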

[Screenshot from the original post: resampling results for the caretList models]

It looks like once again the cubist model performed best, closely followed by the neural network. Before we ensemble or stack our models we can check what the correlations between their predictions look like. This can be useful to do as it might influence how we combine our different models. When we stack models, each of the models' predictions becomes the input variables into a new model i.e. the super learner takes as input features the predictions from the previous models. Therefore if the predictions are all strongly correlated we might want to leave some out or stack them using an algorithm that doesn't mind correlated features. We can use the modelCor() function to check for correlation between our models:
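For example:

```r
modelCor(resamples(model_list))
```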

We can see that, unsurprisingly, there's a strong correlation between each of our models. The support vector machine and neural network have the lowest average correlation whereas the cubist model and xgbTree have the strongest.

We can now try a simple linear ensemble of our 6 individual models using the caretEnsemble() function. It takes our caretList object and runs a linear model on it to create our stacked model. We can tell it that we want to use MAE as the error metric to report on too:
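A sketch of the linear stack:

```r
set.seed(42)
glm_stack <- caretEnsemble(model_list, metric = "MAE")
glm_stack
summary(glm_stack)
```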

We can see from the print out that the 6 models were used in the stack and a generalised linear model was used to combine them. The performance was measured with the standard 25 bootstrap resamples that caret uses by default and from this our performance estimate for our stacked model is 0.088 MAE. From our summary() function we can see how this compares to each model individually. It looks like it's a slight improvement over our base cubist model which was our strongest performer until now.

If we wanted to use a different resampling scheme for assessing our stacked model we can do this by including a 'trControl=' with our caretEnsemble. The documentation recommends creating a different trainControl object for the ensembling so we'll do that:
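For example:

```r
stack_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                           savePredictions = "final")

set.seed(42)
glm_stack <- caretEnsemble(model_list, metric = "MAE", trControl = stack_ctrl)
summary(glm_stack)
```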

It looks like our simple linear stack still outperforms each of our individual models.

 

As well as simple linear combinations, we can use any of the algorithms in caret to create our stacked model. Using more complex models can result in overfitting when stacking but we'll give it a go. Since the cubist model has been performing well so far, let's try using it to create our stack. We can also see how it compares to a glmnet stack. Once again, we'll create a new trainControl object and use adaptive resampling to speed up our hyperparameter search:
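A sketch of the two non-linear stacks via caretStack() (the tuneLength values are illustrative):

```r
stack_adaptive_ctrl <- trainControl(
  method   = "adaptive_cv", number = 10, repeats = 5,
  adaptive = list(min = 5, alpha = 0.05, method = "gls", complete = TRUE),
  search   = "random",
  savePredictions = "final")

set.seed(42)
cubist_stack <- caretStack(model_list, method = "cubist", metric = "MAE",
                           tuneLength = 25, trControl = stack_adaptive_ctrl)

set.seed(42)
glmnet_stack <- caretStack(model_list, method = "glmnet", metric = "MAE",
                           tuneLength = 25, trControl = stack_adaptive_ctrl)
cubist_stack
```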

I've only included the print out here for the cubist stack as the glmnet one is very long with all the different hyperparameters tried. We can see that all the different models are around 0.08 MAE. Let's compare all of our final models to see which performed best. There don't seem to be as many helper functions for working with caretList objects so we have to work with each of the elements individually. We can just bind all the outputs into a tibble and sort it to make comparing the results easier:
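One way to sketch that comparison, assuming the stacked models expose their final train object via $ens_model (as caretStack objects do):

```r
library(dplyr)
library(purrr)

all_models <- c(model_list,
                list(glm_stack    = glm_stack$ens_model,
                     cubist_stack = cubist_stack$ens_model,
                     glmnet_stack = glmnet_stack$ens_model))

# one row of resampled performance per model, sorted by MAE
map_dfr(all_models, getTrainPerf, .id = "model") %>%
  as_tibble() %>%
  arrange(TrainMAE)
```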

It looks like the cubist ensemble of our top 6 models performs the best but it's pretty close across any of the top 4 models in the list. Once again, it's worth thinking about the context in which the model has been built. How much is the additional performance between the stacked model and the regular cubist model actually needed? We sacrifice interpretability by stacking our model and probably increase the time it takes to score new data. On the other hand, it looks like we've managed to squeeze some extra performance out of our modelling process which might be worth it in some cases.

Let's go ahead and score up our Test set to double check our model isn't overfitting:
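A sketch of scoring the Test set (unlist() is only there because some caretEnsemble versions return the predictions as a data frame rather than a vector):

```r
test_preds <- predict(cubist_stack, newdata = test_prepped)

# postResample() reports RMSE, R squared and MAE for the hold-out predictions
postResample(pred = unlist(test_preds), obs = test_prepped$price)
```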

It looks like our cubist stack model does indeed perform well on the Test set scoring 0.0802 MAE compared to 0.0849 on the resamples. Congratulations on successfully creating a super learner!

Next steps: feature selection

In the final post in this miniseries on caret we'll cover how to perform feature selection. Max has also written an entire book on the subject that you can read for free here. Although something you'd typically do earlier in the modelling workflow, I've left it until the end as to do feature selection well requires a good understanding of some of the core model building concepts e.g. data leakage, over-fitting and the importance of having a robust resampling scheme.

 

In the next post we'll see why it's so important to perform feature selection in a robust manner and look at how we can use nested cross-validation to do so. Hopefully by removing redundant features from our model we can improve its performance even more!
