EASE tutorial for building a state-of-the-art recommender model in 6 lines of numpy
In this post we're going to see how we can build the EASE recommender model in Python. We'll learn some of the theory behind the model and how it can create state-of-the-art recommendations with just a few lines of numpy.

Contents
An Embarrassingly Shallow Autoencoder (EASE)
This tutorial covers what is probably my favourite recommender algorithm, the Embarrassingly Shallow Autoencoder (EASE) model. Apart from having one of the best names for an algorithm (passive aggressive is still my favourite), EASE manages to give state-of-the-art performance, needs only a few lines of numpy to implement and has only 1 hyperparameter to tune. It's also got a few other useful attributes that we'll see later.
An autoencoder is a type of neural network typically used for unsupervised learning. It aims to learn a compressed representation (encoding) of the input data and then reconstruct the original input (decoding) from this representation. The "bottleneck" layer in the middle holds the compressed representation.
Although "autoencoder" is in the name, EASE is not actually a deep learning model, it has no hidden layers in the traditional sense, hence the 'embarrassingly shallow' part of its name. The 'autoencoder' comes from how it follows the idea of reconstructing its input data (in our case a user-item interaction matrix) through a learned representation (an item-item similarity matrix).
EASE is a collaborative filtering model so it works on the principle of "wisdom of the crowd", analysing user and item interactions to learn from and create recommendations e.g. "people who liked X also liked Y". Collaborative filtering models analyse the user-item interaction history directly, agnostic to who or what those items actually are. This is in contrast to content based models that focus on the specific attributes of the items or users themselves e.g. the genre of a movie or the demographic details of users (e.g., age, gender).
The original 2019 paper, 'Embarrassingly Shallow Autoencoders for Sparse Data', by Harald Steck describes EASE as a "linear model that is geared toward sparse data, in particular implicit feedback data for recommender systems." Let's run through what each of these attributes means. EASE is a linear model as it only needs linear operations to generate its recommendations. We'll see exactly how this works in the next section.
The 'sparse data' refers to the user-item interaction matrix that the model learns from. It's a matrix where rows represent users, columns represent items and the entries indicate whether a user has interacted with an item (e.g., purchased, rated, viewed). Typically, this matrix is very sparse, meaning most entries are zero/missing as most users won't have interacted with most items. It's not unusual for over 99% of the possible user-item interactions in one of these matrices to be missing.
The implicit feedback means it learns from feedback that is derived indirectly from user behaviour and interactions, without the user explicitly stating their preferences. For example, we might take the fact a user clicked on a product or watched a video as them implicitly showing a preference for it, rather than them explicitly telling us so by leaving a 5 star review for it. Implicit feedback is inferred from actions rather than direct statements. It was actually Harald Steck's 2010 paper 'Training and testing of recommender systems on data missing not at random' that highlighted the advantages of using implicit data with the insight that users often don't interact with items they dislike. This allows us to use the absence of interactions to infer user preferences i.e. we can infer what they do like from what they have interacted with but also what they dislike from what they have avoided.
If we solely rely on users explicitly telling us they like or don't like something, then if a user hasn't given a rating for a product we can't infer any further information, which severely limits our ability to learn what they dislike. In the paper Steck shows that the "absence of ratings carries useful information for improving the top-k hit rate concerning all items" which gives implicit feedback an edge when it comes to best capturing user preferences. On top of using implicit feedback, EASE also works best when the data is encoded in a binary fashion i.e. 1/0 to represent if a user did/didn't interact with an item.
Data Preparation
Let's go ahead and read in our data and prep it ready for modelling then we can create our first EASE model to understand more about how it works under the hood. We'll be using the fashion recommendation data from this H&M Kaggle competition. The data is very large so we'll only keep the columns we need. Some of the ID columns are also stored as very long hashed string values that take up a lot of memory so we'll replace them with integer mappings with sklearn. One of the main challenges with EASE is that it's a memory intensive model so anything we can do upfront to decrease the size of our training data is helpful.
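As a rough sketch of what this prep might look like (the file and column names - transactions_train.csv, customer_id, article_id - are from the Kaggle competition; the userID/itemID names are what we'll use from here on):
Python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Load only the columns we need from the (very large) transactions file
orders = pd.read_csv(
    "transactions_train.csv",
    usecols=["t_dat", "customer_id", "article_id"],
    parse_dates=["t_dat"],
)

# Swap the long hashed string IDs for compact integer codes to save memory
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()
orders["userID"] = user_encoder.fit_transform(orders["customer_id"])
orders["itemID"] = item_encoder.fit_transform(orders["article_id"])
orders = orders[["userID", "itemID", "t_dat"]]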
Above is a print out of our order data. We can see that we've got UserID, itemID and t_dat which is our date column. EASE only needs the userID and itemID to train from as it doesn't care about the order or sequence in which customers interacted with items. We'll use the t_dat column to split our data into a training, validation and test set later.
First we'll do a bit more data prep by removing any repeat purchases of items by customers as usually for recommendations we want to recommend new items to customers rather than ones they've bought before. EASE also uses a binary interaction matrix i.e. 1/0s so it doesn't care if a customer has bought something more than once. We'll also filter to only use the latest 18 months of data in the table to arbitrarily reduce the size of it whilst giving us a long enough time period to train and predict on.
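A minimal sketch of both steps, assuming the orders frame from above:
Python
# Keep one row per user-item pair: EASE's binary matrix doesn't care about repeats
orders = orders.drop_duplicates(subset=["userID", "itemID"])

# Keep only the most recent 18 months of transactions
cutoff = orders["t_dat"].max() - pd.DateOffset(months=18)
orders = orders[orders["t_dat"] >= cutoff]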
The original competition supplied over 2 years' worth of data and had a prediction window of just one week. For this tutorial we'll use a 6 month training window with a 6 month prediction window. This is to give most customers who are in our prediction set the chance to actually shop again and so we can measure our performance on lots of customers rather than the handful that shop in any given week. In a real world setting we'd likely choose a shorter window based on our retraining and prediction schedule i.e. retrain weekly to predict the next 7 days.
We'll actually take a chronological train, validation and test split. We'll train our initial model on the oldest 6 months of data and tune the hyperparameters to get the prediction on the next 6 months as our validation data. Finally we'll retrain the model with the best hyperparameters on our validation period before predicting on the last 6 months of data as our test period.
As EASE can't handle cold-start users or items (those that are in the prediction period but not in the training period) we'll need to remove them from our prediction sets. As we later want to use the validation set as our training data, we actually need two separate copies of the validation period - one with cold-start customers and items removed when it's being used as our validation set and one where they're included when it's the training period for our test set. To make things a bit clearer I've given the splits below the following names that correspond to how they get used in the workflow (a sketch of the split logic follows the list).
- train_val = the oldest 6 months of the data. It provides the training data for the train-validation process.
- valid = the 6 months after train_val that are its prediction period. It will have customers and items not in train_val removed.
- train = the same 6 months as valid but with all customers and items included. It is the data we'll train our final model on once we've found the best hyperparameters and then create predictions for the test set.
- test = the latest 6 months of data available. Will be used to measure how well the model performs. Will have customers and items that aren't in the train set removed.
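Here's a sketch of how those splits could be built with pandas (the boundary logic is simplified - in practice the cold-start removal happens after the minimum-interaction filtering we'll apply shortly):
Python
end = orders["t_dat"].max()

# Three consecutive 6-month windows, oldest first
train_val = orders[orders["t_dat"] < end - pd.DateOffset(months=12)]
train = orders[
    (orders["t_dat"] >= end - pd.DateOffset(months=12))
    & (orders["t_dat"] < end - pd.DateOffset(months=6))
]
test = orders[orders["t_dat"] >= end - pd.DateOffset(months=6)]

# valid = the train window with cold-start users/items (unseen in train_val) removed
valid = train[
    train["userID"].isin(train_val["userID"])
    & train["itemID"].isin(train_val["itemID"])
]

# test keeps only users/items that appear in its own training data (train)
test = test[
    test["userID"].isin(train["userID"]) & test["itemID"].isin(train["itemID"])
]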
We'll also drop the original orders data to free up some more memory.
Even with restricting ourselves to just 6 months of training data, we still have nearly 8 million rows of data! To save our time and CPUs we can apply an additional filter that removes infrequent customers or incredibly niche products that customers are unlikely to buy. Again you might not want to do this in a real-world setting as you lose coverage of what you can recommend and who to, but interestingly it's quite a common practice in the literature on recommenders. In a review of different recommender papers, Sun et al. (2020) found that over half the papers they analysed employed filtering of items and users for a minimum number of interactions. More with an aim of shrinking the data to a practical size than for any theoretical benefit, I'm applying a minimum filter of 20+ interactions for items and users in the data set.
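A hypothetical helper for this step (iterating because dropping sparse users can push items back under the threshold, and vice versa):
Python
def filter_min_interactions(df, min_n=20):
    """Iteratively drop users and items with fewer than min_n interactions."""
    while True:
        user_counts = df.groupby("userID")["itemID"].transform("size")
        item_counts = df.groupby("itemID")["userID"].transform("size")
        mask = (user_counts >= min_n) & (item_counts >= min_n)
        if mask.all():
            return df
        df = df[mask]

train_val = filter_min_interactions(train_val)
train = filter_min_interactions(train)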
We can see after filtering that we still have ~3.5 million rows in our train_val data and ~1.8 million in train. Now we know which customers and items have passed our minimum interaction requirement in our training data, we can go about removing our cold-start items and users from our prediction periods. We'll also run some quick stats on the different periods to check the dates, number of users and items.
We can see that the start and end dates between the periods are consecutive but don't overlap. It looks like we're missing a few days in our valid/train period which is likely due to stores being shut over the Christmas period. Interestingly, despite using the same 6 months and 20 interaction filter for train_val and train, we have a lot fewer customers and rows in the train period which suggests customers don't buy as many clothes over winter as they do in the summer!
Now we've got our data prepped and filtered, we can convert our pandas data frames into the csr_matrices that EASE works with. To do this I'm actually going to borrow the helper function from LightFM, another recommender package that you can read more about here. We'll use lightfm to convert our user and item IDs into integer mappings that can then be used to create our csr_matrix. We'll store the mappings and their inverse so we can get back to our original data.
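A sketch of the conversion, assuming lightfm's Dataset class and the frames from earlier:
Python
from lightfm.data import Dataset

# Fit on every user and item we've ever seen so indices stay consistent across matrices
dataset = Dataset()
dataset.fit(users=orders["userID"].unique(), items=orders["itemID"].unique())

train_val_matrix, _ = dataset.build_interactions(zip(train_val["userID"], train_val["itemID"]))
valid_matrix, _ = dataset.build_interactions(zip(valid["userID"], valid["itemID"]))
train_val_matrix, valid_matrix = train_val_matrix.tocsr(), valid_matrix.tocsr()

# Keep the mappings and their inverses so we can decode indices back to real IDs
user_map, _, item_map, _ = dataset.mapping()
inv_user_map = {v: k for k, v in user_map.items()}
inv_item_map = {v: k for k, v in item_map.items()}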
You might notice that although we had different numbers of customers in train_val and valid, their csr equivalents are actually the same shape. This is because lightfm needs to create rows and columns for every user and item it has ever seen to preserve the unique index mapping i.e. it knows that row 0 always refers to user 0 across all matrices it creates. It simply leaves those values 0/blank if the mapping doesn't find its respective user or item in the data that we pass it.
We'll also define a 'precision @ k' scoring function to help us assess how well our recommendations are performing. Precision at k measures, for each customer, how many of the k predictions we made the customer actually went on to interact with. It's a metric commonly used in information retrieval and recommendation systems to evaluate the accuracy of a ranked list of predictions.
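One possible implementation (the exact helper from the post isn't shown here, so treat the dict-based inputs as an assumption):
Python
import numpy as np

def precision_at_k(recommended, actual, k=10):
    """Mean precision@k across users.

    recommended: dict of user -> ranked list of item indices
    actual: dict of user -> set of item indices interacted with in the test period
    """
    scores = [
        len(set(recs[:k]) & actual[user]) / k
        for user, recs in recommended.items()
        if user in actual
    ]
    return float(np.mean(scores))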
EASE in a few lines of numpy
Now we've got our data prepped, we can build our first model.
The core of EASE lies in creating an item-item similarity matrix, which we'll call B. This matrix captures the relationships between items based on user interaction patterns. The goal is to learn a matrix B such that when we multiply it by our original user-item matrix X, we can reconstruct X. The predicted score for a user u and an item i is calculated by summing the similarities of all items the user has interacted with to item i.
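Here is the whole model in one place, wrapped in a function for convenience (the lambda_ default is just a placeholder; tuning it comes later):
Python
import numpy as np

def ease(X, lambda_=250.0):
    """Fit EASE on a sparse binary user-item matrix X, returning item-item weights B."""
    G = X.T.dot(X).toarray().astype(np.float64)  # item-item co-occurrence (Gram) matrix
    diagIndices = np.diag_indices(G.shape[0])
    G[diagIndices] += lambda_                    # L2 regularisation on the diagonal
    P = np.linalg.inv(G)
    B = P / (-np.diag(P))                        # scale each column by its diagonal entry
    B[diagIndices] = 0                           # items can't recommend themselves
    return B

B = ease(train_val_matrix)
preds = train_val_matrix.dot(B)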
Let's break down how each line of code builds this model.
1. The Gram Matrix: Finding Item Co-occurrence
Python
G = X.T.dot(X).toarray().astype(np.float64)
This first step calculates the Gram matrix, a fundamental concept in linear algebra.
- X: This is our user-item interaction matrix, where rows are users and columns are items. A value of 1 in X[u, i] means user u has interacted with item i.
- X.T: This is the transpose of X. We flip the matrix so that rows become items and columns become users.
- .dot(X): We then perform a matrix multiplication (a dot product) between the transposed matrix X.T and the original matrix X.
The resulting matrix, G, is a square item-item matrix. Each entry G[i, j] represents the number of users who have interacted with both item i and item j. Essentially, G is a co-occurrence matrix. A high value indicates that items i and j often appear together in users' interaction histories, suggesting they are similar.
2. Regularization: Preventing Overfitting
Python
diagIndices = np.diag_indices(G.shape[0])
G[diagIndices] += lambda_
This step applies L2 regularization, a standard technique in machine learning to prevent overfitting. Overfitting is when a model learns the training data too well, including its random noise, which makes it perform poorly on new, unseen data.
By adding a small penalty term, regularization helps to create a more generalized and stable model. In EASE, it ensures the Gram matrix G can be inverted (a required step coming next) and prevents the model from assigning excessive importance to any single item.
- lambda_: This is the regularization parameter, the only hyperparameter in the EASE model. A hyperparameter is a configuration setting that is external to the model and whose value cannot be estimated from data. You have to tune it to find the best value for your specific dataset.
- G[diagIndices] += lambda_: We add this small lambda_ value to the main diagonal of the Gram matrix G. In matrix terms, this operation is G = XᵀX + λI, where I is the identity matrix.
3. Inverting the Matrix
Python
P = np.linalg.inv(G)
Here, we compute the inverse of our regularized Gram matrix G. Finding the inverse of a matrix is analogous to finding the reciprocal of a number (e.g., the inverse of 2 is 1/2). This mathematical operation is central to solving the system of linear equations that defines the EASE model. The goal is to find the item-item weights in matrix B, and inverting G is the key to isolating B.
4. Creating the EASE Matrix (B)
Python
B = P / (-np.diag(P))
B[diagIndices] = 0
These two lines finalize the item-item similarity matrix B.
- B = P / (-np.diag(P)): This is a normalization step. It scales each column j of the inverted matrix P by -1/P[j, j], which falls out of the closed-form solution to the EASE objective and ensures the final weights are appropriately scaled.
- B[diagIndices] = 0: This is the crucial constraint that gives EASE its power. We set the main diagonal of our final matrix B to zero.
Why do we do this? The matrix B learns the influence of one item on another. The value B[i, j] represents how much item i's presence in a user's history should increase their predicted score for item j. It makes no sense for an item to recommend itself (e.g., "people who watched Top Gun: Maverick also watched Top Gun: Maverick"). This is a trivial insight. By forcing the diagonal to zero, we ensure the model only learns meaningful relationships between different items.
5. Making Predictions 🚀
Python
preds = train_val_matrix.dot(B)
With our final item-item similarity matrix B ready, making predictions is simple. We perform a dot product between the original user-item matrix (train_val_matrix or X) and B.
The result, preds, is a new matrix of the same dimensions as X. However, instead of binary 1s and 0s, it's filled with predicted scores. For a user u and an item i they haven't seen before (X[u, i] = 0), the new value preds[u, i] is the model's predicted preference score.
This score is calculated by taking all the items the user has interacted with and summing their learned similarity weights from matrix B. For example, a user's predicted score for Top Gun: Maverick would be a weighted sum of their interactions with other movies, like Mission: Impossible and The Edge of Tomorrow, based on the learned similarities in B. The recommendations with the highest scores are then presented to the user.
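A memory-naive sketch of turning the scores into top-10 recommendations (it assumes preds is a dense numpy array and that densifying the interaction matrix fits in memory):
Python
import numpy as np

# Mask out items each user has already interacted with so we only recommend new ones
preds[train_val_matrix.toarray() > 0] = -np.inf

# Take the indices of the 10 highest-scoring remaining items per user
top_10 = np.argsort(-preds, axis=1)[:, :10]

# Map the column indices back to real item IDs via the inverse mapping
top_10_items = [[inv_item_map[i] for i in row] for row in top_10]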
We can see that what we get out are 2 sparse matrices. The first is our interactions matrix which records user-item interactions and has 1 row per user and 1 column per item, with a 1 where an interaction took place. The other sparse matrix is our weight matrix. If we didn't pass weights to the function this would be identical to our interactions matrix, but as we did use weights this matrix is the same shape but records the individual weights for the interactions. Let's convert them to dense types and see how they differ.
So we can see how the matrices both record user-item interactions. Our interaction matrix just has a 1/0 to record whether an interaction took place. Our weight matrix records the same interactions but also the weight for them e.g. on row 2 column 3 we can see an interaction took place and then on the same position in the weight matrix we can see that interaction has the value of 4. Let's now create our interactions matrices for our test data. We don't need to worry about weights for these as they are only used in training although note that LightFM still creates a weight matrix by default.
Now we've got our interaction matrices we can create our first model! To start with we'll just use a vanilla matrix factorisation approach without any additional features. First we define our model by calling LightFM() and setting the different parameters. I've pretty much left them on the default options but written them out anyway to give an idea of the options we have. Once we've defined our model we can call fit() to train the model and pass it our interactions matrix. The default value only trains the model for 1 epoch so I've set it to 20.
Assessing recommenders is slightly different to normal regression/classification problems. We're using binary did/didn't interact data as our target so it seems like a classification problem. However most users don't interact with most items so a measure like Accuracy isn't suitable as we'd probably get 99%+ Accuracy just by predicting 0 for everyone. We might also have some users who are highly likely to buy lots of wines but equally we'll have some who are very unlikely to buy anything at all. For normal classification we'd want to predict 0 for users who are unlikely to buy anything. However in a recommender setting this isn't an option. If we have 10 recommendation slots on the webpage to fill for each user, we can't just leave them blank because we don't think they're likely to buy anything, we still need to show them something.
So our first model gets an average Precision @ 10 of 0.21 on the Training data but this drops quite a lot to 0.10 on the Test data and even further to 0.047 for predicting which new items users might go on to purchase. We can see though that Recall @ 10 is 0.36 which means we're still capturing over 1/3 of the products users do go on to buy in our recommendations, it just looks like most customers aren't buying many wines in general.
To use the predict() function in LightFM we need to pass it a list of User IDs and Item IDs in a slightly idiosyncratic format. Referring to the documentation "if you wish to generate the score for a few items (e.g. [7, 8, 9]) for two users (e.g. [0, 1]), a proper way to call this method would be to use lfm.predict([0, 0, 0, 1, 1, 1], [7, 8, 9, 7, 8, 9]), and _not_ lfm.predict([0, 1], [7, 8, 9]) as you may initially expect". So essentially we need a repeated value of User ID to pair against each item ID we want a prediction for. To get all predictions for all users at once we can do some list building before passing them to predict.
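A sketch of that list building with numpy (model and train_matrix are assumed to be the fitted LightFM model and training interactions; beware that scoring every user-item pair at once is memory hungry at scale):
Python
import numpy as np

n_users, n_items = train_matrix.shape

# Repeat each user index once per item, and tile the item indices once per user
user_ids = np.repeat(np.arange(n_users), n_items)
item_ids = np.tile(np.arange(n_items), n_users)

scores = model.predict(user_ids, item_ids).reshape(n_users, n_items)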
In the output above we've got 1 row per User ID and 1 column per Item ID in the order of their LightFM mapping indices e.g. LightFM User Index 0 is our first row and LightFM Item ID Index 0 is our first column. The actual scores of the predictions are meaningless apart from as a means of creating the rankings i.e. the prediction results are not probabilities and are not comparable across users.
Let's convert the recommendations into something a bit more intelligible by extracting the top 10 recommended items for each user. We'll bring through the top 10 recommendations including and excluding previous purchases as well as the previous purchases themselves so we can see if our new recommendations are a good match with their historical preferences.
We can see that our user had previously purchased the Petite Syrah, Malbec and Merlot amongst others and that LightFM would have re-recommended all of them in the top 10 when making predictions. Let's try removing any previously purchased wines from the recommendations to see how they change.
One easy way to do this is to reuse our training interactions matrix which records all previous purchases as a 1, multiply it by a large number and then simply subtract it from our scores to artificially downweight the scores for all previously purchased lines. Let's try it now.
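A sketch of the trick (scores and train_matrix as above; 100,000 is an arbitrary 'large number'):
Python
import numpy as np

# Anything already purchased gets its score pushed far below everything else
LARGE = 100_000
scores_new_only = scores - (train_matrix.toarray() * LARGE)

top_10_new = np.argsort(-scores_new_only, axis=1)[:, :10]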
This time we get a list of recommendations that are completely new to the user. We can see that the Pinot Noir that was previously 3rd on the list is now at the top. We can see that some of the other top selling wines e.g. the Chardonnay and Sauvignon Blanc also make it onto the list. Finding that the model ends up recommending top sellers can be quite common. Although it's not necessarily a bad thing, if we want to try and recommend more unusual or less popular items there is a quick fix we can try with LightFM.
As well as user and item representations the model learns user and item biases too. Commonly these do the job of capturing how popular an item is and then boost that item's score in the final prediction. To make predictions without any notion of popularity we can simply redo our dot product without the biases:
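A minimal sketch, using the model's raw embedding attributes:
Python
# Score users against items using only the latent factors, ignoring the learned
# popularity biases
scores_no_bias = model.user_embeddings @ model.item_embeddings.T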
Although the Petite Syrah is still at the top, the other recommendations look a lot more obscure. We can also overwrite our model's biases (or make a copy of it first and then do it!) with 0s and then use LightFM's predict() and evaluation functions as normal. Let's see how much of a drop off in performance we get by setting the biases to 0.
We can see there's quite a big drop in performance, particularly when finding entirely new wines for users. So for now we'll leave the biases in our future models.
Hyperparameter tuning
We'll use the Optuna package which will try to automatically find the optimal set of hyperparameters for us from our search space. It does this by conducting repeated trials and modelling LightFM's performance as a function of the different hyperparameters and values that we gave Optuna to use.
To use Optuna we first create a 'study' which is essentially our hyperparameter search space, the data sets we want to use and our assessment metric which we then return. To avoid repeatedly using our Test data we'll split our Train into a smaller Train and Validation set using LightFM's random_train_test_split() function. One thing to note with this is that as the data is split randomly it doesn't preserve the chronology of purchases like our actual Train-Test data does. There is also the chance that as our data is so sparse we might have some users where all of their interactions fall into either the Train or Validation set. There's no possibility for repeat purchase data so our tuning run will most closely resemble Train and Test-new for tuning purposes.
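A sketch of what the study could look like (the search space, its bounds and the tune_train/tune_valid names are assumptions):
Python
import numpy as np
import optuna
from lightfm import LightFM
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import precision_at_k

# Carve a random validation split out of Train (note: chronology is not preserved)
tune_train, tune_valid = random_train_test_split(
    train_matrix, test_percentage=0.2, random_state=np.random.RandomState(42)
)

def objective(trial):
    model = LightFM(
        no_components=trial.suggest_int("no_components", 16, 128),
        loss=trial.suggest_categorical("loss", ["warp", "bpr", "logistic"]),
        learning_rate=trial.suggest_float("learning_rate", 0.01, 0.5, log=True),
        item_alpha=trial.suggest_float("item_alpha", 1e-9, 1e-4, log=True),
        user_alpha=trial.suggest_float("user_alpha", 1e-9, 1e-4, log=True),
        random_state=42,
    )
    model.fit(tune_train, epochs=20, num_threads=4)
    return precision_at_k(model, tune_valid, train_interactions=tune_train, k=10).mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)
print(study.best_params)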
Another great feature of Optuna is we can pass in our original hyperparameter values to give it a 'warm start' in terms of values to explore and a baseline performance that it needs to beat when running the optimisation. Although we didn't use any regularisation on the original model, to keep the parameters on the same log-scale as the trial values, we'll give it the bare minimum.
I've set my study to run for 50 trials. Feel free to try more or less depending on how quickly it trains for you. Another nice feature of Optuna is that the best parameters from all the studies seen so far are kept so if you interrupt it you don't lose all of the learnings up to that point. Once the study is finished we can print out the best hyperparameter values.
We can see that the best model had quite a high number of components (the number of dimensions in the embeddings) and quite a low level of regularisation. Optuna actually has a function that attempts to measure how important each hyperparameter was in terms of contributing to the final performance of the model. It uses a random forest and the hyperparameter values at each iteration to try and predict the trial-model performance for that iteration. Let's see which hyperparameters Optuna thinks had a bigger impact on our final model's performance.
So it looks like the loss value had the biggest impact and then a distant second was item_alpha. Let's now try training the final model on 100% of Train and see how it performs on our Test data.
Our tuned model shows a marginal improvement on the Test-new data so it looks like it's been successful. If we wanted to we could run more trials in the hope that the performance continues to improve. Let's extract the user and item embeddings from the new model and see if our similar items make more sense now.
The similar items don't actually look that similar at this point and we seem to be getting a mix of red and white wines which we wouldn't really want. It looks like it's the same top selling lines e.g. cabernet sauvignon and chardonnay that are appearing in each list. This is probably due to the current sparsity of our data i.e. without much information to draw upon LightFM found it a sensible strategy to recommend top selling lines. This works for generating recommendations but for our item-item associations, recommending a cabernet sauvignon to someone who has just bought a sauvignon blanc doesn't feel ideal.
Weighting interactions
At the start we created some user-item weightings to reflect that users buy some items more often than others. Our initial models have just been treating all interactions equally but let's now try running it with the weights. As well as upweighting more important interactions we could have downweighted less important ones. This is one way that's suggested to deal with very popular items to stop them from always being recommended and try to make our recommendations more diverse.
To use weights with random_train_test_split() we need to pass them separately along with the same random_state to ensure the splits happen in the same place. We can then pass whether or not to use weights as an extra hyperparameter to Optuna to see if it finds any benefit from their inclusion.
So it looks like Optuna found that using interaction weights (or at least the ones we created at the start) didn't improve performance. This is probably not surprising given we upweighted any items regularly purchased by users, but it seems like most users only buy a few wines so the difference was always going to be marginal. For completeness, let's train our newest model on Train and see how it does.
Still performing at around the 5% mark for Test-new precision at 10. Maybe we can try adding in some extra item features to try and boost performance.
Create item features
Interestingly when I was reading up on LightFM there are quite a lot of examples online of people reporting that including additional features actually made their models worse. This seemed surprising as we'd normally expect having access to additional data to be a good thing in machine learning. Reading more on the subject it seemed like there were two main reasons for this.
The first is that by including extra features we actually restrict the expressiveness of the model which is mentioned in a note in the documentation here. This makes sense as we go from having 1 feature per user/item whose job is just to individually represent that user/item in the most useful possible way to combining it with more generalised features that are shared across items/users. These more generalised features can be a good thing as our final representations are more general and so less likely to overfit and can work better in sparser/cold-start scenarios but there is a balance to be struck which leads to reason number two.
There's a good discussion on github around including features and how we need to be careful to only include meaningful/useful features as "if you add lots of uninformative features they will degrade your model by diluting the information provided by your good features". Essentially when creating the final representation LightFM goes user/item = sum(features * weights) so if we put loads of uninformative features into the model the final representations will be largely uninformative too. This means we actually need to practice the slightly old-school data science skill of feature engineering! Another option we'll explore later is adjusting the feature weights so that they have less impact on the final representations.
For now let's try and create some useful features that should be helpful when making wine recommendations. To do this I created ngrams out of all the wine names (extract the individual words or word sequences) and did a count to see which were the most common. I then used these to make features that I thought would be most useful which are broadly wine colour, country of origin, grape types, style, etc.
The script below first tidies up multiple spellings of the same feature having different names e.g. porto and port both refer to Port Wine. Some googling also showed that certain styles of wine or certain regions are linked to certain countries so I was able to extract a bit more country of origin data from those too. This is by no means a comprehensive list so if you want to try adding your own feel free. I also subsequently learnt that the protected term of 'champagne' can actually be used for a limited selection of Californian wines whereas normally it'd indicate an item is from the Champagne region in France.
Now we've got our list of item features created, let's see which are the most common ones.
So it looks like most of our wines are red or white with a few sparkling. The most common country is USA with 148 wines and then France and Italy with around 40 each. One thing to note when making features for LightFM is that as we later create index lookups for them, each feature needs to be uniquely named. We'll also go ahead and remove any completely blank columns for features that didn't match to any wines.
Now we've created our item features we can remake all of our mappings.
We can see from the above that our item mappings (index to item ID) is now shorter than our item metadata mappings which now has an index for each item ID + each feature ID. Let's create our inverse mappings and build our interactions. For features data LightFM likes to have a list of (user/item id, [feature1, feature2]) or (user/item id, {feature1: feature1_weight, feature2: feature2_weight}). Since we're not using weights yet we'll just create a list of item and features.
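A sketch of building the feature matrix (item_feature_list, all_feature_names, user_ids, item_ids and user_item_pairs are assumed names for the structures described above):
Python
from lightfm.data import Dataset

# item_feature_list is assumed to look like:
# [(item_id, ["sparkling", "brut", "rose"]), (item_id2, ["red wine", "malbec"]), ...]
dataset = Dataset()
dataset.fit(
    users=user_ids,                  # all known user IDs
    items=item_ids,                  # all known item IDs
    item_features=all_feature_names, # the unique list of feature strings
)

interactions, weights = dataset.build_interactions(user_item_pairs)
item_features = dataset.build_item_features(item_feature_list)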
By default LightFM normalises (makes sure they all sum to 1) all of the features in the weight matrix. This is generally advisable as since we sum up all the embeddings for each feature to create the final representations we want our final representations to roughly all be on the same scale. For example, if an item with 3 features had a final representation 3x the size of an item with 1 feature then this would potentially skew things when we calculate the dot product as that is sensitive to the underlying size of the embeddings e.g. an item with lots of features could get a boosted score simply from having lots of features. This is what the weight matrix and the normalisation helps avoid. For our 3 feature item, its final representation would instead be (1/3*feature1) + (1/3 * feature2) + (1/3* feature3) instead of (feature1 + feature2 + feature3).
We can see some examples of the mappings and how they seem to be working pretty well. The products are all picked out as sparkling wines with 'brut' and 'rose' as additional characteristics. We can also see one of the 'Californian Champagnes' causing issues with the country of origin assignment! One thing to note is that we only include features that the users/items do have as opposed to recording them as not having those specific features.
Let's now try running Optuna but this time when we fit our model we can pass in our list of items and their features.
Let's take the best parameters and train our final model and see how it performs.
Performance is still around 5% precision @ 10 so at least we didn't make our model worse! Hopefully the added benefit of including the item features is that our suggested item-item recommendations are improved too. Let's see if that's the case.
Now these look a lot better! A large part of this will be due to the fact that our item tags will force items that share tags to at least be partially similar i.e. any shared tags between items means they'll also share the embedding for that tag in their final representation. This is why the model with metadata is less expressive as we're constraining the final representations to be more generalised i.e. rather than each item getting a bespoke embedding they're now the sum of their own bespoke embedding + more general embeddings of any tags that might be shared across products.
In theory this can stop the model overfitting although often people report it negatively impacting model performance. However there are a couple of powerful upsides to including metadata that might make a slight drop off in predictive power worthwhile. One of the main ones that we'll look at later is we can now make recommendations for new or cold-start products. For example, if we have a new French Cabernet Sauvignon, we don't have any user interactions for the product but we can still create a representation of it by summing up the already learnt embeddings for 'France' + 'Cabernet Sauvignon'. We can then either find similar items or predict which users might like it based on those features. We'll see how to do this with LightFM in a bit.
The other benefit that we can see above is that our item-item recommendations make a lot more sense and are easier to understand. A lot of this is because we're forcing items with shared features to share large proportions of their final representations but the hope is that even if it's the sharing of features that drives our top recommendations, each item still has its own bespoke identity embedding that we can learn from. For example, the top suggestions for buyers of the 'Cabernet Sauvignon' is the 'Cabernet Sauvignon, North Coast, 2011' which is just ahead of the 'Cabernet Sauvignon, North Coast, 2012'. This tells us that even amongst the 'red wine' + 'cabernet sauvignon' shared tags of the top wines the North Coast is the best fit and actually it can pick out which vintage is most appropriate since each year is attached to a separate item.
It's also useful to keep an eye out for other similar items that share fewer tags as this is the model telling us that although we tagged the items differently, the collaborative filtering exercise tells us that customers view the features (or the items attached to them) as actually being very similar. For example, looking at the top 10 suggested items for the Malbec we actually quite quickly move into Shiraz wines which tells us that a lot of Instacart users are buying both Malbec and Shirazs.
Adjusting feature weights with tf-idf
Before we move on to looking at user features let's try adjusting our weights for the items. At the moment the item and its features are weighted equally so if an item has 3 features, the final representation of that item is: 1/4 item identity embedding + 3/4 features embeddings. One way to tip the weightings in favour of a more expressive model whilst still retaining the benefit of making cold-start predictions and sensible item-item suggestions is to downweight the features in the final representations so more of the representation comes from the bespoke item identity embeddings. We could experiment with a few different weighting schemes and then use optuna to find the best one. For now I'm going to use sklearn's tfidf function to downweight common tags e.g. 'red wine' in the hope that it allows the individual item or more unusual features e.g. 'shiraz' to come to the fore.
To do this, we first need to create a data frame that has, for each item, all of the unique features associated to it in the form of a long text string.
These look pretty good. We can also see on row 4 the challenge with keyword searches where we've got a product as belonging to the USA and France! Now we've processed our text data we can call TfidfVectorizer() to create our weightings of each of the tags for each of the items. The code below creates a pandas data frame with a row for each product and a column for each item feature e.g. 'sparkling'. The value of the column is the associated tf-idf weight for that product and item feature. We can then loop through each row and return a dictionary of each item feature and its weight, filtering for where weights are >0:
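A sketch of that tf-idf step (item_tags and tag_string are assumed names; the custom token_pattern keeps multi-word tags like 'red wine' intact, assuming the tags were joined with '|'):
Python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One tag string per item, e.g. "rose|brut|sparkling"
vectorizer = TfidfVectorizer(token_pattern=r"[^|]+")
tfidf = vectorizer.fit_transform(item_tags["tag_string"])

tfidf_df = pd.DataFrame(
    tfidf.toarray(),
    index=item_tags["itemID"],
    columns=vectorizer.get_feature_names_out(),
)

# For each item, keep only the non-zero tags as a {feature: weight} dict
item_feature_weights = [
    (item_id, {feat: w for feat, w in row.items() if w > 0})
    for item_id, row in tfidf_df.iterrows()
]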
We can now see for our first item 'Mirabella Rose Brut' that it has 3 features 'Rose', 'Brut' and 'Sparkling'. We can see that 'Rose' receives the highest rating which makes sense. The fact the item is a sparkling wine is definitely important, but the fact it's also a rose is probably more so. Let's now try running optuna with our tf-idf item matrix as one of the possible hyperparameters:
In this case the original evenly split weightings perform better (at least for the number of trials we ran). That keeps things nice and simple and actually makes the final model easier to explain which is a plus! Let's train the best model from our trial and see how it does overall.
Marginally worse but still around the 5% precision at 10 mark. So far we've seen how the item features can improve our item-item recommendations by encouraging products with similar features to be scored more closely. Let's try a more visual representation of this by reducing our embeddings down using t-sne.
Plotting associations with t-sne
Our hyperparameter 'no_components' controls the size of the embeddings that are learnt for our items and users. Usually these are too big to plot so we need a way of reducing down the dimensions. We can use t-sne to do this. I'll create a couple of categorical columns that summarise some of the characteristics about the products and then we can plot the associations to see how wines with different attributes group together.
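A sketch of the reduction and plot (colour_codes is an assumed per-item colour encoding built from those categorical columns):
Python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Final item representations with the feature embeddings folded in
_, item_repr = model.get_item_representations(features=item_features)

# Squash the no_components-dimensional embeddings down to 2D
coords = TSNE(n_components=2, random_state=42).fit_transform(item_repr)

plt.scatter(coords[:, 0], coords[:, 1], c=colour_codes, s=10)
plt.show()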

This is pretty cool. Each dot on the plot represents a product and the distance between them is t-sne's best attempt at condensing down the embedding dimensions into 2D. If we then colour each dot by its respective wine colour we can see that there is a divide between red and white wines, suggesting users tend to stick to a particular colour and that sparkling and rose wines sit somewhere in the middle. Let's do the same plot but just for red wines with their grape type.

Here we can see that the cabernet sauvignon wines tend to cluster together along with the merlots. This makes sense as often Merlot and Cabernet Sauvignon are blended together. We can see the blue cluster of products is all the Pinot Noirs and then there's a big group of unknown wines with the Syrahs and Malbecs in the middle.
So far we've looked at the associations between the items and seen how the features data can help create more intuitive item-item recommendations. A nice perk of how LightFM works, learning separate embeddings for everything, is that we can look at associations not just between items and users but also between their features.
Associations between item features
The LightFM documentation has an example of this but essentially we can calculate the cosine similarity between features in exactly the same way we do for items. We simply extract their representations directly from the model rather than using get_item_representations(). We can then see what other features are related to each other.
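A sketch of the idea (item_feature_map is the feature-name-to-index mapping from the Dataset; 'bordeaux' is just an example lookup):
Python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# When item features are used, item_embeddings has one row per item *feature*
# (the per-item identity features plus the shared tags)
feature_embeddings = model.item_embeddings

bordeaux_idx = item_feature_map["bordeaux"]

sims = cosine_similarity(
    feature_embeddings[bordeaux_idx].reshape(1, -1), feature_embeddings
)[0]
top_features = np.argsort(-sims)[:10]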
For the feature 'Bordeaux', a famous wine region in France we see the most similar features are other French and European wine regions. For 'Cava', a sparkling wine made in Spain, we see it has a strong association to 'Rioja', a Spanish red wine, as well as some Italian features. Interestingly 'Malbec' and 'Argentina' also feature on the list. The feature embeddings can provide useful insight into the category. For example, although users find cava and prosecco similar, the relationship with champagne (another sparkling wine) obviously isn't as strong.
Recommendations for cold-start items
It's actually the embeddings for the item features that also allow us to make recommendation for new or cold-start items. These items don't have any user interactions so we can't create embeddings for them directly. However what we can do is express the item in terms of the features that we do have embeddings for. For example, let's say we're launching a new red wine from Bordeaux that's a merlot-cabernet sauvignon blend. We don't have an identity feature for the wine as it hasn't been interacted with yet but we can still create a representation of it by summing the embeddings for each feature. First, let's get the item feature indexes for each of our attributes.
We can create weights for each of these. To keep things simple we'll just assign them all the same weight which will be 1 / the number of features. The next part is to create a lookup row for our item that mimics the normal item feature matrix that LightFM is used to receiving. We create an array of all 0s that matches the length of the pre-existing item features. We then overwrite at each index for our feature that 0 with our feature weight. As a check we can sum the row to make sure our weights add up to 1.
Now we've created our cold-start item feature row we can convert it into a sparse matrix and pass it to LightFM. We can use the get_item_representations() function from LightFM to calculate the sum(weights*embeddings) for our item and then we can calculate the cosine similarity between it and other items.
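Putting those steps together in one sketch (again, item_feature_map and the tag names are assumptions):
Python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

n_features = model.item_embeddings.shape[0]

# Hypothetical tags for our new wine, looked up in the item feature mapping
cold_idx = [
    item_feature_map[f]
    for f in ["red wine", "bordeaux", "merlot", "cabernet sauvignon"]
]

cold_row = np.zeros(n_features)
cold_row[cold_idx] = 1.0 / len(cold_idx)  # equal weights that sum to 1
assert np.isclose(cold_row.sum(), 1.0)

# LightFM computes sum(weights * embeddings) for our feature row
_, cold_repr = model.get_item_representations(features=csr_matrix(cold_row))

# Compare against every existing item's representation
_, item_repr = model.get_item_representations(features=item_features)
sims = cosine_similarity(cold_repr, item_repr)[0]
most_similar_items = np.argsort(-sims)[:10]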
It looks like LightFM has picked up on the fact our wine is a cabernet sauvignon blend and found us other cabernet sauvignons that it thinks are similar. This way we can find users that bought those items and recommend them our cold-start one on the basis that it's similar so those users should like it too.
If we want to create recommendations for users directly we can do this too. We simply take our cold-item embeddings and calculate the dot product against the user embeddings to create a recommendation score for our new item. We can then append that to all of the predictions we made previously and re-rank them to find users for whom the item ranks highly. Note that we calculate ranks per user rather than just take the highest score for the cold-item as the actual scores in LightFM only have meaning relative to each user as a means of ranking items and not between users.
So it looks like our new item would actually be a very good candidate for a number of users! Nice. Let's double check this by looking at the previous purchases of the top user to make sure a cabernet sauvignon-merlot blend from France would make sense as a recommendation.
Looking at their previous purchases it makes complete sense why our cold-start item would be a good recommendation for them! Now we've looked at adding in item features to our model, let's try adding in some user features too.
Adding in user features
Adding in user features works in exactly the same way as item features. First we need to create our features and map them back to customers (with the option to include their weights). Since we don't have any obvious user features to hand e.g. age, gender, etc. let's create some using the other categories users have previously interacted with. If we were doing this properly we'd want to pick just a few of the categories we think are most important and weight them accordingly. To keep things simple we'll just add every other category a user has interacted with and weight them all equally.
This sort of blanket feature creation is probably where we run the risk of diluting down our useful features. On the other hand our data is so sparse (i.e. few users buy more than 1 or 2 wines) that the extra data might still be beneficial in this instance. First we'll get a list of all the non-wine categories our users have interacted with and then keep a unique list of categories to serve as our list of possible user features.
Now we'll create all of our mappings and interactions. The code at the end converts our list of user ID and aisle-shopped data frame into a list of user ID + aisles-shopped list that we can pass to LightFM to create our user feature matrix from.
So we can see for User 21 all of the other categories they have shopped in as features associated to them. The hope is that the other categories a user shops in contain some information about the types of wines they go on to purchase e.g. users that bought fish might prefer white wine to go with it, users buying organic products might prefer organic or natural wine, etc. Let's go ahead and train and tune our model with item and user features.
Again we get around 5% precision at 10 so it looks like we've not made our model worse by adding in user features. We could probably improve the performance even more by dropping some of the less relevant or useful category behaviours.
Recommendations for cold-start users
Now we've added user features we can make predictions for cold-start users just like we did for cold-start items. We simply create a custom user attribute matrix that has 0s for all the user identity features but we populate with the relevant weights for the categories our new user has previously shopped. We can then use LightFM to create the user representation which we can pass to predict.
It's tricky to know if these seem sensible without testing it against some cold-start users but at least we can see that the process of creating our cold-start user matrix is successfully returning different recommendations for each type of user we created. Just like we did for the item features, we can get LightFM to return the final representation of our cold-start users for us and we can then do our own dot product to create predictions.
That about wraps things up. Hopefully this tutorial has been a useful introduction to the LightFM package. We've seen how we can create recommendations using traditional matrix factorisation approaches and then try and boost performance and tackle the cold-start problem by including item and user features. We also explored tuning the hyperparameters and adjusting the various weight matrices LightFM uses to create its predictions. Well done!