EASE - the Embarrassingly Shallow Autoencoder recommender model
In this post we're going to see how we can build the EASE recommender model in Python. We'll learn some of the theory behind the model and how it can create state of the art recommendations with just a few lines of numpy.

An 'Embarrassingly Shallow' Autoencoder (EASE)
This tutorial covers what is probably my favourite recommender algorithm, the Embarrassingly Shallow Autoencoder (EASE) model. Apart from having one of the best names for an algorithm (passive aggressive is still my favourite), EASE manages to give state-of-the-art performance, needs only a few lines of numpy to implement and has only 1 hyperparameter to tune. It's also got a few other useful attributes that we'll see later.
An autoencoder is a type of neural network typically used for unsupervised learning. It aims to learn a compressed representation (encoding) of the input data and then reconstruct the original input (decoding) from this representation. The "bottleneck" layer in the middle holds the compressed representation.
Although "autoencoder" is in its name, EASE is not actually a deep learning model, it has no hidden layers in the traditional sense, hence the 'embarrassingly shallow'. The 'autoencoder' comes from how it follows the idea of reconstructing its input data (in our case a user-item interaction matrix) through a learned representation (an item-item similarity matrix).
In this way, EASE is a collaborative filtering model as it works on the principle of "wisdom of the crowd", analysing user and item interactions to learn from and create recommendations e.g. "people who liked X also liked Y". Collaborative filtering models analyse the user-item interaction history directly, agnostic to who or what those items actually are. This is in contrast to content based models that focus on the specific attributes of the items or users themselves e.g. the genre of a movie or the demographic details of users (e.g., age, gender).
The author of EASE, Harald Steck, described it in his 2019 paper, 'Embarrassingly Shallow Autoencoders for Sparse Data' as a "linear model that is geared toward sparse data, in particular implicit feedback data for recommender systems." Let's run through what each of these means. We mentioned already how the autoencoder inspiration for EASE comes from it trying to reconstruct the original user-item interaction matrix by learning another representation (which for EASE is an item-item similarity matrix). The reconstruction happens by multiplying the original user-item interaction matrix by this item-item weight matrix and then measuring the error of the reconstruction.
The "linear" aspect of EASE refers to how it predicts an item's interaction patterns. For any given item, the model approximates its user interaction vector as a weighted sum (i.e. a linear combination) of the interaction vectors of all other items. The "weights" in this sum are the item-item similarity matrix that the model learns. This task of predicting a target variable (an item's vector) from a linear combination of other variables is essentially a regression problem. EASE formulates this as a least-squares problem with an L2 regularization penalty, which makes it a form of Ridge Regression. This regularization tries to prevent overfitting and help EASE to learn a stable, generalisable similarity matrix. A clever constraint is applied during training: the diagonal of the item-item similarity matrix is set to zero so an item cannot appear identical to itself. This forces the model to learn similarities without "cheating", ensuring the model learns meaningful relationships between different items.
The 'sparse data' refers to the user-item interaction matrix that the model learns from. It's a matrix where rows represent users, columns represent items and the entries indicate whether a user has interacted with an item (e.g., purchased, rated, viewed). Typically, this matrix is very sparse, meaning most entries are zero/missing as most users won't have interacted with most items. It's not unusual for over 99% of all the possible user-item interactions in the matrix to be missing!
The implicit feedback means it learns from feedback that is derived indirectly from user behaviour and interactions, without the user explicitly stating their preferences. For example, we might take the fact a user clicked on a product or watched a video as them implicitly showing a preference for it rather than them explicitly telling us so by leaving a 5 star review for it. Implicit feedback is inferred from actions rather than direct statements. It was actually Harald Steck's 2010 paper 'Training and testing of recommender systems on data missing not at random' that highlighted the advantages of using implicit data with the insight that users often don't interact with items they dislike. This allows us to use the absence of interactions to infer user preferences i.e. we can infer what they do like from items they have interacted with but also what they dislike from what they have avoided.
If we solely rely on users explicitly telling us they like or don't like something then if a user hasn't given a rating for a product we can't infer any further information which severely limits our ability to learn what they dislike. In the paper Steck shows that the "absence of ratings carries useful information for improving the top-k hit rate concerning all items" which gives implicit feedback an edge when it comes to best capturing user preferences. On top of using implicit feedback, EASE also works best when the data is encoded in a binary fashion i.e. 1/0 to represent if a user did/didn't interact with an item.
Data Preparation
Let's go ahead and read in our data and prep it ready for modelling, then we can create our first EASE model to understand more about how it works under the hood. We'll be using H&M fashion data from this Kaggle competition. The data is very large so we'll only keep the columns we need and take some extra steps to shrink it down to something manageable. Some of the ID columns are stored as very long hashed string values that take up a lot of memory so we'll replace them with integer mappings using sklearn. One of the main challenges with EASE is that it's a memory intensive model so anything we can do upfront to decrease the size of our training data is helpful.
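As a rough sketch of this step (the file names below are the ones from the Kaggle download and the column choices are illustrative), we read in only the columns we need:

```python
import pandas as pd

# Only load the columns we actually need to keep memory usage down
orders = pd.read_csv(
    "transactions_train.csv",
    usecols=["t_dat", "customer_id", "article_id"],
    parse_dates=["t_dat"],
)
articles = pd.read_csv("articles.csv", usecols=["article_id", "prod_name"])
```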
Looking at the 'articles' data frame it looks like we've got a couple of different ways of identifying products. The 'article_id' appears to be a unique identifier for a product but there's also the 'prod_name' field that looks like the name/description of the product, although it doesn't appear to be unique. We can run a count on the number of unique occurrences in both variables to see if this is the case.
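Something along these lines gives us the unique counts (a minimal sketch, assuming the articles data frame read in above):

```python
# Compare the number of unique product identifiers vs product names
print(articles["article_id"].nunique())
print(articles["prod_name"].nunique())
```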
To make the results a bit more interpretable it'd be useful to use the 'prod_name' for our model which would also mean we can shrink the purchase data down a bit more. The difference in the number of 'prod_name' and 'article_id' values could possibly be when it's essentially the same product (so same name) but in a different colour or size which might require it to get its own article_id. For our purposes, we'll take a purchase of any 'Strap top' as an interaction with that product rather than recording interactions with every unique Strap top article ID. We'll also rename 'prod_name' to 'itemID' to make it a bit more generic.
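A sketch of that mapping step, joining the product name onto each transaction and renaming it to itemID:

```python
# Attach the (non-unique) product name to each purchase and treat it as our item
orders = orders.merge(articles, on="article_id", how="left")
orders = orders.drop(columns=["article_id"]).rename(columns={"prod_name": "itemID"})
```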
Above is a print out of our order data. We can see that we've got t_dat which is our date column, customer_id which is a very long hashed key and our newly created itemID. EASE only needs a userID and itemID to train from as it doesn't care about the order or sequence in which customers interacted with items. We'll use the t_dat column to split our data into a training, validation and test set later.
Finally we'll do a bit more data prep by replacing the long hashed string value of customer_id with a simple integer index. We'll also shrink our training data by removing any repeat purchases of items by customers, keeping only the first time someone bought the product type. EASE uses a binary interaction matrix i.e. 1/0s so it doesn't care if a customer has bought something more than once. We'll also filter to only use the latest 18 months of data in the table to arbitrarily reduce the size of it whilst giving us a long enough time period to train and predict on.
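A sketch of this final bit of prep, using sklearn's LabelEncoder for the customer IDs and an 18 month cut-off taken from the latest date in the data (the exact cut-off logic here is an assumption):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Swap the long hashed customer_id strings for compact integer codes
orders["customer_id"] = LabelEncoder().fit_transform(orders["customer_id"])

# Keep only the first time each customer bought each product type (EASE is binary anyway)
orders = orders.sort_values("t_dat").drop_duplicates(["customer_id", "itemID"], keep="first")

# Keep roughly the latest 18 months of data to shrink things further
cutoff = orders["t_dat"].max() - pd.DateOffset(months=18)
orders = orders[orders["t_dat"] >= cutoff]
```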
The original competition supplied over 2 years' worth of data and had a prediction window of just one week. For this tutorial we'll use a 6 month training window with a 6 month prediction window. This is to give most customers who are in our prediction set the chance to actually shop again and so we can measure our performance on lots of customers rather than the handful that shop in any given week. In a real world setting we'd likely choose a shorter window based on our retraining and prediction schedule i.e. retrain weekly to predict the next 7 days.
We'll actually take a chronological train, validation and test split. We'll train our initial model on the oldest 6 months of data and tune the hyperparameters to get the prediction on the next 6 months as our validation data. Finally, we'll retrain the model with the best hyperparameters from our validation period before predicting on the last 6 months of data as our test period.
As EASE can't handle cold-start users or items (those that are in the prediction period but not in the training period) we'll need to remove them from our prediction sets. As we later want to use the validation set as our training data we actually need two separate copies of the validation period - one with cold start customers and items removed when it's being used as our validation set and one where they're included when it's the training period for our test set. To make things a bit clearer I've given the splits below the following names that correspond to how they get used in the workflow.
- train_val = the oldest 6 months of the data. It provides the training data for the train-validation process.
- valid = the 6 months after train_val that are its prediction period. It will have customers and items not in train_val removed.
- train = the same 6 months as valid but with all customers and items included. It is the data we'll train our final model on once we've found the best hyperparameters, and we'll use it to create predictions for the test set.
- test = the latest 6 months of data available. It will be used to measure how well the final model performs and will have customers and items that aren't in the train set removed.
Let's make these data sets and drop the original 'orders' table to free up some more memory.
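A minimal sketch of the chronological split, assuming the 18 months of orders divide neatly into three 6 month blocks:

```python
import pandas as pd

# Chronological boundaries: oldest 6 months, middle 6 months, latest 6 months
start = orders["t_dat"].min()
cut_1 = start + pd.DateOffset(months=6)
cut_2 = start + pd.DateOffset(months=12)

train_val = orders[orders["t_dat"] < cut_1]                             # oldest 6 months
train = orders[(orders["t_dat"] >= cut_1) & (orders["t_dat"] < cut_2)]  # middle 6 months, everyone kept
valid = train.copy()  # same period as train; cold-start users/items (vs train_val) removed later
test = orders[orders["t_dat"] >= cut_2]                                 # latest 6 months

del orders  # free up some memory
```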
Even with restricting ourselves to just 6 months of training data, we still have nearly 8 million rows of data! To save our time and CPUs we can apply an additional filter that removes the infrequent customers or incredibly niche products that customers are unlikely to buy. Whilst not something you would want to do in a real-world setting as you'd lose coverage of what you can recommend and to whom, it's quite a common practice in the literature on recommenders. In a review of different recommender papers, Sun et al. (2020) found that over half the papers they analysed employed filtering of items and users for a minimum number of interactions. We'll be doing it with the aim of shrinking the data to a practical size rather than for any theoretical benefit, so I'm applying quite an aggressive minimum filter of 20+ interactions for both items and users in the data set.
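Here's one way to apply that filter (a single-pass version rather than a full iterative k-core, which is close enough for our purposes):

```python
MIN_INTERACTIONS = 20

def filter_min_interactions(df, min_n=MIN_INTERACTIONS):
    """Keep only customers and items with at least min_n interactions in df."""
    user_counts = df["customer_id"].value_counts()
    item_counts = df["itemID"].value_counts()
    keep_users = user_counts[user_counts >= min_n].index
    keep_items = item_counts[item_counts >= min_n].index
    return df[df["customer_id"].isin(keep_users) & df["itemID"].isin(keep_items)]

train_val = filter_min_interactions(train_val)
train = filter_min_interactions(train)
```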
We can see after filtering that we still have ~2.96 million rows in our train_val data and ~1.5 million in train so the numbers still aren't tiny! Now we know which customers and items have passed our minimum interaction requirements, we can go about removing our cold start items and users from our prediction periods. We'll also run some quick stats on the different periods to check the dates, number of users and items.
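Removing the cold-start customers and items can be as simple as the following sketch, matching the split names above:

```python
# valid is scored against a model trained on train_val, test against one trained on train,
# so each prediction set only keeps customers/items present in its training set
valid = valid[
    valid["customer_id"].isin(train_val["customer_id"])
    & valid["itemID"].isin(train_val["itemID"])
]
test = test[
    test["customer_id"].isin(train["customer_id"])
    & test["itemID"].isin(train["itemID"])
]
```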
We can see that our start and end dates between the periods are consecutive and don't overlap. It looks like we're missing a few days in our valid/train period which is likely due to stores being shut over the Christmas period. Interestingly, despite using the same 6 month window length and 20 interaction filter for train_val and train we have a lot fewer customers and rows in the train period which suggests customers don't buy as many clothes over winter as they do in the summer!
Now we've got our data prepped and filtered, we can convert our pandas data frames into the csr_matrices that EASE works with. The csr_matrix from scipy efficiently stores sparse matrices by storing only the non-zero elements, along with their row and column indices. This gives us significant memory savings when dealing with matrices containing mostly zeros.
The row and column indices are the key to working with the matrices as they allow us to access the elements and provide a mapping back to our original data. However this can get a bit tricky when switching between our train and test sets which might have different numbers of users and items in them. The easiest way round this is to create a mapping between all our users and items and then construct our csr_matrix from these. This way we ensure the mapping from our original data to the csr_matrix is consistent and we don't waste any memory as only the non-zero values are stored. Let's create some mappings and a function to create our csr matrices.
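A sketch of those mappings and the helper function (the variable names here, like create_csr and inv_item_mappings, are the ones we'll reuse later in the post):

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# One global mapping over every user and item we've ever seen so that row/column
# indices mean the same thing in every matrix we create
all_users = pd.concat([train_val["customer_id"], train["customer_id"], test["customer_id"]]).unique()
all_items = pd.concat([train_val["itemID"], train["itemID"], test["itemID"]]).unique()

user_mappings = {user: idx for idx, user in enumerate(all_users)}
item_mappings = {item: idx for idx, item in enumerate(all_items)}
inv_item_mappings = {idx: item for item, idx in item_mappings.items()}

def create_csr(df):
    """Convert a user-item interaction dataframe into a binary csr_matrix."""
    rows = df["customer_id"].map(user_mappings).to_numpy()
    cols = df["itemID"].map(item_mappings).to_numpy()
    data = np.ones(len(df), dtype=np.float32)
    return csr_matrix((data, (rows, cols)), shape=(len(user_mappings), len(item_mappings)))

train_val_matrix = create_csr(train_val)
valid_matrix = create_csr(valid)
train_matrix = create_csr(train)
test_matrix = create_csr(test)
```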
We can see that even though we had different numbers of customers in train_val and valid their csr equivalents are actually the same shape. This is because we need to create rows and columns for every user and item we've ever seen to preserve the unique index mapping i.e. we know that row 0 always refers to user 0 across all matrices we create even if they didn't shop in any specific time period. It simply leaves those values 0/blank if the mapping doesn't find its respective customer or item in the data that we pass it.
We'll also define a 'precision @ k' scoring function to help us assess how well our recommendations are performing. Precision at k measures, for each customer, how many of the k predictions we made they actually went on to interact with. It's a metric commonly used in information retrieval and recommendation systems to evaluate the accuracy of a ranked list of predictions.
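Here's a minimal version of the scorer we'll use, where recommended is a ranked list of item indices and actual is the set of items the customer went on to buy:

```python
def precision_at_k(recommended, actual, k=10):
    """Share of the top-k recommended items that the customer actually interacted with."""
    top_k = list(recommended)[:k]
    hits = len(set(top_k) & set(actual))
    return hits / k
```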
EASE in a few lines of numpy
Now we've got our data prepped, we can build our first model. The standard implementation of EASE uses a few different letters to denote different parts of the process starting with X, our interaction data that the model is trained on. To make the next steps a bit more general and easier to follow we'll adopt these naming conventions. We'll also make a global variable 'lambda_' which stores our L2 regularisation value.
The core of EASE lies in creating an item-item similarity matrix, which is usually called B. This matrix captures the relationships between items based on user interaction patterns. The goal is to learn a matrix B such that when we multiply it by our original user-item matrix X, we can reconstruct X. This is where the notion of EASE as an autoencoder comes from i.e. we're running our data through a process with the aim of reconstructing it as the output.
To generate predictions for users, we then dot product the item-item weight matrix against our original interaction matrix. The predicted score for a user 'u' and an item 'i' is calculated by summing the similarities of all items the user has interacted with to item i. All of this can be achieved with a few lines of numpy. Let's see how this looks and then we can run through what each step is doing.
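Here's the whole model as a handful of numpy operations, a sketch of the standard closed-form implementation using the naming conventions above:

```python
import numpy as np

lambda_ = 0.5            # our single L2 regularisation hyperparameter
X = train_val_matrix     # binary user-item interaction matrix (users x items)

# 1. Gram matrix: item-item co-occurrence counts
G = (X.T @ X).toarray().astype(np.float32)

# 2. Add the regularisation term to the diagonal
diag_indices = np.diag_indices_from(G)
G[diag_indices] += lambda_

# 3. Invert the regularised Gram matrix (the closed-form step)
P = np.linalg.inv(G)

# 4. Item-item weight matrix: B[j, k] = -P[j, k] / P[k, k], with a zero diagonal
#    so an item can't simply predict itself
B = P / (-np.diag(P))
B[diag_indices] = 0.0

# 5. Predictions: reconstruct the interaction matrix from the learned weights
#    (note this creates a dense users x items matrix, which can be memory hungry)
preds = X @ B
```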
The first step is to calculate the gram matrix, G. In our case this is just the dot product of our interaction matrix with the transpose of itself which is actually just a count of the co-occurrence of each item pair. G is a square item-item matrix where each entry G[i, j] i.e. row and column represents the number of users who have interacted with both item i and item j. A high value indicates that items i and j often appear together in users' interaction histories, suggesting they are similar. The format of G needs to be a float for the calculations it gets used in later on but we can print it as an int to show that all it does is record co-occurrence counts.
We can see that our G gram matrix has a shape of 15,653 x 15,653 which is the number of items in our data. We can also see that the first product was bought 1,131 times and that the first and second products were bought together by 2 customers. Which items occur together, i.e. which ones appeal to the same customers, is the foundation for learning how different customers interact with the products and which ones might be similar or good candidates for a customer based on what else they've bought.
Now that we have our Gram matrix, G, the next step is to apply regularization. This technique is essential for building a robust and effective EASE model. Remember that EASE's primary goal is to learn an item-to-item relationship matrix, B. This matrix captures how different items relate to each other based on user interactions. However, the number of parameters (weights) in this B matrix grows quadratically with the number of items. For example:
- 1,000 items = 1,000 x 1,000 = 1,000,000 parameters
- 10,000 items = 10,000 x 10,000 = 100,000,000 parameters
With so many parameters to learn from what is often sparse user-item data, the model can easily overfit. Overfitting occurs when a model learns the training data too well—including its noise and random fluctuations—and fails to generalize to new, unseen data. This would result in poor quality recommendations.
To combat overfitting, we use L2 regularisation. This technique adds a penalty to the model's objective function, making it "costly" for the model to assign large weights in the B matrix. By discouraging large weights, L2 regularization forces the model to find a simpler, more generalized solution. Instead of relying on a few strong (and potentially misleading) item-to-item connections, the model learns to spread the influence across a broader set of related items. This leads to more stable and reliable recommendations. In our code, we apply our regularisation term lambda_ to the diagonal of our G matrix:
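That's this single line from the sketch above:

```python
# Add the L2 penalty to the diagonal of the Gram matrix before inverting it
G[np.diag_indices_from(G)] += lambda_
```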
Now we've got our regularised co-occurrence matrix we're ready to do the really clever bit. We compute the inverse of our regularized Gram matrix. This is what allows us to compute EASE as a closed-form solution, which means it doesn't require iterative optimisation like neural networks. Inverting the Gram matrix (with the added regularization term) is the analytical step that transforms raw interaction data into the learned relationships used for recommendations. It learns a set of item-to-item weights that, when applied to a user's past interactions, can "reconstruct" their interaction history and, more importantly, predict future interactions.
Now we've got our inverted matrix we can create our final item-item weight matrix B. The matrix B captures the influence of one item on another. The value B[i, j], i.e. any pair of products in a row with another in a column, represents how much item i's presence in a user's history should increase their predicted score for item j. At the same time we want to remove the ability of the model to arbitrarily recommend 'user bought item X so recommend them item X' so we set the diagonal of the matrix, i.e. where i and j would pick out the same item, to 0. This forces the model to learn only meaningful relationships between different items.
With our final item-item similarity matrix B ready, making predictions is simple. We perform a dot product between the original user-item matrix (in our case the train_val_matrix or X) and B. The result, preds, is a new matrix of the same dimensions as X. However, instead of binary 1s and 0s, it's filled with predicted scores. For a user 'u' and an item 'i' (X[u, i] = 0), the new value preds[u, i] is the model's predicted preference score. This score is calculated by taking all the items the user has interacted with and summing their learned similarity weights from matrix B. For example, if we were recommending films, a user's predicted score for 'Top Gun: Maverick' would be a weighted sum of their interactions with other movies, like 'Mission: Impossible' and 'The Edge of Tomorrow', based on the learned similarities in B. The recommendations with the highest scores are then presented to the user.
The predictions of X.dot(B) can be seen as a reconstruction of the user-item interactions based on the learned item-item relationships. EASE aims to find an item-item similarity matrix B that, when multiplied by the original user-item interaction matrix X, effectively reconstructs a version of user preferences. In essence, EASE learns a global map of how all items relate to each other based on collective user behaviour. This map, B, is then used to project a user's known preferences onto the entire item catalogue to find new, relevant items.
Let's get the predictions for the first customer in our data and compare them to their previous purchases to see if they make sense.
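One way to pull those out is with a small helper like this (the name get_top_k and its exact signature are just for illustration):

```python
import numpy as np

def get_top_k(user_index, preds, X, k=10, remove_seen=True):
    """Return the top-k item names for one user from the dense preds matrix."""
    scores = np.asarray(preds[user_index]).ravel().copy()
    if remove_seen:
        seen = X[user_index].indices       # items the user already interacted with
        scores[seen] = -np.inf
    top_items = np.argsort(scores)[::-1][:k]
    return [inv_item_mappings[i] for i in top_items]

print(get_top_k(0, preds, train_val_matrix, k=10))
```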
We can see that there's quite a lot of t-shirts in the recommendations. Let's compare these to what the customer actually bought to see if they seem like a good fit for them.
There look to be a few t-shirts in their historical data which is reassuring, although the customer also seems to buy quite a few pairs of Slacks but didn't receive any recommendations for these. We'll see later how we can tune the lambda_ parameter to try and improve our recommendations. We can calculate precision @ 10 on our customer to see how many of our recommendations they actually went on to buy. In the below example it looks like we successfully predicted 1 item out of the 10, i.e. a precision @ 10 of 10%, with the customer buying the 'Wow printed tee 6.99'.
Creating a model class for EASE
So far we've seen how we can create our EASE model with just a few lines of numpy. Sometimes though it can be helpful to package all the different transformations into a model class where we can add some extra comments and quality of life improvements. For example, calculating predictions for all customers and all items in one go can be memory intensive. Instead, since all we're doing is the dot product between matrices we could create a process to do this in batches. In the previous example we removed recommendations for items that customers had already purchased but sometimes we might want to leave these in. The model class below incorporates some of these flexibilities.
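Here's a sketch of what such a class could look like. It packages the same closed-form fit as above plus batched scoring and an option to keep or remove already-purchased items; the class and method names are just the ones used for the rest of this post rather than any official implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix


class EASE:
    """Embarrassingly Shallow Autoencoder with batched top-k scoring."""

    def __init__(self, lambda_=0.5):
        self.lambda_ = lambda_
        self.B = None

    def fit(self, X: csr_matrix):
        """Learn the item-item weight matrix B from a binary user-item matrix X."""
        G = (X.T @ X).toarray().astype(np.float32)
        diag = np.diag_indices_from(G)
        G[diag] += self.lambda_                 # L2 regularisation on the diagonal
        P = np.linalg.inv(G)
        self.B = P / (-np.diag(P))              # B[j, k] = -P[j, k] / P[k, k]
        self.B[diag] = 0.0                      # an item can't predict itself
        return self

    def predict(self, X: csr_matrix, k=10, batch_size=10_000, remove_seen=True):
        """Return the indices of the top-k items for every user, scoring in batches."""
        n_users = X.shape[0]
        top_k = np.zeros((n_users, k), dtype=np.int32)
        for start in range(0, n_users, batch_size):
            end = min(start + batch_size, n_users)
            batch = X[start:end]
            scores = np.asarray(batch @ self.B)           # dense (batch x items) scores
            if remove_seen:
                rows, cols = batch.nonzero()
                scores[rows, cols] = -np.inf              # mask already-purchased items
            # argpartition gets the top-k quickly, then we order just those k columns
            part = np.argpartition(-scores, k, axis=1)[:, :k]
            order = np.argsort(-np.take_along_axis(scores, part, axis=1), axis=1)
            top_k[start:end] = np.take_along_axis(part, order, axis=1)
        return top_k
```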
Now we've created our model class we can instantiate it and then call it on our data. By default, the model uses a lambda_ value of 0.5, scores in batches of 10,000 and removes previously interacted-with items.
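Usage then looks something like this, fitting on train_val and scoring precision @ 10 against what the same customers bought in the validation period (the evaluation loop is a sketch):

```python
import numpy as np

ease = EASE(lambda_=0.5).fit(train_val_matrix)
top_k = ease.predict(train_val_matrix, k=10, batch_size=10_000, remove_seen=True)

# Average precision @ 10 over the customers who actually shopped in the validation period
val_users = np.unique(valid_matrix.nonzero()[0])
scores = [precision_at_k(top_k[u], valid_matrix[u].indices, k=10) for u in val_users]
print(f"Mean precision @ 10: {np.mean(scores):.4f}")
```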
We can see that our model scored an average precision at 10 of just 0.009 which isn't very high. This could be because the problem of predicting what fashion items a customer is going to buy next is just very difficult with changing trends and seasonality. Or it might be that our model isn't very good. One way to check this is to use a sensible baseline model that can give us some indication of whether the problem is difficult or whether we need to revisit our model. A simple and often surprisingly effective baseline model in recommendation tasks is to simply recommend the top selling or most popular products to users.
Although not really personalised recommendations, the fact that these products are high selling overall usually means there is something desirable about them, they have broad appeal to all or most customers and they can be a difficult benchmark to beat. Let's create a function that works out what the top selling products in the training period were and then recommends them to customers whilst removing any they've already purchased.
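A sketch of that baseline, reusing the mappings and precision function from earlier:

```python
import numpy as np

def top_sellers_precision(train_df, X_train, eval_matrix, k=10):
    """Recommend the overall best sellers (minus items already bought) and score precision @ k."""
    # Item indices ranked by how many purchases they had in the training period
    popular = train_df["itemID"].map(item_mappings).value_counts().index.to_numpy()

    eval_users = np.unique(eval_matrix.nonzero()[0])
    scores = []
    for u in eval_users:
        seen = set(X_train[u].indices)
        recs = [i for i in popular[:200] if i not in seen][:k]  # top 200 is plenty for k=10
        scores.append(precision_at_k(recs, eval_matrix[u].indices, k=k))
    return float(np.mean(scores))

print(top_sellers_precision(train_val, train_val_matrix, valid_matrix, k=10))
```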
We can see that just recommending the next best selling product to our customers from the training period would actually have made for a better model! Popularity models can be surprisingly hard to beat but let's see if we can improve the performance of our EASE model with some hyperparameter tuning.
Hyperparameter tuning
The fact that EASE only has one hyperparameter and is generally a lot quicker to train than other models means we can try out lots of values of lambda_ to try and improve our model. A small value of lambda_ (like the 0.5 we used in our initial model) places very little emphasis on the regularization term. The model will focus almost entirely on minimising the reconstruction error. This can lead to overfitting, where the model learns the training data too well and performs poorly on new, unseen data. The resulting matrix B might have large, specific values that are not generalisable. A large lambda_ (e.g., 100,000) places a strong penalty on the Frobenius norm of B. The model will then prioritise keeping the values in B small, even at the cost of a higher reconstruction error. This can lead to underfitting, where the model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and test sets.
The below code uses Optuna to try and find the optimal value of lambda_ from a large search space. The hope is that by giving Optuna a large space to explore and starting it off with different lambda_ magnitudes, it can find the "sweet spot": a lambda_ large enough to prevent overfitting but small enough to still allow the model to learn meaningful item-item relationships. Optuna's job is to find this optimal value by systematically testing different lambda_ values and evaluating their performance on the validation set, so instead of manually trying different values ourselves, we let Optuna pick for us.
The objective function defines the "experiment" for a single trial. The lambda_ = trial.suggest_float('lambda_', 0.1, 100_000) tells Optuna to suggest values of lambda_ ranging from 0.1 up to 100,000. Optuna uses sophisticated algorithms to intelligently choose new values for lambda_ in each trial, aiming to find the best one more efficiently than a simple grid search. We then run the model with that value of lambda_ and measure and record its performance.
The optuna.create_study(direction='maximize') creates an Optuna "study," which is essentially a container for all the trials. The direction='maximize' tells Optuna that our goal is to find the set of hyperparameters (in this case, just lambda_) that results in the highest possible value for the returned score from the objective function. The study.enqueue_trial(...) below this manually adds a set of initial trials to the study. This is a form of warm-start, providing Optuna with a few initial data points to help it start its search more effectively. The study will run these trials first before using its own sampling algorithms.
The study.optimize(objective, n_trials=25) starts the optimisation process. This tells Optuna to call the objective function 25 times, so feel free to set this to something smaller or larger depending on how quickly your earlier models ran. A sketch of the full tuning loop is shown after the list below. In each trial Optuna will:
- Propose a new value for lambda_
- Execute the objective function with that lambda_ i.e. train our model and measure its performance
- Record the returned score
- Based on the results of previous trials, use its sampler (by default, the Tree-structured Parzen Estimator, TPE) to intelligently choose the next lambda_ to try. The idea is that it explores promising regions of the hyperparameter space more thoroughly and moves away from values that led to poor performance, rather than relying on a random or brute-force grid search.
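Putting that together, the tuning loop looks roughly like this (the warm-start values are just a spread of magnitudes, not anything special):

```python
import numpy as np
import optuna


def objective(trial):
    # Let Optuna pick a lambda_ from a wide search space
    lambda_ = trial.suggest_float("lambda_", 0.1, 100_000)

    model = EASE(lambda_=lambda_).fit(train_val_matrix)
    top_k = model.predict(train_val_matrix, k=10, remove_seen=True)

    val_users = np.unique(valid_matrix.nonzero()[0])
    scores = [precision_at_k(top_k[u], valid_matrix[u].indices, k=10) for u in val_users]
    return float(np.mean(scores))


study = optuna.create_study(direction="maximize")

# Warm-start the study with a few different orders of magnitude of lambda_
for value in [1, 10, 100, 1_000, 10_000]:
    study.enqueue_trial({"lambda_": value})

study.optimize(objective, n_trials=25)
print(study.best_value, study.best_params)
```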
We can extract the best value of our target metric and the hyperparameter combination that led to it from our study object. It looks like our best trial, with a lambda_ of 10625.6330, managed to outperform the top sellers baseline with a precision @ 10 of 0.0177 (beating the top sellers' 0.0151). That represents a relative increase in performance of 94% (nearly twice as good!) over our original model. One concern with such detailed hyperparameter tuning is that potentially our model is now too well optimised for the validation set. We can test this by retraining our final model on our separate train and test set to see if performance is comparable even when using a completely distinct time period.
So on our new time period the model actually performs even better! Although it's always nice to see a performance increase on the test set, ideally we want it to be similar to our validation set to give us confidence that our validation process is robust. We saw earlier that there were fewer customers in train compared to train_val due to the different time periods covered by the data. It might well be that the train-test split time periods are easier for the model to predict due to seasonal changes. To test this we can see how our baseline popularity model performs. If it also shows a relatively similar improvement (but is still similarly outperformed by our new model) then that would lend evidence to the change in performance being due to the data rather than a failure of our process.
We can see that the top sellers baseline has also seen a boost in performance which suggests the time period the model is predicting for is a slightly easier one. Both the EASE model and the top sellers see their relative performances increase by about 35-40% between validation and test but the EASE model still outperforms our top sellers baseline which is reassuring. Now we've seen how we can train and improve our model to make recommendations, we can explore some of the other useful properties of the EASE model.
Item-item recommendations and predicting for new(ish) users
As we've seen, the core of the EASE recommender is the item-item weight matrix, B. This matrix holds the key to understanding the relationships and influences between different items in the dataset. Unlike other collaborative filtering models (see the tutorial on LightFM) that also need to learn user representations, EASE only learns item-item relationships. This gives it a couple of useful attributes that we'll explore now. For example, we can analyse the matrix to understand relationships between products. Let's dig into what the rows and columns in the matrix represent and how we can use them beyond just making recommendations.
The matrix B is a square matrix where both rows and columns correspond to the items in our dataset. However, unlike a simple co-occurrence matrix, it isn't symmetrical as the rows and columns capture slightly different representations of our items. Let's introduce a bit of notation to make keeping track of what's being referred to (hopefully) a bit easier.
- j will be the row index. This means we are looking at the row associated with item j. This row, B[j,:], describes the influence that item j has on all other items. If a user has interacted with item j, this row tells us which other items they are most likely to interact with next. We can think of this as the "successors" or "recommendations" from item j. To find the items most strongly associated with a given item j, we can look for the largest values in the j-th row of B. This is the classic "People who liked this also liked that" scenario.
- k is the column index. This means we are looking at the column associated with item k. This column, B[:,k], describes the influence that all other items have on item k. This answers the question: "Which items, if liked by a user, most strongly predict that they will like item k?" We can think of these as the "precursors" or "defining items" for item k. To understand which items act as the strongest predictors for item k, we would examine the largest values in the k-th column.
- A specific value, B[j,k], is found at the intersection of row j and column k, and it quantifies the strength of the relationship where item j is the predictor and item k is the predicted item. This influence is determined during the model's training process.
We can use these attributes of the item-item weight matrix to create a 'similar items' lookup function. We can pass in a product and get from the matrix what the top recommended ones would be, i.e. find the top values in the row of B, j, that corresponds to our candidate product.
To get the function to work we can pass in an itemID which in our case is just an item name. We can then get its mapped index which will tell us the row index in B that the item corresponds to. Now we have our index, we pass it to the get_similar_items() function.
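A sketch of the lookup, with the item itself excluded from its own recommendations (the product name used here is just one we've seen in the data):

```python
import numpy as np

def get_similar_items(item_id, n_items=10):
    """Return the n_items products with the largest weights in row item_id of B."""
    similarities = ease.B[item_id, :]
    ranked = np.argsort(similarities)[::-1]                      # highest influence first
    similar_items_mapped = [i for i in ranked if i != item_id][:n_items]
    return [inv_item_mappings[i] for i in similar_items_mapped]

item_index = item_mappings["Strap top"]   # look up the row index for a product name
print(get_similar_items(item_index, n_items=10))
```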
The similarities = ease.B[item_id, :] accesses the row of the B matrix corresponding to the item of interest. This row contains the influence scores from the original item to all other items in the dataset. The np.argsort(similarities)[::-1] gets the indices of the items with the highest influence scores. It then excludes the item itself and takes the top n_items. Finally, the function uses the inv_item_mappings dictionary to decode the integer indices (item_id and those in similar_items_mapped) back to their original itemID (the product name).
What we're essentially doing here is passing our product as if it were a user with just 1 interacted item and asking EASE to make predictions for them. We can extend this concept, that our item is the equivalent of a user with only 1 item interaction, to understand how EASE interprets users and makes predictions for them.
Unlike matrix factorisation models that learn embeddings for specific users and so need to be retrained when new users enter the data, a user for EASE is nothing but a collection of interacted products. The actual user ID doesn't feature in the model training or prediction. Treating a user as just a collection of product interactions allows us to create predictions for new users who have interacted with a few items already without having to retrain the entire model like we would for a matrix factorisation model. The below function creates a dummy user-interaction history for an arbitrary set of products and then gets EASE to generate recommendations for them. It effectively extends the similar item function but now passes in a set of products rather than one at a time.
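A sketch of that function, building a one-row interaction matrix for an arbitrary basket of products and scoring it against B (the example products are just ones mentioned earlier in the post):

```python
import numpy as np
from scipy.sparse import csr_matrix

def recommend_for_basket(item_names, n_items=10):
    """Treat a set of products as a brand new user and return EASE's top recommendations."""
    cols = [item_mappings[name] for name in item_names]
    dummy_user = csr_matrix(
        (np.ones(len(cols), dtype=np.float32), ([0] * len(cols), cols)),
        shape=(1, ease.B.shape[0]),
    )
    scores = np.asarray(dummy_user @ ease.B).ravel()
    scores[cols] = -np.inf                      # don't recommend what's already in the basket
    top = np.argsort(scores)[::-1][:n_items]
    return [inv_item_mappings[i] for i in top]

print(recommend_for_basket(["Strap top", "Slacks"]))
```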
The ability of EASE to generate recommendations agnostic of a userID is actually a very helpful property. We could even go as far as to save our item-item weight matrix as a static object and then use it to generate predictions for new users or sets of item interactions on an ongoing basis. The only time we'd need to retrain our model (apart from wanting to keep it up to date on the latest data) is when new products launch.
Visualising item-item relationships with UMAP
In the previous examples we were using the rows of B to understand how buying a product or collection of products determines what other products we'd recommend. Now we can use the columns of our B matrix to understand which items have a similar make-up to each other. The core idea is to treat each column of the B matrix as a high-dimensional vector representing an item's "embedding." By using the column of B this time, rather than the row, each vector is the collection of influence scores flowing into the item. This means the vector essentially acts as a "fingerprint" for the item, defined by which other items are strong precursors or predictors for it. Items that share similar predictive patterns will have similar column vectors, meaning they are likely to be related in the model's eyes.
While the B matrix is great for generating recommendations, its high dimensionality (one dimension for every item) can make it difficult to directly interpret. Visualizing this matrix can provide insights into how the model perceives the relationships between different items. One way to do this is by using a dimensionality reduction technique like UMAP, which stands for Uniform Manifold Approximation and Projection.
UMAP is a dimensionality reduction technique that first builds a "fuzzy simplicial complex" to represent the high-dimensional data. This complex is essentially a weighted graph where each data point is a node and the edges connect neighbouring points. The weight of an edge represents the "likelihood" that the two points are connected. A key aspect of this step is that UMAP uses a variable radius for each point, determined by the distance to its nearest neighbours. This allows the algorithm to adapt to varying densities in the data, giving it a powerful ability to preserve local structure. After the high-dimensional graph is constructed, UMAP tries to find a low-dimensional graph (in 2D or 3D) that is as similar as possible to the original high-dimensional graph. It does this by using a form of optimization, similar to a force-directed graph layout algorithm. It tries to "pull" the connected points together and "push" the unconnected points apart, all while preserving the relationships established in the first step.
The reason for running on the columns, rather than the rows, is that this view is powerful for understanding EASE's internal definition of an item. For example, if two different types of T-shirts are both frequently purchased by users who have previously bought "jeans" and "trainers," their column vectors will be similar, and they will appear close together on the UMAP plot. This is a robust way to group items based on their shared context or dependencies within the model. If we wanted to we could also run UMAP on the rows of B which map the proximity of items that lead to similar recommendations i.e. items that have a similar "successor profile." For now, let's run our UMAP on the columns of B to visualise what products have a similar make up.
First, we initialise the UMAP algorithm, configuring it to map our high-dimensional item embeddings down to two dimensions suitable for a scatter plot. We get the item embeddings from our fitted ease object by asking for ease.B.T, i.e. the columns of B, which provide the unique "fingerprint" for each item. After running UMAP, we merge the result with our product metadata to link each point to its index_group_name (e.g., 'Menswear', 'Ladieswear'). Finally, we use matplotlib to generate the scatter plot, where each point is an item, its position is determined by UMAP, and its colour corresponds to its product department. This plot allows us to visually inspect the clusters and confirm whether our EASE model has successfully learned to group similar types of products together based on their shared purchasing contexts.
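A sketch of that visualisation using umap-learn and matplotlib (the index_group_name column is read from the Kaggle articles file; adjust the metadata join to whatever product attributes you have):

```python
import matplotlib.pyplot as plt
import pandas as pd
import umap

# 1. Reduce each item's column of B (its "fingerprint") to two dimensions
reducer = umap.UMAP(n_components=2, random_state=42)
embedding_2d = reducer.fit_transform(ease.B.T)

# 2. Attach each point back to its product name and department
articles_meta = (
    pd.read_csv("articles.csv", usecols=["prod_name", "index_group_name"])
    .rename(columns={"prod_name": "itemID"})
    .drop_duplicates("itemID")
)
plot_df = pd.DataFrame(embedding_2d, columns=["x", "y"])
plot_df["itemID"] = [inv_item_mappings[i] for i in range(len(plot_df))]
plot_df = plot_df.merge(articles_meta, on="itemID", how="left")

# 3. Scatter plot coloured by index_group_name
fig, ax = plt.subplots(figsize=(10, 8))
for group, grp in plot_df.groupby("index_group_name"):
    ax.scatter(grp["x"], grp["y"], s=3, alpha=0.5, label=group)
ax.legend(markerscale=3)
ax.set_title("UMAP projection of the columns of B")
plt.show()
```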

The UMAP code brings our conceptual understanding to life by creating a 2D visualization of the item relationships learned by EASE. We can see that a lot of the items overlap but a couple of interesting groupings stand out. For example, Sport products seem to form their own grey cluster at the bottom as well as the green Divided category in the bottom left.
Summary
Congratulations on making it to the end of this tutorial! Hopefully you've now got a better idea of why EASE is an 'embarrassingly shallow' autoencoder and how its clever closed-form solution allows it to give state of the art recommendations in just a few lines of numpy. We've also looked at how we can use the learnt item-item weight matrix, B, to generate recommendations for arbitrary sets of item interactions and also visualise item-item relationships within our data. Let's finish with a quick re-cap of the strengths and weaknesses of the algorithm. If you want to learn more about other recommender models such as matrix factorisation there's a tutorial on lightfm available here.
Strengths of EASE
- Simplicity and Speed: EASE has a direct, closed-form solution. Unlike iterative methods, it requires no complex gradient descent or optimization loops. Training is exceptionally fast for datasets with a moderate number of items.
- State-of-the-Art Results: Despite its simplicity, EASE often performs on par with or even surpasses more complex matrix factorization and deep learning models on standard item recommendation benchmarks.
- Minimal Hyperparameter Tuning: There are no learning rates, embedding dimensions, or training epochs to worry about. The only crucial hyperparameter is the regularization term lambda_, which simplifies the modelling process immensely.
- No user embeddings: As EASE is an item-item model it doesn't need to learn separate embeddings for customers, so it can create predictions for new users who have a few interactions without having to retrain the entire model.
Weaknesses of EASE
- Memory Scalability: The biggest challenge for EASE is scalability. The model needs to create and invert an item-item co-occurrence matrix of size (number_of_items x number_of_items). For catalogues with hundreds of thousands or millions of items, the memory and computational requirements for this step can be prohibitive.
- Inability to Use Side Information: EASE is a pure collaborative filtering model. It cannot natively incorporate metadata (side information) like user demographics or item attributes (e.g., product category, price, text descriptions). This limits its effectiveness in cold-start scenarios where interaction data is sparse.

