
Item and User KNN Recommender models with similaripy in Python

In this post we're going to see how we can build KNN recommender models in Python using the similaripy package. We'll learn some of the theory behind different similarity measures as well as some graph-based methods that yield state of the art results.

[Image: KNN Recommender.png]

How do KNN recommender models work?

K-Nearest Neighbours (KNN) recommenders are a memory-based collaborative filtering family of models. They make recommendations by finding other users or items whose interaction histories are similar to a target user or item, and then use those neighbours to produce scores or ranked suggestions (it works on the basis of “people who liked X also liked Y”). Because KNN relies only on patterns of interactions (not on item or user metadata), it’s a collaborative approach as opposed to content based methods that use item/user attributes (e.g., movie genre or user demographics).

User–item interactions are typically represented as a matrix R where rows are users, columns are items, and entries R[u,i] record interactions (a purchase, rating, view, etc.). These matrices are highly sparse as most users will only have interacted with a tiny fraction of items. It's not unusual for over 99% of all the possible user-item interactions in the matrix to be missing!

​The core idea behind KNN is to find the "K" most similar data points in this sparse matrix to a given data point. In the context of a recommender system, this translates to finding the K most similar users or items. The model then uses these neighbours to generate recommendations. There are two main approaches:

  • User-based KNN: This approach finds users who are similar to the target user based on their past ratings or behaviour (based on their rows in R). For example, if User A and User B have rated many of the same movies similarly, they are considered neighbours. The model would then recommend movies to User A that User B liked but User A hasn't seen yet.

  • Item-based KNN: This approach finds the K items most similar (based on columns in R) to each item the target user has interacted with, then recommends those similar items to the user. For instance, if a user enjoys a particular action movie, the model would find other action movies that are often rated highly by people who also liked the initial movie. It then recommends those similar movies.

Item-based methods are often preferred in large production systems because there tend to be fewer items to calculate similarities between than there are customers. It also means we can easily compute recommendations for new customers as soon as they have a few interactions without having to retrain the whole model. There's not a huge difference in the code or methodology between the models so we'll start with item KNN and then at the end see how we can quickly apply everything we've learnt to create user KNN models.

The "similarity" between users or items is calculated using a distance or similarity metric. Cosine similarity is commonly used for recommender systems because it's less affected by differences in user rating scales. It measures the cosine of the angle between two vectors, which in our case are the rows or columns of our user-item interaction matrix depending on whether we're finding similar users or items. In this way, users or items that have similar interaction histories will have similar vectors, and so when we come to measure them we'd expect them to be close together. It's particularly effective for high-dimensional data like user-item rating matrices. We'll use a few different types of distance and similarity metrics, including some graph methods that give state of the art performance.

Once the K-nearest neighbours are identified, the model can generate recommendations in a few ways. The model can predict how the target user would rate a new item by taking the weighted average of the neighbours' ratings for that item. The ratings of closer neighbours might be given more weight. Alternatively, the model can rank items that the neighbours liked but the target user hasn't seen. The highest-ranked items (based on how many neighbours liked them or their average rating) are then recommended.

Although KNN models can work with continuous targets e.g. predict the average rating out of 5 a user might give a product, we'll be working with binary (1/0s) and implicit feedback data. Implicit feedback means our model learns from feedback that is derived indirectly from user behaviour and interactions, without the user explicitly stating their preferences. For example, we might take the fact a user clicked on a product or watched a video as them implicitly showing a preference for it rather than them explicitly telling us so by leaving a 5 star review for it. Implicit feedback is inferred from actions rather than direct statements. In his 2010 paper 'Training and testing of recommender systems on data missing not at random' Harald Steck highlighted the advantages of using implicit data with the insight that users often don't interact with items they dislike. This allows us to use the absence of interactions to infer user preferences i.e. we can infer what they do like from items they have interacted with but also what they dislike from what they have avoided.

If we solely rely on users explicitly telling us they like or don't like something then if a user hasn't given a rating for a product we can't infer any further information which severely limits our ability to learn what they dislike. In the paper Steck shows that the "absence of ratings carries useful information for improving the top-k hit rate concerning all items" which gives implicit feedback an edge when it comes to best capturing user preferences. On top of using implicit feedback, our KNN models also work best when the data is encoded in a binary fashion i.e. 1/0 to represent if a user did/didn't interact with an item.

If the EASE tutorial covered my favourite recommender algorithm, this one uses my favourite recommender package similaripy. It enables the creation of very powerful KNN models for user/items with a range of different similarity measures, including some state of the art graph methods, all wrapped up in blazingly fast Cython code.

​​​​​​Data Preparation

Let's go ahead and read in our data and prep it ready for modelling, then we can create our first KNN model to understand more about how it works under the hood. We'll be using H&M fashion data from this Kaggle competition. The data is very large so we'll only keep the columns we need and take some extra steps to shrink it down to something manageable. Some of the ID columns are stored as very long hashed string values that take up a lot of memory so we'll replace them with integer mappings using sklearn. One of the main challenges with KNN models is that they can be memory intensive so anything we can do upfront to decrease the size of our training data is helpful. 

Looking at the 'articles' data frame it looks like we've got a couple of different ways of identifying products. The 'article_id' looks like a unique identifier for a product but there's also the 'prod_name' field that looks like it's the name/description of the product although it doesn't look to be unique. We can run a count on the number of unique occurrences in both variables to see if this is the case.

To make the results a bit more interpretable it'd be useful to use the 'prod_name' for our model which would also mean we can shrink the purchase data down a bit more. The difference in the number of 'prod_name' and 'article_id' values could be because essentially the same product (so same name) in a different colour or size gets its own article_id. For our purposes, we'll take a purchase of any 'Strap top' as an interaction with that product rather than recording interactions with every unique Strap top article ID. We'll also rename 'prod_name' to 'itemID' to make it a bit more generic.
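
Below is a minimal sketch of this step. I'm assuming the standard Kaggle file names (transactions_train.csv and articles.csv) and only loading the columns we need, so adjust the paths and columns to wherever and however you've saved the data.

import pandas as pd

# Assumed Kaggle H&M file names - only load the columns we actually need
articles = pd.read_csv('articles.csv',
                       usecols=['article_id', 'prod_name', 'index_group_name'])
orders = pd.read_csv('transactions_train.csv',
                     usecols=['t_dat', 'customer_id', 'article_id'],
                     parse_dates=['t_dat'])

# prod_name isn't unique per article_id - colour/size variants share a name
print(articles['article_id'].nunique(), articles['prod_name'].nunique())

# Use the more interpretable product name as our item identifier
orders = (orders.merge(articles[['article_id', 'prod_name']], on='article_id', how='left')
                .drop(columns='article_id')
                .rename(columns={'prod_name': 'itemID'}))
print(orders.head())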

Above is a print out of our order data. We can see that we've got t_dat which is our date column, customer_id which is a very long hashed key and our newly created itemID. Our KNN models only need a userID and itemID to train from as they don't care about the order or sequence in which customers interacted with items. We'll use the t_dat column to split our data into a training, validation and test set later.

 

Finally we'll do a bit more data prep by replacing the long hashed string value of customer_id with a simple integer index. We'll also shrink our training data by removing any repeat purchases of items by customers, keeping only the first time someone bought the product type. Since we'll be working with binary interaction data i.e. 1/0s we don't need to worry about if a customer has bought something more than once. We'll also filter to only use the latest 18 months of data in the table to arbitrarily reduce the size of it whilst giving us a long enough time period to train and predict on.
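
A rough sketch of those three steps might look like the following, using sklearn's LabelEncoder for the integer mapping and an assumed 18 month cut-off taken from the latest date in the data.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Swap the long hashed customer_id strings for compact integers
orders['customer_id'] = LabelEncoder().fit_transform(orders['customer_id'])

# Keep only the first purchase of each product type per customer (binary interactions)
orders = orders.sort_values('t_dat').drop_duplicates(['customer_id', 'itemID'], keep='first')

# Keep only the most recent 18 months to shrink things down further
orders = orders[orders['t_dat'] > orders['t_dat'].max() - pd.DateOffset(months=18)]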

The original competition supplied over 2 years' worth of data and had a prediction window of just one week. For this tutorial we'll use a 6 month training window with a 6 month prediction window. This is to give most customers who are in our prediction set the chance to actually shop again and so we can measure our performance on lots of customers rather than the handful that shop in any given week. In a real world setting we'd likely choose a shorter window based on our retraining and prediction schedule i.e. retrain weekly to predict the next 7 days.

We'll actually take a chronological train, validation and test split. We'll train our initial model on the oldest 6 months of data and tune the hyperparameters to get the prediction on the next 6 months as our validation data. Finally, we'll retrain the model with the best hyperparameters from our validation period before predicting on the last 6 months of data as our test period.

 

As KNN, like all collaborative filtering models, can't handle cold-start users or items (those that are in the prediction period but not in the training period) we'll need to remove them from our prediction sets. As we later want to use the validation set as our training data we actually need two separate copies of the validation period - one with cold start customers and items removed when it's being used as our validation set and one where they're included when it's the training period for our test set. To make things a bit clearer I've given the splits below the following names that correspond to how they get used in the workflow.

  • train_val = the oldest 6 months of the data. It provides the training data for the train-validation process.

  • valid = the 6 months after train_val that are its prediction period. It will have customers and items not in train_val removed.

  • train = the same 6 months as valid but with all customers and items included. It is the data we'll train our final model on once we've found the best hyperparameters and we'll create predictions for the test set.

  • test = the latest 6 months of data available. Will be used to measure how well the final model performs. Will have customers and items that aren't in the train set removed.

Let's make these data sets and drop the original 'orders' table to free up some more memory. 
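
Something along these lines should do it. The exact cut-off logic is an assumption on my part; I'm simply splitting the 18 month window into three consecutive 6 month chunks working back from the most recent date.

import pandas as pd

# Split the 18 month window into three consecutive 6 month periods (assumed cut-offs)
end = orders['t_dat'].max()
cutoff_1, cutoff_2 = end - pd.DateOffset(months=12), end - pd.DateOffset(months=6)

train_val = orders[orders['t_dat'] <= cutoff_1].copy()                               # oldest 6 months
valid = orders[(orders['t_dat'] > cutoff_1) & (orders['t_dat'] <= cutoff_2)].copy()  # middle 6 months
train = valid.copy()                                                                 # same period, everyone kept
test = orders[orders['t_dat'] > cutoff_2].copy()                                     # latest 6 months

del orders  # free up some memory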

Even with restricting ourselves to just 6 months of training data, we still have nearly 8 million rows of data! To save our time and CPUs we can apply an additional filter that removes infrequent customers or incredibly niche products that customers are unlikely to buy. Whilst not something you would want to do in a real-world setting, as you'd lose coverage of what you can recommend and to whom, it's quite a common practice in the literature on recommenders. In a review of different recommender papers, Sun et al. (2020) found that over half the papers they analysed employed filtering of items and users for a minimum number of interactions. We'll be doing it with the aim of shrinking the data to a practical size rather than for any theoretical benefit though, so I'm applying quite an aggressive minimum filter of 20+ interactions for both items and users in the data set.
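
A simple helper for the minimum-interaction filter might look like this. The iterative loop is my own convenience, as removing items can drop some customers back under the threshold and vice versa.

def filter_min_interactions(df, min_interactions=20):
    # Iteratively keep only items and customers with at least min_interactions purchases
    # (removing items can push some customers back under the threshold, hence the loop)
    for _ in range(3):
        item_counts = df['itemID'].value_counts()
        df = df[df['itemID'].isin(item_counts[item_counts >= min_interactions].index)]
        user_counts = df['customer_id'].value_counts()
        df = df[df['customer_id'].isin(user_counts[user_counts >= min_interactions].index)]
    return df

train_val = filter_min_interactions(train_val)
train = filter_min_interactions(train)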

We can see after filtering that we still have ~2.96 million rows in our train_val data and ~1.5 million in train so the numbers still aren't tiny! Now we know which customers and items have passed our minimum interaction requirements, we can go about removing our cold start items and users from our prediction periods. We'll also run some quick stats on the different periods to check the dates, number of users and items.
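
The cold-start removal and the quick stats can be sketched like so, keeping only customers and items that appear in the corresponding training period.

# Remove cold-start customers and items from the prediction periods
valid = valid[valid['customer_id'].isin(train_val['customer_id'].unique())
              & valid['itemID'].isin(train_val['itemID'].unique())]
test = test[test['customer_id'].isin(train['customer_id'].unique())
            & test['itemID'].isin(train['itemID'].unique())]

# Quick sense-check stats for each period
for name, df in [('train_val', train_val), ('valid', valid), ('train', train), ('test', test)]:
    print(name, df['t_dat'].min().date(), df['t_dat'].max().date(),
          df['customer_id'].nunique(), df['itemID'].nunique(), len(df))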

We can see that the start and end dates of the periods are consecutive and don't overlap. It looks like we're missing a few days in our valid/train period which is likely due to stores being shut over the Christmas period. Interestingly, despite using the same 6-month window length and 20 interaction filter for train_val and train, we have a lot fewer customers and rows in the train period which suggests customers don't buy as many clothes over winter as they do in the summer!

Now we've got our data prepped and filtered, we can convert our pandas data frames into the csr_matrices that similaripy works with. The csr_matrix from scipy efficiently stores sparse matrices by storing only the non-zero elements, along with their row and column indices. This gives us significant memory savings when dealing with matrices containing mostly zeros.

 

The row and column indices are the key to working with the matrices as they allow us to access the elements and provide a mapping back to our original data. However this can get a bit tricky when switching between our train and test sets which might have different numbers of users and items in them. The easiest way round this is to create a mapping between all our users and items and then construct our csr_matrix from these. This way we ensure the mapping from our original data to the csr_matrix is consistent and we don't lose any memory as only the non-zero values are stored. Let's create some mappings and a function to create our csr matrices. 
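
Here's a sketch of the mapping and the conversion function. The variable names (train_val_matrix, valid_matrix, etc.) are the ones I'll reuse for the rest of the post.

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# One consistent index for every customer and item we've ever seen
all_users = pd.concat([train_val['customer_id'], train['customer_id'], test['customer_id']]).unique()
all_items = pd.concat([train_val['itemID'], train['itemID'], test['itemID']]).unique()
user_map = {u: idx for idx, u in enumerate(all_users)}
item_map = {i: idx for idx, i in enumerate(all_items)}

def to_csr(df):
    """Convert a user-item interaction data frame into a binary csr_matrix."""
    rows = df['customer_id'].map(user_map).values
    cols = df['itemID'].map(item_map).values
    data = np.ones(len(df), dtype=np.float32)
    return csr_matrix((data, (rows, cols)), shape=(len(user_map), len(item_map)))

train_val_matrix = to_csr(train_val)
valid_matrix = to_csr(valid)
train_matrix = to_csr(train)
test_matrix = to_csr(test)
print(train_val_matrix.shape, valid_matrix.shape)  # same shape despite different users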

We can see that even though we had different numbers of customers in train_val and valid their csr equivalents are actually the same shape. This is because we need to create rows and columns for every user and item we've ever seen to preserve the unique index mapping i.e. we know that row 0 always refers to user 0 across all matrices we create even if they didn't shop in any specific time period. It simply leaves those values 0/blank if the mapping doesn't find its respective customer or user in the data that we pass it.

We'll also define a 'precision @ k' scoring function to help us assess how well our recommendations are performing. Precision at k measures, for each customer, how many of the k predictions we made they actually went on to interact with. It's a metric commonly used in information retrieval and recommendation systems to evaluate the accuracy of a ranked list of predictions.
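
A simple (if not especially fast) implementation could look like this, scoring only users who have at least one interaction in the holdout period.

import numpy as np

def precision_at_k(recs_matrix, holdout_matrix, k=10):
    """Mean precision@k over users with at least one holdout interaction:
    the share of each user's top-k recommendations they actually went on to buy."""
    recs_csr, holdout_csr = recs_matrix.tocsr(), holdout_matrix.tocsr()
    precisions = []
    for u in range(holdout_csr.shape[0]):
        actual = set(holdout_csr.indices[holdout_csr.indptr[u]:holdout_csr.indptr[u + 1]])
        if not actual:
            continue  # nothing to predict for this user
        start, stop = recs_csr.indptr[u], recs_csr.indptr[u + 1]
        row_items, row_scores = recs_csr.indices[start:stop], recs_csr.data[start:stop]
        top_k = row_items[np.argsort(row_scores)[::-1][:k]]
        precisions.append(len(set(top_k) & actual) / k)
    return float(np.mean(precisions)) if precisions else 0.0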

Item KNN models with similaripy

 

Now we've got our data prepped, we can build our first model. In an item-based KNN model, we find items that are most similar to each item in our catalogue, based on historical user–item interactions. For a target user, we look at the items they’ve interacted with, fetch their nearest neighbours (most similar items), and then recommend those neighbours the user hasn’t seen yet.

similaripy is a high-performance Python package for computing similarity matrices on large sparse datasets. It’s built in Cython with multi-threaded computation via OpenMP, which makes it much faster than pure Python implementations. It also supports a variety of classic similarity measures (cosine, Jaccard, etc.) and graph-based methods (P3α, RP3β) that often give state-of-the-art performance. It allows us to specify lots of different KNN models along with their hyperparameters in a very standardised way which makes building, tuning and testing lots of models very easy.


First, we define our model by picking a similarity function and its parameters. We compute the item–item similarity matrix using the transposed user–item interaction matrix. Remember that in our user–item matrix, rows = users, columns = items. When we want item–item similarity, each item needs to be represented as a row vector of its interactions across all users i.e. each row = one item and each column = a user that has interacted with it. This means we can calculate how similar items are to each other based on their vectors of which customers interacted with them. If we wanted to do user-user similarity the only difference would be to leave our original matrix un-transposed.

 

To generate predictions we simply multiply the user–item matrix by the item–item similarity matrix. We can also use a similaripy helper function to filter out items the user has already interacted with (filter_cols=) so we don’t recommend repeats. Most similarity functions in similaripy share the following hyperparameters:

  • k: number of top neighbours to keep per row (for item–item similarity, this is the maximum number of similar items stored per item).

  • shrink: shrinkage regularization term to reduce the impact of items with few co-occurrences. Higher shrinkage pushes similarities toward zero when data is sparse.

  • threshold: minimum similarity value to retain. Similaripy discards similarities below this value by setting them to zero. It essentially gives us another way to remove top-k neighbours that might not actually be that good for calculating recommendations from.

  • binary: binarizes the input matrix before computing similarities. Important for implicit data if you want to treat multiple interactions as a single “1”. We've already taken care of this with our data prep so can ignore it for now.

  • format_output: 'coo' or 'csr' for sparse output format. On Windows, only 'coo' is supported so I'll be doing a lot of csr conversions that you might not need.

Let's run through each of the different similarity measures that similaripy offers now. We'll set some global parameters for shrink and k so our model results are comparable.

Dot Product

Our first similarity measure isn't strictly a similarity measure but can provide a useful benchmark for other models. The dot product calculates the raw inner product between two item vectors. For our binary data, this is just the count of users who interacted with both items i.e. a co-occurrence matrix. It’s simple, fast, but biased toward popular items because they tend to have larger dot products. For fashion, a popularity based model could work quite well so let's give it a try.

As discussed, we construct our “model,” which is really just a learnt item–item similarity matrix (or in this case a co-occurrence matrix), by calling sim.dot_product() on our transposed user–item interaction matrix. We tell similaripy how many K nearest neighbours we want to return the similarities of, with the option to also apply shrinkage.

 

To make predictions, we simply take the dot product of our new item–item matrix (with its top-K scores per item) against the original user–item interaction matrix. We then tell similaripy the number k of predictions we want for each user. Finally, we can use the filter_cols= option to pass in our original matrix so that similaripy filters out any items the user has already purchased from the recommendations. Note that the prediction dot_product() needs the similarity matrix transposed so the multiplication lines up to produce user–item scores. Similaripy's dot_product() expects rows in the second argument to represent the items we're multiplying against. By transposing model_dot_prod, we're aligning it so that each column of the transposed similarity matrix corresponds to the item we're predicting for.
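
Putting the two steps together, a minimal version of the dot product model looks something like the below. K=5 matches the value the post uses for these initial comparisons, shrink=0 is my assumption, and I'm converting the 'coo' output to csr with .tocsr() rather than relying on format_output.

import similaripy as sim

K_GLOBAL, SHRINK_GLOBAL = 5, 0  # shared settings so the different measures are comparable

# "Training" = building the item-item co-occurrence matrix from the transposed interactions
model_dot_prod = sim.dot_product(train_val_matrix.T, k=K_GLOBAL, shrink=SHRINK_GLOBAL)

# Prediction = user-item interactions x transposed item-item matrix, filtering seen items
recs = sim.dot_product(train_val_matrix, model_dot_prod.T, k=10,
                       filter_cols=train_val_matrix).tocsr()
print(precision_at_k(recs, valid_matrix, k=10))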

Cosine Similarity

Measures the cosine of the angle between two vectors, i.e. the user interaction vectors of the items being compared. It normalises by vector length, so it's less biased toward popular items than the dot product. For binary implicit data, this becomes the co-occurrence count divided by the geometric mean of the individual item interaction counts.

Jaccard Similarity

Intersection (how many elements are in both sets) over union (how many unique elements are in either set). It usually works well with binary data as Jaccard focuses on the proportion of shared users rather than their counts. It tends to favour niche items with concentrated overlap.

Dice Similarity

Dice divides twice the intersection (how many elements are in both sets) by the sum of the set sizes. It's similar to Jaccard but the intersection is weighted twice and the denominator is the sum of the set sizes instead of the union size. This makes Dice a little more forgiving for smaller sets — if two niche items have a few users in common, Dice will score them higher than Jaccard. This can make it handy for finding similarity between less-popular items where a small overlap is still meaningful.
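
Since cosine, Jaccard and Dice all share the same call signature in similaripy, we can sketch them in one go, reusing the global settings and prediction step from before.

import similaripy as sim

models = {
    'cosine': sim.cosine(train_val_matrix.T, k=K_GLOBAL, shrink=SHRINK_GLOBAL),
    'jaccard': sim.jaccard(train_val_matrix.T, k=K_GLOBAL, shrink=SHRINK_GLOBAL),
    'dice': sim.dice(train_val_matrix.T, k=K_GLOBAL, shrink=SHRINK_GLOBAL),
}
for name, model in models.items():
    recs = sim.dot_product(train_val_matrix, model.T, k=10,
                           filter_cols=train_val_matrix).tocsr()
    print(name, precision_at_k(recs, valid_matrix, k=10))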

Tversky Index

The Tversky Index is a sort of customisable version of Jaccard and Dice. It has two parameters, alpha and beta, which control our preference for false positives vs false negatives. For example, let's imagine we're comparing two sets, A and B, in a context where we're trying to predict B from A.

  • True Positive (TP): Items present in both sets. This is the intersection, ∣A∩B∣.

  • False Positive (FP): Items in set A but not in B. This is the set difference, ∣A−B∣. These are the "extra" items in our prediction (A) that aren't in the ground truth (B).

  • False Negative (FN): Items in set B but not in A. This is the set difference, ∣B−A∣. These are the items you "missed" in our prediction (A) that were present in the ground truth (B).

The values of alpha and beta allow us to bias the similarity score to favour one type of error over the other. Alpha controls the penalty for false positives (FP). A larger alpha penalises false positives more heavily, which is useful when you want to ensure that our predicted set (A) doesn't contain too many items that aren't in the target set (B). This is a good way to improve precision. Beta controls the penalty for false negatives (FN). A larger beta penalises false negatives more heavily. This is beneficial when it's critical to capture all the items in the target set (B), even if it means including some extra items in our predicted set (A). This is a good way to improve recall.

This means Tversky can bias similarity:

  • If α < β, you favour cases where B’s users are almost entirely in A.

  • If α > β, you favour cases where A’s users are almost entirely in B.

It can be good for asymmetric similarity scenarios e.g., “people who bought niche item A also bought popular item B,” but not necessarily the reverse. The Tversky Index generalises Jaccard and Dice, as setting specific values of alpha and beta makes it equivalent to either one.

  • α = β = 1 → Jaccard

  • α = β = 0.5 → Dice
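
A Tversky model follows the same pattern as before; the alpha and beta values below are just illustrative placeholders.

import similaripy as sim

# alpha = beta = 1 recovers Jaccard, alpha = beta = 0.5 recovers Dice; 0.8/1.2 are placeholders
model_tversky = sim.tversky(train_val_matrix.T, alpha=0.8, beta=1.2,
                            k=K_GLOBAL, shrink=SHRINK_GLOBAL)
recs = sim.dot_product(train_val_matrix, model_tversky.T, k=10,
                       filter_cols=train_val_matrix).tocsr()
print(precision_at_k(recs, valid_matrix, k=10))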

S-Plus

S-plus is a hybrid similarity measure that blends two different approaches:

  1. Tversky similarity (a generalization of Jaccard/Dice — intersection over a weighted union)

  2. Cosine similarity (dot product of normalized vectors, measuring angular closeness)

It combines them with a tunable balance parameter l (lambda) and extra shape/weight parameters for each component.

Parameters:

  • l (λ): balance between Tversky and Cosine parts

  • t1, t2: α and β for the Tversky component

  • c: exponent for Cosine weighting
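
In similaripy this is, to the best of my knowledge, the s_plus function, with l, t1, t2 and c as its extra parameters; the values below are only placeholders.

import similaripy as sim

# l balances the Tversky and Cosine parts, t1/t2 are the Tversky alpha/beta, c the cosine exponent
model_splus = sim.s_plus(train_val_matrix.T, l=0.5, t1=1.0, t2=1.0, c=0.5,
                         k=K_GLOBAL, shrink=SHRINK_GLOBAL)
recs = sim.dot_product(train_val_matrix, model_splus.T, k=10,
                       filter_cols=train_val_matrix).tocsr()
print(precision_at_k(recs, valid_matrix, k=10))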

P3α

P3α is the first of the graph-based methods. Instead of only looking at direct co-occurrence between items (like Cosine or Jaccard), it models the recommendation problem as a random walk on the user–item bipartite graph. One set of nodes = users and the other set = items. The edges = interactions (e.g., a purchase, rating, or click). It then performs a 3-step random walk:

  • Step 1: Start at an item node.

  • Step 2: Move to all users who interacted with that item.

  • Step 3: Move to other items those users interacted with.

 

This "item → user → item" process connects the starting item to related items through shared users, but also captures indirect connections (items connected through different users). Each row of the adjacency matrix is normalized to sum to 1 (L1 normalisation) so it behaves like a probability distribution. The alpha parameter controls the bias toward strong vs. weak connections:

  • Lower α (e.g., 0.3) → smooths probabilities, giving weaker links more weight (more diversity).

  • Higher α (e.g., 0.9) → sharpens probabilities, focusing on the strongest item–item connections (more accuracy but less diversity).

The graph-based method is powerful as it captures higher-order co-occurrence. If two items are not directly related but are connected through a series of users and other items, P3α can still link them. For example, if two niche books are never bought together but are both bought by users who also buy a certain popular author, P3α can connect them indirectly. A pure cosine similarity might miss this because it only looks at direct overlaps. Direct similarity measures also tend to recommend very popular, obvious items. P3α’s multi-hop nature brings in related but less obvious items, improving recommendation variety.

RP3β

RP3β is a direct extension of P3α. It keeps the same 3-step random walk on the user–item graph, but adds a popularity penalisation term to reduce bias toward the most interacted-with items.

  • β = 0: No popularity penalty (identical to P3α)

  • β = 0.5: Moderate penalty; popular items are dampened but not excluded

  • β = 1: Strong penalty; heavily favors niche, long-tail items

Many recommenders over-recommend blockbuster items. RP3β fixes this by giving smaller, less popular items a chance to appear. Anelli et al. (2022) found it provided state of the art performance, often only outperformed by EASE in their experiments. The alpha and beta parameters also let us control the trade-off between different aspects of the recommendation task. For example, alpha controls precision vs. diversity whilst β controls popularity bias vs. novelty.
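
Both graph-based models slot into the same pattern; the alpha and beta values here are again just illustrative starting points.

import similaripy as sim

model_p3a = sim.p3alpha(train_val_matrix.T, alpha=0.7, k=K_GLOBAL, shrink=SHRINK_GLOBAL)
model_rp3b = sim.rp3beta(train_val_matrix.T, alpha=0.7, beta=0.3, k=K_GLOBAL, shrink=SHRINK_GLOBAL)

for name, model in [('p3alpha', model_p3a), ('rp3beta', model_rp3b)]:
    recs = sim.dot_product(train_val_matrix, model.T, k=10,
                           filter_cols=train_val_matrix).tocsr()
    print(name, precision_at_k(recs, valid_matrix, k=10))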

It looks like out of the 8 methods used the original dot product performed best! Recommending popular items is likely to be a good strategy for fashion recommendations, however we've seen that a lot of the metrics have quite a high number of parameters that need tuning. There is also one fundamental hyperparameter in KNN that for now we have ignored - the actual value of K nearest neighbours we ask the model to use when creating predictions. For now we've just been using a value of 5 i.e. telling similaripy to create recommendations using the 5 most similar items to each of the ones a user has bought before. It might be that this is too few or too many for some of the similarity measures. Let's now turn to how we can tune this K parameter, along with all the others, to try and create even better models.

​Hyperparameter tuning

Unlike EASE, which only has a single hyperparameter, most Similaripy models have two or more hyperparameters that can influence performance in different ways. This makes hyperparameter tuning more complex, but also more important.

 

The core parameters that appear in most Similaripy KNN-based methods are:

  • k — The number of nearest neighbours to keep when computing the similarity matrix. Small values focus on the most similar items but may miss useful weaker connections. Large values can increase recall but may introduce noise.

  • shrink — A regularization term applied to similarity values. A small shrink means similarities are computed almost purely from the data, which can overfit to rare interactions. A large shrink smooths similarity scores toward zero, reducing overfitting but possibly losing nuance.

 

In addition, some models have method-specific hyperparameters:

  • P3α and RP3β:

    • α controls the sharpness of probability transitions (low α → smoother, high α → peakier).

    • β (RP3β only) penalizes popular items to encourage long-tail recommendations.

  • Tversky:

    • α and β control asymmetry in set overlap weighting.

  • S-Plus:

    • l, t1, t2, c blend Tversky-like and Cosine-like components.

 

The challenge is that these parameters interact — e.g., a high k might require a higher shrink to avoid overfitting, or lowering α in P3α could pair well with a stronger popularity penalty in RP3β. The good news is that these models are usually very fast to train so we can try many combinations of parameters.  The code below uses Optuna to explore a wide hyperparameter space for all 8 models. Instead of manually testing k, shrink, α, β, etc., we let Optuna systematically search the space and pick the combinations that maximize validation performance. The goal is to find the "sweet spot" where:

  • The model is complex enough to capture meaningful higher-order or indirect item–item relationships.

  • But regularized enough to avoid overfitting to noise or rare co-occurrences.

By starting Optuna’s search from different magnitudes (e.g., small vs large k, strong vs weak shrink), we give it a better chance to find the optimal balance for each model.

The objective function defines the "experiment" for a single trial. It implements conditional hyperparameter optimisation using Optuna as not all the models have the same hyperparameters to tune. The universal parameters are suggested first: the common parameters used by all models (K for the number of neighbours and shrink_reg for regularization). Next we select which similarity measure to use from the 8 different options. Depending on which is picked, the conditional hyperparameters and their search space are defined. This conditional approach is highly efficient because:

  • It avoids suggesting irrelevant parameters for each algorithm

  • Reduces the search space dimensionality

  • Allows each algorithm to be tuned with its specific parameter ranges

  • Optuna's tree-structured Parzen estimator can learn which algorithms perform better and focus sampling accordingly

Optuna uses sophisticated algorithms to intelligently choose what hyperparameter combinations to use in each trial, aiming to find the best one more efficiently than a simple grid search. The optuna.create_study(direction='maximize') creates an Optuna "study," which is essentially a container for all the trials. The direction='maximize' tells Optuna that our goal is to find the set of hyperparameters that results in the highest possible value for the returned score from the objective function.  The study.optimize(objective, n_trials=100) starts the optimisation process. This tells Optuna to call the objective function 100 times so feel free to set this to something smaller or larger depending on how quickly your earlier models ran. In each trial Optuna will:

  1. Propose a set of hyperparameters

  2. Execute the objective function with those parameters i.e. train our model and measure its performance

  3. Record the returned score.

  4. Based on the results of previous trials, Optuna's sampler (by default, the Tree-structured Parzen Estimator, TPE) tries to intelligently choose the next set of hyperparameters to try. The idea is that it explores promising regions of the hyperparameter space more thoroughly and moves away from values that led to poor performance, rather than sampling blindly like a random or brute force grid search.
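
Here's a condensed sketch of what that conditional objective can look like for the 8 measures we've used. The search ranges are my own assumptions; the parameter names match the best-trial output discussed below.

import optuna
import similaripy as sim

def objective(trial):
    # Universal parameters shared by every model
    k = trial.suggest_int('K', 5, 200)
    shrink = trial.suggest_int('shrink_reg', 0, 100)
    algo = trial.suggest_categorical('algorithm', ['dot_product', 'cosine', 'jaccard', 'dice',
                                                   'tversky', 's_plus', 'p3alpha', 'rp3beta'])

    # Conditional, model-specific parameters (only suggested when relevant)
    if algo == 'tversky':
        model = sim.tversky(train_val_matrix.T, k=k, shrink=shrink,
                            alpha=trial.suggest_float('tversky_alpha', 0.0, 2.0),
                            beta=trial.suggest_float('tversky_beta', 0.0, 2.0))
    elif algo == 's_plus':
        model = sim.s_plus(train_val_matrix.T, k=k, shrink=shrink,
                           l=trial.suggest_float('splus_l', 0.0, 1.0),
                           t1=trial.suggest_float('splus_t1', 0.0, 2.0),
                           t2=trial.suggest_float('splus_t2', 0.0, 2.0),
                           c=trial.suggest_float('splus_c', 0.0, 1.0))
    elif algo == 'p3alpha':
        model = sim.p3alpha(train_val_matrix.T, k=k, shrink=shrink,
                            alpha=trial.suggest_float('p3_alpha', 0.1, 2.0))
    elif algo == 'rp3beta':
        model = sim.rp3beta(train_val_matrix.T, k=k, shrink=shrink,
                            alpha=trial.suggest_float('rp3_alpha', 0.1, 2.0),
                            beta=trial.suggest_float('rp3_beta', 0.0, 1.0))
    else:
        model = getattr(sim, algo)(train_val_matrix.T, k=k, shrink=shrink)

    # Score the trial's model on the validation period
    recs = sim.dot_product(train_val_matrix, model.T, k=10,
                           filter_cols=train_val_matrix).tocsr()
    return precision_at_k(recs, valid_matrix, k=10)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(study.best_value, study.best_params)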

 

Once it's finished running, we can extract the best value of our target metric and the hyperparameter combination that led to it from our study object. In our case, it looks like the best precision@10 was 0.01689 which came from s_plus with hyperparameters of {'K': 95, 'shrink_reg': 100, 'splus_l': 0.1, 'splus_t1': 0.9, 'splus_t2': 1.0, 'splus_c': 0.0}. This is a lot better than the 0.012681 our original dot product achieved. We can break down what the different hyperparameter values tell us about our model and our data:

K: 95 - Very High Neighbourhood Size

  • Uses 95 nearest neighbours for recommendations

  • This is quite high, suggesting the model benefits from considering many similar items

  • May help capture diverse similarity patterns but could introduce noise

shrink_reg: 100 - Maximum Shrinkage Regularization

  • Applies the strongest possible regularization (100 out of 100)

  • Heavily penalises items with few interactions

  • This aggressive shrinkage suggests our dataset is sparse and needs strong regularization to avoid overfitting to rare items

splus_l: 0.1 - Heavy Tversky Weighting

  • λ = 0.1 means the final similarity is 10% Tversky + (1-0.1) × c × Cosine

  • With splus_c: 0.0, this becomes purely Tversky-based

  • Very low lambda suggests Tversky similarity works better than Cosine for our data

splus_t1: 0.9, splus_t2: 1.0 - Asymmetric Tversky Parameters

  • t1 (α) = 0.9: Slightly reduced penalty for false positives (items in set A but not B)

  • t2 (β) = 1.0: Full penalty for false negatives (items in set B but not A)

  • This asymmetry suggests the model is more tolerant of missing common items than of including items that aren't truly similar

splus_c: 0.0 - No Cosine Component

  • The cosine similarity component is completely eliminated

  • The model effectively becomes: S-Plus = 0.1 × Tversky(0.9, 1.0)

  • This suggests cosine similarity hurts performance on our dataset

 

In summary, our optimal S-Plus model is essentially a heavily regularized, asymmetric Tversky similarity with:

  • Sparse data handling: Maximum shrinkage suggests very sparse interactions

  • Conservative similarity: the full-weight β penalises false negatives slightly more than the α-weighted false positives

  • Broad neighbourhood: Large K suggests the model needs many neighbours to make good predictions

  • Pure Tversky approach: Cosine similarity is detrimental and completely removed

This configuration suggests our recommendation task benefits from a conservative, highly regularised approach that's very careful about false similarities, likely due to the sparse user-item interactions in our dataset.

 

One concern with such detailed hyperparameter tuning is that potentially our model is now too well optimised for the validation set. We can test this by retraining our final model on our separate train and test set to see if performance is comparable even when using a completely distinct time period.
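
Assuming s_plus came out on top with the parameters above, the retrain-and-test step is just a repeat of the earlier pattern on the later data split.

import similaripy as sim

# Retrain the best configuration from the validation search on the later 'train' period
best_model = sim.s_plus(train_matrix.T, k=95, shrink=100, l=0.1, t1=0.9, t2=1.0, c=0.0)
test_recs = sim.dot_product(train_matrix, best_model.T, k=10,
                            filter_cols=train_matrix).tocsr()
print(precision_at_k(test_recs, test_matrix, k=10))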

So on our new time period the model actually performs even better! Although it's always nice to see a performance increase on the test set, ideally we want it to be similar to our validation set to give us confidence that our validation process is robust. We saw earlier that there were fewer customers in train compared to train_val due to the different time periods covered by the data. It might well be that the train-test split time periods are easier for the model to predict due to seasonal changes.

Item-item recommendations and predicting for new(ish) users

Unlike collaborative filtering models that learn separate user and item representations, such as matrix factorization (an example of which is available in the tutorial on ALS), item KNN models focus exclusively on learning item-to-item relationships. The resulting item-item similarity matrix contains valuable information that extends beyond making recommendations. For example, we can analyse the matrix to understand relationships between products.

Remember that our item KNN model creates recommendations by finding the K most similar items to our target ones. similaripy implements this by calculating all item-item similarities and then zeroing out any similarities outside of the top k most similar. This is great for making predictions as we're not including any low or poor similarities in our recommendations. However, if we want to understand the structure of the relationships between all products in our data, we need to create our item-item similarity matrix to include the similarity scores between all products. We can do this by simply passing in a value of K equal to the number of products in our data. The below example shows the difference between using our best model from Optuna with its K=95 vs the same model but passing in a K = number of items in our data.
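
Sketching that comparison, with the full matrix converted to csr so we can slice its rows and columns later:

import similaripy as sim

n_items = train_matrix.shape[1]

# Top-95 version used for recommendations vs an "all items" version used for analysis
best_model = sim.s_plus(train_matrix.T, k=95, shrink=100, l=0.1, t1=0.9, t2=1.0, c=0.0)
full_similarities = sim.s_plus(train_matrix.T, k=n_items, shrink=100,
                               l=0.1, t1=0.9, t2=1.0, c=0.0).tocsr()

print(repr(best_model))         # ~1.49m stored elements (15,653 items x 95 neighbours)
print(repr(full_similarities))  # ~245m stored elements (15,653 x 15,653)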

We can see from the output that our 'best model' with its k=95 constraint has 1,487,035 stored elements i.e. values that are not 0. This is exactly what we'd expect as 15,653 x 95 = 1,487,035 i.e. the number of items in our data x the number of K similar items. These top 95 are what allow our model to have the best predictive power in terms of recommending new items to users. We can see in our second output, where we told similaripy to use all items by setting K = the number of items in our data, that some of the previously 0 values are now populated and we have ~245m stored elements, which is 15,653 x 15,653. We can now use this larger similarity matrix between all items to get our item-item similarities.

Most similarity measures (e.g. Cosine, Jaccard, and Dice) produce symmetric matrices, where the similarity between item A and item B equals the similarity between item B and item A. However, some measures (e.g. Tversky) can create asymmetric matrices, which makes the interpretation of matrix rows versus columns important for understanding what relationships each captures. Let's introduce a bit of notation to (hopefully) make keeping track of what's being referred to a bit easier.

  • B will be our item-item similarity matrix learnt during the modelling process.

  • i will be the row index. This means we are looking at the row associated with item i in our matrix B. This row, B[i,:], describes the influence that item i has on all other items. If a user has interacted with item i, this row tells us which other items they are most likely to interact with next. We can think of this as the "successors" or "recommendations" from item i. To find the items most strongly associated with a given item i, we can look for the largest values in the i-th row of B. This is the classic "People who liked this also liked that" scenario.

  • j is the column index. This means we are looking at the column associated with item j in our matrix B. This column, B[:,j]​, describes the influence that all other items have on item j. This answers the question: "Which items, if liked by a user, most strongly predict that they will like item j?" We can think of this as the "precursors" or "defining items" for item j. To understand which items act as the strongest predictors for item j, we would examine the largest values in the j-th column. 

 

For symmetric metrics like Cosine, Jaccard, and Dice, the matrix has identical values at positions (i,j) and (j,i). In these cases rows and columns are interchangeable: The row for item i contains the same values as the column for item i. This means there is a single interpretation of the item-item relationship, whether we examine a row or column, we're looking at the same similarity relationships. For these measures, the distinction between "recommendations" and "embeddings" doesn't apply as they're the same.

For asymmetric metrics like Tversky (when α ≠ β), the matrix values differ across the diagonal, creating distinct meanings for rows and columns. Rows represent recommendations, looking at row i shows which items are most similar to item i, answering "users who liked this item also liked what?" Whereas columns represent embeddings, looking at column j shows which items are most similar with item j as the target, answering "which items best predict that users will like item j?"

We can use these attributes of the item-item weight matrix to create a 'similar items' lookup function. Using our similar items matrix with all the calculated similarities, we can pass in a product and find what the top recommended items for it would be i.e. find the top values in the row, i, of the matrix B.
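
A possible version of that lookup, using a hypothetical inverse mapping back from column indices to product names:

import numpy as np

# Hypothetical reverse lookup from column index back to product name
inv_item_map = {idx: item for item, idx in item_map.items()}

def similar_items(item_name, similarity_matrix, top_n=5):
    """Return the top_n most similar items to item_name, i.e. the biggest values in row i of B."""
    i = item_map[item_name]
    row = similarity_matrix.getrow(i).toarray().ravel()
    row[i] = 0  # don't return the item itself
    top_idx = np.argsort(row)[::-1][:top_n]
    return [(inv_item_map[j], round(float(row[j]), 4)) for j in top_idx]

print(similar_items('Isa Cardigan', full_similarities, top_n=5))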

We can see that for our candidate product 'Isa Cardigan' the top 5 recommended items are other types of cardigan which makes sense. What we're essentially doing here is passing our product as if it were a user with just 1 interacted item and using our item-item matrix to make predictions for them (whilst considering all possible items). We can extend this concept, that our item is the equivalent of a user with only 1 item interaction, to understand how we might be able to use our previously trained item-item weight matrix to make predictions for new customers.

 

Unlike matrix factorisation models that learn embeddings for specific users and so need to be retrained when new users enter the data, our item KNN model just takes in a user-item interaction row to make predictions. The actual user ID doesn't feature in the model training or prediction. Treating a user as just a collection of product interactions allows us to create predictions for new users who have interacted with a few items already without having to retrain the entire model like we would for a matrix factorisation model. The below function creates a dummy user-interaction history for an arbitrary set of products and then uses our previously trained best model to generate recommendations for them. It effectively extends the similar item function but now passes in a set of products rather than one at a time and reapplies the K=95 constraints learnt by our model.
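
A sketch of that idea, building a one-row interaction matrix for a hypothetical basket and scoring it against the trained item-item matrix (the product names are just the examples used earlier):

import numpy as np
import similaripy as sim
from scipy.sparse import csr_matrix

def recommend_for_basket(item_names, similarity_matrix, top_n=10):
    """Score a hypothetical new user who has only interacted with item_names."""
    cols = [item_map[name] for name in item_names]
    dummy_user = csr_matrix((np.ones(len(cols)), ([0] * len(cols), cols)),
                            shape=(1, len(item_map)))
    # Same prediction step as before, re-applying the model's learnt top-K structure
    recs = sim.dot_product(dummy_user, similarity_matrix.T, k=top_n,
                           filter_cols=dummy_user).tocsr()
    ranked = recs.indices[np.argsort(recs.data)[::-1]]
    return [inv_item_map[j] for j in ranked]

print(recommend_for_basket(['Isa Cardigan', 'Strap top'], best_model.tocsr(), top_n=10))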

The ability of our item KNN to generate recommendations agnostic of a userID is actually a very helpful property. We could even go as far as to save our item-item weight matrix as a static object and then use it to generate predictions for new users or sets of item interactions on an ongoing basis. The only time we'd need to retrain our model (apart from wanting to keep it up-to-date on the latest data) is when new products are launched.

Visualising item-item relationships with UMAP

In the previous examples we were using the rows of our item-item similarity matrix to understand how buying a product or collection of products determines what other products we'd recommend. Now we can use the columns of our matrix to understand which items have a similar composition to each other. The core idea is to treat each column of our item-item matrix as a high-dimensional vector representing an item's "embedding." By using the column of the matrix this time, rather than the row, each vector is a collection of influence scores onto the item. This means the vector essentially acts as a "fingerprint" for the item, defined by which other items are strong precursors or predictors for it. Items that share similar predictive patterns will have similar column vectors, meaning they are likely to be related in the model's eyes.

While our item-item similarity matrix is great for generating recommendations, its high dimensionality (one dimension for every item) can make it difficult to directly interpret. Visualizing this matrix can provide insights into how the model perceives the relationships between different items. One way to do this is by using a dimensionality reduction technique like UMAP, which stands for Uniform Manifold Approximation and Projection.

 

UMAP is a dimensionality reduction technique that first builds a "fuzzy simplicial complex" to represent the high-dimensional data. This complex is essentially a weighted graph where each data point is a node and the edges connect neighbouring points. The weight of an edge represents the "likelihood" that the two points are connected. A key aspect of this step is that UMAP uses a variable radius for each point, determined by the distance to its nearest neighbours. This allows the algorithm to adapt to varying densities in the data, giving it a powerful ability to preserve local structure. After the high-dimensional graph is constructed, UMAP tries to find a low-dimensional graph (in 2D or 3D) that is as similar as possible to the original high-dimensional graph. It does this by using a form of optimization, similar to a force-directed graph layout algorithm. It tries to "pull" the connected points together and "push" the unconnected points apart, all while preserving the relationships established in the first step.

The reason for running on the columns, rather than the rows, is that this view is useful for understanding how our item KNN model stores its internal definition of an item. For example, if two different types of T-shirts are both frequently purchased by users who have previously bought "jeans" and "trainers," their column vectors will be similar, and they will appear close together on the UMAP plot. This is a robust way to group items based on their shared context or dependencies within the model. If we wanted to we could also run UMAP on the rows of B which map the proximity of items that lead to similar recommendations i.e. items that have a similar "successor profile." For now, let's run our UMAP on the columns of B to visualise what products have a similar make up.

First, we initialise the UMAP algorithm, configuring it to map our high-dimensional item embeddings down to two dimensions suitable for a scatter plot. We get the item embeddings by taking the transpose of our full item-item similarity matrix, i.e. the columns of the matrix, which provide the unique "fingerprint" for each item. After running UMAP, we merge it with our product metadata to link each point to its index_group_name (e.g., 'Menswear', 'Ladieswear'). Finally, we use matplotlib to generate the scatter plot, where each point is an item, its position is determined by UMAP, and its colour corresponds to its product department. This plot allows us to visually inspect the clusters and confirm whether our item KNN model has successfully learned to group similar types of products together based on their shared purchasing contexts.
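
A sketch of that workflow using the umap-learn package; the metric choice and plotting details (figure size, point size, colour handling) are my own assumptions.

import matplotlib.pyplot as plt
import pandas as pd
import umap

# Each column of the full item-item matrix is an item's "fingerprint", so transpose
# it to get one row per item before reducing to two dimensions
reducer = umap.UMAP(n_components=2, metric='cosine', random_state=42)
coords = reducer.fit_transform(full_similarities.T)

plot_df = pd.DataFrame(coords, columns=['x', 'y'])
plot_df['itemID'] = [inv_item_map[i] for i in range(len(inv_item_map))]
plot_df = plot_df.merge(articles[['prod_name', 'index_group_name']].drop_duplicates('prod_name'),
                        left_on='itemID', right_on='prod_name', how='left')

fig, ax = plt.subplots(figsize=(10, 8))
for group, grp in plot_df.groupby('index_group_name'):
    ax.scatter(grp['x'], grp['y'], s=2, label=group)
ax.legend(markerscale=5)
ax.set_title('Item KNN similarity columns projected with UMAP')
plt.show()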

[Image: Item KNN UMAP.png]

The UMAP code brings our conceptual understanding to life by creating a 2D visualization of the item relationships learned by our model. We can see that a lot of the items overlap but a couple of interesting groupings stand out. For example, Sport products seem to form their own grey cluster at the top as well as a few red Baby/Children product clusters. Menswear also seems to form its own cluster with a few grey Sport products in the bottom right.

User KNN

 

Up until now we've been focussing on the item KNN models but we can easily apply everything we've learnt to user KNN models too. One of the nice properties of the user–item interaction matrix is that item KNN and user KNN are essentially the same algorithm, just with the roles swapped.

  • Item–KNN: we transpose the matrix so each row is an item vector (columns = users).

  • User–KNN: we leave the matrix as–is so each row is a user vector (columns = items).

This means all the similarity functions and hyperparameter tuning strategies we applied for items can be applied directly to users — we’re just measuring user–user similarity instead of item–item similarity.​


For example, remember our initial item KNN model just using the dot_product function? Well here it is calculating a user KNN model.
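
A minimal user KNN version of that first dot product model might look like this; note the prediction step multiplies the user-user similarities by the (un-transposed) interaction matrix.

import similaripy as sim

# User-user similarity: leave the user-item matrix un-transposed so each row is a user
user_model = sim.dot_product(train_val_matrix, k=K_GLOBAL, shrink=SHRINK_GLOBAL)
print(user_model.shape)  # (n_users, n_users)

# Predictions = user-user similarities x user-item interactions, filtering seen items
user_recs = sim.dot_product(user_model, train_val_matrix, k=10,
                            filter_cols=train_val_matrix).tocsr()
print(precision_at_k(user_recs, valid_matrix, k=10))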

The only difference is we start with our user-item interaction matrix in its original orientation (rows = users, columns = items). We then calculate our user-user similarity matrix by comparing each user’s interaction vector to every other user’s vector using a chosen similarity metric such as cosine, Jaccard, dot product, or one of the more advanced graph-based measures. The result is a square matrix (user x user in shape) where each entry represents the similarity between two users. In practice, we keep only the top K most similar users for each target user, discarding the rest to keep the model focused and efficient.

We can see that once again our similarity matrix is square but this time its dimensions are the number of users in our data set. This can be one of the main challenges of user KNN models as typically there are a lot more users than items so the matrix and the similarity computations can become quite large.

Once the nearest neighbours are identified, the recommender predicts scores for all items the target user has not yet seen. For each candidate item, it looks at which of the similar users have interacted with it and then aggregates those interactions, weighting each one by the similarity between the target user and that neighbour. In effect, items that are liked by many highly similar users will get higher scores, while items liked only by distant neighbours will rank lower.

Before producing the final recommendations, the model filters out any items the user has already interacted with (using the filter_cols in similaripy), ensuring that only novel items are suggested. The remaining items are then ranked in descending order of their predicted scores, and the top N become the recommendations. Conceptually, the process can be thought of as finding “people like you” and then asking, “What have they enjoyed that you haven’t experienced yet?” — a simple idea that can still be surprisingly powerful in practice.

For hyperparameter tuning, we can run exactly the same Optuna search we used for item–KNN, just swapping in the untransposed train_val_matrix. The objective function below tries multiple similarity algorithms and parameter combinations to find the best performing user–KNN recommender:
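
A trimmed-down sketch of that search, reusing the precision function and matrices from before with assumed search ranges (I've left out the Tversky and S-Plus branches for brevity; they'd be added exactly as in the item version):

import optuna
import similaripy as sim

def user_objective(trial):
    # Same conditional idea as the item KNN objective, just on the un-transposed matrix
    k = trial.suggest_int('K', 5, 200)
    shrink = trial.suggest_int('shrink_reg', 0, 100)
    algo = trial.suggest_categorical('algorithm', ['dot_product', 'cosine', 'jaccard',
                                                   'dice', 'p3alpha', 'rp3beta'])
    if algo == 'p3alpha':
        model = sim.p3alpha(train_val_matrix, k=k, shrink=shrink,
                            alpha=trial.suggest_float('p3_alpha', 0.1, 2.0))
    elif algo == 'rp3beta':
        model = sim.rp3beta(train_val_matrix, k=k, shrink=shrink,
                            alpha=trial.suggest_float('rp3_alpha', 0.1, 2.0),
                            beta=trial.suggest_float('rp3_beta', 0.0, 1.0))
    else:
        model = getattr(sim, algo)(train_val_matrix, k=k, shrink=shrink)

    recs = sim.dot_product(model, train_val_matrix, k=10,
                           filter_cols=train_val_matrix).tocsr()
    return precision_at_k(recs, valid_matrix, k=10)

user_study = optuna.create_study(direction='maximize')
user_study.optimize(user_objective, n_trials=100)
print(user_study.best_value, user_study.best_params)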

Switching between item KNN and user KNN is as simple as choosing whether to transpose the user–item interaction matrix before computing similarities. The rest i.e. similarity functions, shrinkage, k–nearest neighbours, hyperparameter tuning, etc. stays exactly the same. Often the choice comes down to practicalities. For example, item KNN models can be quicker to train as usually there are fewer items or fewer new items each week than there are users. This can make it easier to train and cache models. 

For now, let's calculate the performance of our final user KNN model on the test set and see how it compares to the item KNN model we created earlier.

It looks like out of the two approaches the user KNN model works best, with a precision@10 of 0.025 compared to the item KNN model's 0.023.

Summary

In this tutorial, we've explored how to build KNN-based recommender systems using the similaripy package, covering both item KNN and user KNN variants. We began by discussing the underlying concept of KNN recommenders - representing either items or users as vectors in an interaction space, then finding their nearest neighbours based on a chosen similarity function. We experimented with a wide range of similarity measures from classic metrics like cosine and Jaccard to more advanced, graph-based methods like P3α, RP3β, and S+. We saw how each one changes the way “neighbourhood” relationships are defined and can impact the diversity and relevance of recommendations.

We then introduced hyperparameter tuning with Optuna, showing how KNN models depend on parameters such as K (number of neighbors), shrinkage (regularization), and method-specific tuning constants (e.g., α, β, λ). For each similarity metric, Optuna searched across these parameters to find the configuration that maximized precision on a validation set. This automated approach allowed us to cover large parameter spaces efficiently without manual trial-and-error.

Next, we extracted item–item similarity matrices from the best-performing models and extended the computation beyond the top K neighbors to get full similarity data for every item pair. We then used these matrices to derive item embeddings, applying UMAP for dimensionality reduction and visualization. This gave us a clear, interpretable map of how items are positioned in similarity space, often revealing category clusters and cross-category relationships that the model has learned.

Finally, we shifted from item KNN to user KNN, demonstrating that the conversion is as simple as deciding whether to transpose the user–item interaction matrix before computing similarities. While item KNN focuses on finding similar products and recommending them to users who have shown interest in related items, user KNN focuses on finding similar people and recommending what those people have enjoyed.

Item KNN is typically more stable over time because item similarity patterns change relatively slowly. Once an item’s relationships are learned, they can be reused for every user, making scoring fast as it's just a simple dot product between the user’s profile and the item similarity matrix. This method works especially well in domains where items have rich co-purchase or co-consumption patterns and where catalogue turnover isn’t extreme. However, it may struggle with brand-new items that have few or no interactions (the item cold-start problem).

User KNN, in contrast, can adapt quickly to changing user behaviour since recommendations depend on the most similar users right now. It can work well in cases with highly dynamic preferences but suffers when there are many more users than items, as computing and storing a large user–user similarity matrix becomes expensive. It also faces a user cold-start problem as new users without history have no meaningful neighbourhood to draw from.

 

Both KNN methods differ from matrix factorization approaches (like ALS or BPR) in that they do not learn latent embeddings optimised for a loss function. Instead, they work directly with the observed interaction patterns, making them easy to interpret and quick to update without retraining from scratch. Factorisation methods, however, can capture hidden structure in the data, discovering subtle relationships even when explicit co-occurrence is sparse. This can give them an advantage over KNN methods in terms of accuracy, particularly for long-tail items.

In practice, KNN models excel when interpretability, simplicity, and incremental updating are important, while factorisation methods shine when dense latent relationships and global structure are key. Many production systems combine the two, using KNN for explainable or cold-start-friendly recommendations and factorisation for high-accuracy personalised ranking. If you'd like to learn more about factorisation methods there's an ALS tutorial here and a LightFM one here. If you'd like to learn about the 'principled neighbourhood method' EASE there's a tutorial here.
