Recommender Systems – The personalization technology of the future.
How to design, build and evaluate it.
Recommender systems are one of the most widely applied machine learning techniques nowadays. Many of our customers have asked us to build one; from gaming companies to broadcasting giants, personalization technology is vital when you want to serve your target group better.
For a recent project at Boerderij magazine, I worked on the design and construction of one. In this blog, I will go in-depth on what it takes to build a recommender system, starting from the data, models, and packages and ending with the initial evaluation and AB-test results. So, put on your data science hats and let's get started!
Boerderij magazine and project scope
If you are not a farmer, you've probably never heard of Boerderij magazine. But for everyone working in the industry, Boerderij is the trusted go-to news source. The vast majority of Dutch farmers depend on its reporting to stay informed and make the best decisions for their business.
Although they have a print magazine as well, the focus of this project is on the digital channels, and more specifically their newsletter. These digital channels receive significant traffic, which is perfect, because the more data you have, the better your recommender model will be.
For the purpose of this project, we focused on the articles shown in the newsletter, specifically the background articles. Compared to regular news articles, background articles stay relevant for a longer period of time, which makes them well suited for use as recommendations. The main goal of this project was to improve the click-through rate of those articles in the newsletter.
Any recommender system requires a specific type of data to work, called interaction data (i.e. which user interacted with which piece of content). Interaction data comes in two flavors: direct feedback and indirect feedback. In an ideal world we would have direct feedback, such as a like or a rating on a specific item. But as any content creator can tell you, this type of feedback is usually scarce or non-existent, so I wasn't surprised that this was the case at Boerderij as well.
In such cases, feedback from users has to be determined indirectly. Sometimes you have to get creative in establishing, to a certain degree, that someone liked an article. Signals like time spent on a page, the percentage of an article scrolled, and bounce rate can get you started, but they might not give you the full picture. In this case, I only had click data available; luckily, this proved to be sufficient.
And finally, before moving on, it's important to introduce the 'cold-start problem', a common challenge for recommender systems. It refers to the fact that a recommender system needs historical interaction data, either from users (user cold-start) or items (item cold-start), to make recommendations. Depending on the type of recommender system used, you have an item cold-start problem, a user cold-start problem, or both.
To evaluate the different recommender system models fairly, without too much influence from the user cold-start problem, only data from users with at least 5 interactions over the last year was selected for this project.
With that filter applied, I had around 25,000 active users and 5,000 background articles, which together accounted for over 3 million unique interactions. For each article, the text was also available, which would allow for content-based recommendations. Furthermore, some user data was available as well. It was not taken into consideration for this project, but it might definitely be useful for a hybrid recommender system model in the future.
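The filter on active users can be expressed in a few lines of pandas. The data below is a toy interaction log, not the actual Boerderij data, and the column names are illustrative:

```python
import pandas as pd

# Toy interaction log: one row per (user, article) click.
interactions = pd.DataFrame({
    "user_id":    [1, 1, 1, 1, 1, 2, 2, 3],
    "article_id": [10, 11, 12, 13, 14, 10, 15, 16],
})

# Keep only users with at least 5 interactions (in the real project
# this was restricted to interactions from the last year).
counts = interactions.groupby("user_id")["article_id"].transform("count")
active = interactions[counts >= 5]

print(sorted(active["user_id"].unique()))  # only user 1 qualifies: [1]
```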
The “offline evaluation” challenge
Once I had the data to train a recommender system, the second real challenge started: finding a way to evaluate it offline, a notoriously difficult task. In an ideal world, offline evaluation could be avoided entirely in favor of online experiments. In practice, however, you need it to optimize and compare the performance of different models before taking one into production.
The problem with offline evaluation is that you can't directly measure the effect on the KPIs you care about, in this case the click-through rate.
Relevance depends on many factors, such as the 'freshness' of an article, the variety of the recommendations, and their persistence over time. These factors influence the click-through rate, but the extent is difficult to measure in an offline environment. So, although offline evaluation should be approached with skepticism, there are measures that give an indication of how well the different models are performing.
The method used for this project is a variation on recall@K. It can be computed as follows:
- For each user, take each article in the test set
- Sample 100 random non-relevant articles (assumption: non-read = non-relevant)
- Rank the list of 100 non-relevant + 1 relevant articles with the model
- Check whether the relevant article is present (1) or absent (0) in the top-K recommendations
- Calculate the average over the test set
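The steps above can be sketched as follows. The `model_scores` scoring function, user identifiers, and item identifiers are all hypothetical placeholders; any model that can score a (user, item) pair fits this interface:

```python
import random

def recall_hit(model_scores, user, relevant_item, non_relevant_pool, k=10, n_samples=100):
    """Rank 1 relevant article against n_samples sampled non-relevant ones;
    return 1 if the relevant article lands in the top-k, else 0."""
    sampled = random.sample(non_relevant_pool, n_samples)
    candidates = sampled + [relevant_item]
    ranked = sorted(candidates, key=lambda item: model_scores(user, item), reverse=True)
    return 1 if relevant_item in ranked[:k] else 0

def recall_at_k(model_scores, test_pairs, non_relevant_pool, k=10):
    """Average the hit indicator over all (user, relevant_item) test pairs."""
    hits = [recall_hit(model_scores, user, item, non_relevant_pool, k=k)
            for user, item in test_pairs]
    return sum(hits) / len(hits)
```

A perfect model always ranks the relevant article first and scores 1.0; a model that always ranks it last scores 0.0, which brackets the metric nicely.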
This measure gives the likelihood that a relevant article occurs in the top-K recommendations when it is mixed with 100 non-relevant articles. In other words, with random recommendations the recall@5 would be around 5%, and the recall@10 around 10%. I personally favor this measure because it gives a decent indication of a model's performance while remaining easy to explain and interpret.
Last but not least, it provides a way to compare a model's performance across different users or types of users, which is not possible with the original definition of recall@K. In the coming sections, the performance of each model is reported using this evaluation measure, which I will simply refer to as recall@K.
Popularity ‘model’ benchmark
You may think that making recommendations based on popularity alone can't rightly be called a model, and I would agree with you. Nonetheless, this method generally proves rather effective, mainly due to the so-called Pareto principle: around 20% of the articles generate 80% of the interactions. For this project, I did not plan on actually productionizing this approach; I merely used it as a benchmark. If the other models did not perform better than the popularity 'model', one could seriously doubt whether they were worth implementing.
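A minimal sketch of this benchmark, using toy click data (the user and article identifiers are illustrative):

```python
from collections import Counter

# Toy interaction log: (user, article) click pairs.
clicks = [("u1", "a1"), ("u2", "a1"), ("u3", "a1"),
          ("u1", "a2"), ("u2", "a2"), ("u1", "a3")]

# Global popularity: how often each article was clicked.
popularity = Counter(article for _, article in clicks)

def recommend_popular(user, n=2):
    """Return the n most-clicked articles the user has not read yet."""
    read = {article for u, article in clicks if u == user}
    ranked = [article for article, _ in popularity.most_common()
              if article not in read]
    return ranked[:n]

print(recommend_popular("u3"))  # u3 only read a1, so: ['a2', 'a3']
```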
In this case, the popularity 'model' recommended items that are popular but have not yet been read by the user. This resulted in a recall@5 of 27% and a recall@10 of 40%; way better than simply sending random recommendations.
| Pros | Cons | Results |
| --- | --- | --- |
| Evaluation benchmark | Little / no personalization | recall@5 = 27% |
| No user cold-start | Item cold-start | recall@10 = 40% |
Content-based model
Now that the baseline performance of the popularity 'model' was established, it was time to move to a more advanced recommender system: a content-based model. This model recommends articles that fit a user's profile, built from earlier interactions. To do so, content features for each article need to be known.
Articles are essentially text documents, so a straightforward term frequency–inverse document frequency (tf-idf) representation of the article text, title and introduction is a good initial way to compute the item features. I used the tf-idf implementation from scikit-learn: TfidfVectorizer. A user profile can then be built by aggregating the feature vectors of the articles that user has read. Once the user profiles were computed, the cosine similarity between the user-profile vector and each article's feature vector provides a measure of the relevance of that article for that user.
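A minimal sketch of this pipeline, using three toy article snippets (the texts and the choice of mean-pooling for the user profile are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy article texts (the real model used title, introduction and body text).
articles = [
    "milk prices rise for dairy farmers",     # 0
    "dairy cow feed costs and milk yield",    # 1
    "new tractor models for arable farming",  # 2
]

vectorizer = TfidfVectorizer()
item_vecs = vectorizer.fit_transform(articles)

# User profile: mean tf-idf vector of the articles this user has read.
read = [0, 1]  # this user read the two dairy articles
profile = np.asarray(item_vecs[read].mean(axis=0))

# Cosine similarity between the profile and every article's vector.
scores = cosine_similarity(profile, item_vecs).ravel()

# Recommend the most similar article the user has not read yet.
best_unread = max(set(range(len(articles))) - set(read), key=lambda i: scores[i])
```

As expected, the unread dairy-adjacent scores stay close to the profile while the tractor article scores lower, which is exactly the "known area of interest" behavior noted in the table below.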
As this is a relatively straightforward approach, it did not require any specialized packages beyond NumPy and scikit-learn functions for normalization and for computing the cosine similarity. This model achieved a recall@5 of 34% and a recall@10 of 47%, which, as I had hoped, is better than the popularity 'model' benchmark.
| Pros | Cons | Results |
| --- | --- | --- |
| No item cold-start | User cold-start | recall@5 = 34% |
| | No recommendations outside the known area of interest | recall@10 = 47% |
Model-based collaborative filtering
The model I implemented next is a collaborative filtering model, probably the most widely used recommender system approach. A few different flavors of collaborative filtering can be distinguished, but they all recommend articles that were relevant for users with similar behavior. In this case, I used model-based collaborative filtering, as it is more scalable and handles sparse data better than memory-based collaborative filtering. In model-based collaborative filtering, a matrix factorization method decomposes the interaction matrix of users, items, and their interactions into two smaller matrices, which contain the users and items with their computed latent features. Multiplying these two matrices gives the score of each user for each item, including items for which no score was previously available.
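To illustrate the decomposition, here is a minimal NumPy sketch of matrix factorization trained with gradient descent on a toy binary interaction matrix. The data, latent dimension and hyperparameters are all illustrative, and the squared-error loss is a simplification; a real implicit-feedback model such as the one used in this project optimizes a ranking loss instead:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary interaction matrix: rows = users, columns = articles.
R = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

n_users, n_items = R.shape
k = 2  # number of latent features
U = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
V = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors

lr, reg = 0.05, 0.01
for _ in range(1000):
    err = R - U @ V.T                  # reconstruction error on all cells
    U += lr * (err @ V - reg * U)      # gradient step for user factors
    V += lr * (err.T @ U - reg * V)    # gradient step for item factors

# Predicted relevance for every user-item pair, including unseen items.
scores = U @ V.T
```

The key point is the last line: the reconstructed matrix assigns a score to every user-item pair, which is what makes recommendations for never-seen combinations possible.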
As said, there are multiple approaches to matrix factorization. In this case, I used a package called LightFM, which performs matrix factorization with gradient descent. I chose this package because it allows you to extend the 'regular' collaborative filtering model into a hybrid recommender system by adding user and item features. Although this option was not used initially during this project, it could prove useful in the future if cold-start problems turn out to be an issue.
In our case, the performance of this model was even better than that of the content-based model: recall@5 was 54% and recall@10 was 68%. Again, an improvement over the previous models.
| Pros | Cons | Results |
| --- | --- | --- |
| 'Surprising' recommendations | User cold-start | recall@5 = 54% |
| Deals well with sparse data | Item cold-start | recall@10 = 68% |
| Scalable | Inference not explainable | |
Time to see the actual results
Finally, it was time for the real thing: an actual AB(C)-test. This is crucial for evaluating a recommender system because, as explained earlier, offline evaluation results are limited and should always be approached with skepticism, since you do not directly measure the KPIs you actually care about.
For the AB-test, 12,632 active online users were selected and randomly divided into three groups of roughly 4,210 users each.
| Group | Users | Recommendations |
| --- | --- | --- |
| Random | 4,210 | Random recommendations (within the farmer's sector) |
| Popularity | 4,211 | Recommendations based on popular articles (not yet read by the user) |
| Collaborative filtering | 4,211 | Recommendations based on the collaborative filtering model |
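A random three-way split like this is simple to reproduce. The sketch below shuffles the selected users and deals them out round-robin, which yields exactly the group sizes above; the seed and identifiers are illustrative:

```python
import random

random.seed(42)

user_ids = list(range(12632))  # the 12,632 selected active users
random.shuffle(user_ids)

# Round-robin assignment: groups of 4,211, 4,211 and 4,210 users.
groups = {
    "random":        user_ids[0::3],
    "popularity":    user_ids[1::3],
    "collaborative": user_ids[2::3],
}
```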
Because there was no recommender system in place yet, the baseline click-through rate for the new section of the newsletter was unknown. We therefore added a group that would receive random recommendations; this served as the minimum click-through rate to improve upon. The second group received recommendations based on popularity, a benchmark that a more personalized approach should definitely beat. For the third group, the collaborative filtering model was used, as it was the best-performing model in the offline evaluation. Altogether, this AB-test could give a strong indication of what a recommender system can do for Boerderij magazine.
The results turned out to be very positive! The collaborative filtering model performed more than three times as well as the random recommendations, and over 30% better than the popularity 'model'. Altogether, these results showed that a recommender system is worth implementing for Boerderij magazine. This means starting out with getting the collaborative filtering model into production, after which further experiments can take place to improve the model.
The main future step would be moving to a hybrid model that makes use of already-known item and user features, which could largely overcome the item and user cold-start problems inherent to collaborative filtering.
By walking through the recommender system project for Boerderij magazine, I hope this blog shed some light on the process of developing a recommendation engine, the challenges and hurdles that have to be overcome, and, ultimately, the benefits that can be achieved.