Background theory for recommendations

the background theory for matrix factorization-based collaborative filtering as applied to recommendation systems.

Collaborative filtering for recommendation systems

Collaborative filtering relies only on observed user behavior to make recommendations—no profile data or content access is necessary.

The basic assumption is that similar user behavior reflects similar fundamental preferences, allowing a recommendation engine to make suggestions accordingly.

For example, suppose User 1 has viewed items A, B, C, D, E, and F. User 2 has viewed items A, B, D, E and F, but not C. Because both users viewed five of the same six items, it’s likely that they share some basic preferences. User 1 liked item C, and it’s probable that User 2 would also like item C if the user were aware of its existence.

Matrix factorization for collaborative filtering

The collaborative filtering problem can be solved using matrix factorization.

Suppose you have a matrix consisting of user IDs and their interactions with your products. Each row corresponds to a unique user, and each column corresponds to an item. Each entry in the matrix captures a user’s rating or preference for a single item. The rating could be explicit, directly generated by user feedback, or it could be implicit, based on user purchases or time spent interacting with an article or video.

Ratings in the MovieLens dataset range from 1 to 5. Empty rating entries have value 0, meaning that a given user hasn’t rated the item.

Defining the matrix factorization method

A ratings matrix consists of a matrix $R$, where entries $r_{ij}$ are ratings of user $i$ for item $j$.

The matrix factorization method assumes that there is a set of attributes common to all items, with items differing in the degree to which they express these attributes. Furthermore, the matrix factorization method assumes that each user has their own expression for each of these attributes, independent of the items. In this way, a user’s item rating can be approximated by summing the user’s strength for each attribute weighted by the degree to which the item expresses this attribute. These attributes are sometimes called hidden or latent factors.

Intuitively, it’s easy to see that these hypothetical latent factors actually exist. In the case of movies, it’s clear that many users prefer certain genres, actors, or directors. Much of the power of the matrix factorization approach to collaborative filtering derives from the fact that it’s not necessary to know the number of hypothetical latent factors that might comprise the entirety of a given user’s preferences. It’s simply assumed there are an arbitrary number of hypothetical latent factors.

Transforming the matrix to represent latent factors

To translate the existence of latent factors into the matrix of ratings, you do this: for a set of users $U$ of size $u$ and items $I$ of size $i$, you pick an arbitrary number $k$ of latent factors and factorize the large matrix $R$ into two much smaller matrices $X$ (the “row factor”) and $Y$ (the “column factor”). Matrix $X$ has dimension $u × k$, and $Y$ has dimension $k × i$.

In linear algebra this is called a low-rank approximation. Every user is represented by a vector in this $k$-dimensional space, and each row $x_u$ in $X$ corresponds to the strength of the user’s preferences for these $k$ factors. Similarly, every item represented by a vector in this $k$-dimensional space has a column $y_i$ in $Y$ corresponding to the item’s expression of the same $k$ factors.

To calculate the predicted rating of user u for item i, you take the dot product of the two vectors: $$ r_{ui} = x_u^T . y_i $$ You can define a loss function measuring the accuracy of the approximation in the following way, sum the squared difference between the approximate rating $x_u^T⋅y_i$ and the actual rating from the training set $r_{ui}$.: $$ L=\sum_{u,i}(r_{ui}−x_u^T⋅y_i)^2 $$ It’s common practice to add regularization terms to this loss function to help prevent overfitting. Adding L2 regularization terms for both row and column factors gives the following: $$ L=\sum_{u,i}(r_{ui}−x_u^T⋅y_i)^2 + \lambda\sum_{u}||x_u||^2 + \lambda\sum_{i}||y_i||^2 $$ Here, $λ$ is a regularization constant, one of the model’s hyperparameters.

The WALS method of matrix factorization

Alternating least squares method (交替最小二乘法)

The alternating least squares method of matrix factorization is an iterative method for determining the optimal factors $X$ and $Y$ that best approximate $R$. In each iteration, one of the row or column factors is held fixed and the other is computed by minimizing the loss function with respect to the other factor.

First, you begin with the row factors:

  1. Set the column factors to constant values.
  2. Take the derivative of the loss function with respect to the row factors.
  3. Set the equation equal to zero.

$$ \frac {\partial{L}} {\partial{x_u}} = −2 \sum_i (r_{ui} − x_u^T⋅y_i)y_i^T + 2\lambda_{xu}^T $$

$$ 0 = −(r_u − x_u^TY^T)Y + \lambda x_u^T $$

$$ x_u^T(Y^TY + \lambda I) = r_uY $$

$$ x_u^T = r_uY(Y^TY + \lambda I)^{−1} $$

Alternating row and column factors, the iterative process is repeated until convergence.

Weighted alternating least squares (WALS) method (加权交替最小二乘法)

$$ L^w = {W} \sum_{u,i}(r_{ui}−x_u^T⋅y_i)^2 $$

In this equation:

  • $$w_{ui} = \omega_{0}$$ for zero (unobserved) entries in the ratings matrix

  • $$w_{ui} = \omega_{0} + f(c_{i})$$ for observed entries

  • $$c_{i} = \sum_{u,i}1 \text{ if } r_{ui} > 0$$ the sum of the number of nonzero entries for column $i$

The weight is scaled by the sum of the nonzero entries in a row to normalize the weight for users who have rated a different number of items. This type of weighting allows for a more flexible model of the user’s preferences and produces better empirical results than the unweighted version.

The function $f$ varies according to the dataset and whether ratings are explicit or implicit.

  • The MovieLens dataset contains explicit ratings. In this case, a better choice for $f$ is one that weighs the observed entries linearly:

    $$f = \omega_{k}c_{i}$$ where $\omega_{k}$ is the observed weight.

  • For implicit ratings related to dynamic events, where each rating corresponds to the number of times a video has been watched, an article read, or a web page viewed, the rating itself may have an exponential distribution due to user behavior. There will be many low-value ratings as users click on a page or video and navigate away quickly. There will be fewer large implicit ratings as users read an entire article, watch an entire video, or watch a given scheduled show multiple times. In this case, an appropriate $f$ is one that weights the observed ratings to account for this distribution:

    $f = \left(\frac{1}{c_{i}}\right)^{e}$ where $e$ is the feature weight exponent.

WALS compared to other techniques

Many matrix factorization techniques are used for collaborative filtering, including SVD and Stochastic Gradient Descent. In some cases these techniques give better reduced-rank approximations than WALS. It’s worth noting the following advantages of WALS:

  • The weights used in WALS make it suitable for implicit ratings.

  • WALS includes algorithmic optimizations that make it easy to incorporate weights and efficiently calculate row and column factor updates.

  • It’s relatively straightforward to create a distributed version of the algorithm.

Create the Model

Create Docker image environment

1
docker pull tensorflow/tensorflow:1.15.4-py3-jupyter
1
2
git clone https://github.com/GoogleCloudPlatform/tensorflow-recommendation-wals
cd tensorflow-recommendation-wals
1
vim Dockerfile
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# Use an official tensorflow runtime as a parent image
FROM tensorflow/tensorflow:1.15.4-py3-jupyter

# Set the working directory to /app
WORKDIR /app

# Copy the requirement list into the container at /app
ADD requirements.txt /app

# Change mirrors and install vim
RUN sed -i "s@http://archive.ubuntu.com@http://mirrors.tuna.tsinghua.edu.cn@g" /etc/apt/sources.list && sed -i "s@http://security.ubuntu.com@http://mirrors.tuna.tsinghua.edu.cn@g" /etc/apt/sources.list && rm -Rf /var/lib/apt/lists/* && apt-get update && apt-get install -y vim

# Upgrading pip
RUN pip install -i https://pypi.tuna.tsinghua.edu.cn/simple pip -U

# set package index url
RUN pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple

# Install any needed packages specified in requirements.txt
RUN pip install -r requirements.txt

# Make port 8888 available to the world outside this container
EXPOSE 8888
1
2
docker build -t tfrec:latest --no-cache --network=host .
docker image ls

Download the MovieLens dataset

1
2
3
4
5
curl -O 'http://files.grouplens.org/datasets/movielens/ml-100k.zip'
umzip ml-100k.zip
mkdir /data/datasets/ml-100k
cp ml-100k/u.data /data/datasets/ml-100k/
rm -r ml-100k ml-100k.zip

Run Jupyter notebook

1
docker run -it --rm -v $PWD:/app -v /data/datasets/ml-100k:/datasets -w /app -p 8888:8888 tfrec:latest bash
1
jupyter notebook --ip 0.0.0.0 --no-browser --allow-root ./notebooks/Part1.ipynb

Train and tune the model

1
cd wals_ml_engine
1
docker run -it --rm -v $PWD:/app -v /data/datasets/ml-100k:/datasets -w /app tfrec:latest python -m trainer.task --job-dir=/app --train-file=/datasets/u.data --data-type=ratings --delimiter='\t'

Install miniconda

1
2
3
4
$ yay -S miniconda3
$ echo "export PATH=/opt/miniconda3/bin/:$PATH" >> ~/.bashrc
$ echo "[ -f /opt/miniconda3/etc/profile.d/conda.sh ] && source /opt/miniconda3/etc/profile.d/conda.sh" >> ~/.bashrc
$ source ~/.bashrc

Install the Python packages and TensorFlow.

1
2
3
$ conda create -y -n tfrec                     # -y:  Do not ask for confirmation, -n: Name of environment.
$ conda install -y -n tfrec --file conda.txt   # --file FILE: Read package versions from the given file
$ source activate tfrec
1
2
3
4
5
$ conda deactivate
$ conda remove -n tfrec --all
$ conda create -y -n tfrec python=3.7          # with a specific version of Python
$ pip install -r requirements.txt
$ pip install tensorflow==1.15

API and Predict

ensure that the application bind to 0.0.0.0 and not 127.0.0.1

1
docker run -it --rm -v $PWD:/app -v /data/datasets/ml-100k:/datasets -w /app tfrec:latest python -m trainer.task --job-dir=/app --train-file=/datasets/u.data --data-type=ratings --delimiter='\t'

Deploy the Recommendation System

Reference

  1. Building a Recommendation System in TensorFlow

  2. Recommendation Systems

  3. Recommender System with Matrix Factorization

  4. 如何在 TensorFlow 內建立推薦系統:總覽

  5. 有哪些好用的开源推荐系统?

  6. Recommendation Model

  7. Matrix Factorization in tensorflow 2.0 using WALS Method

reviewing https://github.com/tensorflow/models/tree/08bb9eb5ad79e6bceffc71aeea6af809cc78694b/official/recommendation for how to get started

I was able to dig around and find a very similar approach in the tutorials https://developers.google.com/machine-learning/recommendation/dnn/softmax

(props to whoever put those tutorials together was helpful getting another look at the approach and tradeoffs)

note for anyone else looking for example apps I bumped into a couple by Nvidia https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Recommendation/NCF

cool overview for intel’s analytics zoo (office depot) across models MF, ncf, wide & deep, and session recommender https://software.intel.com/en-us/articles/real-time-product-recommendations-for-office-depot-using-apache-spark-and-analytics-zoo-on

it looks like Google’s official example leverages apache spark’s als model https://cloud.google.com/solutions/recommendations-using-machine-learning-on-compute-engine