Matrix Factorization code not converging

import numpy as np


class MatrixFactorization:

    def __init__(self, matrix, features_count, is_set_function, learning_rate=1e-3, regularization_factor=2e-2):
        self.matrix = matrix
        self.features_count = features_count
        self.alpha = learning_rate
        self.beta = regularization_factor
        self.is_set = is_set_function

        self.global_bias = self.compute_global_bias()
        self.f1w = np.random.rand(matrix.shape[0], features_count) * self.global_bias  # factor 1 weight
        self.f1b = np.zeros((matrix.shape[0]))  # factor 1 bias
        self.f2w = np.random.rand(features_count, matrix.shape[1]) * self.global_bias  # factor 2 weight
        self.f2b = np.zeros((matrix.shape[1]))  # factor 2 bias
        self.samples = self.compute_samples()

    def compute_global_bias(self):
        n = 0
        s = 0
        for row in self.matrix:
            for x in row:
                if self.is_set(x):
                    n += 1
                    s += x
        return s / n

    def compute_samples(self):
        samples = []
        for i in range(self.matrix.shape[0]):
            for j in range(self.matrix.shape[1]):
                if self.is_set(self.matrix[i, j]):
                    samples.append((i, j, self.matrix[i, j]))
        return samples

    def shuffle(self):
        np.random.shuffle(self.samples)

    def stochastic_gradient_descent(self):

        for i, j, p in self.samples:
            prediction = self.predict(i, j)
            error = prediction - self.matrix[i, j]

            # Updating the biases, the global bias is a constant.
            self.f1b -= self.alpha * (error - self.beta * self.f1b[i])
            self.f2b -= self.alpha * (error - self.beta * self.f2b[j])

            self.f1w[i, :] -= self.alpha * (error * self.f2w[:, j] - self.beta * self.f1w[i, :])
            self.f2w[:, j] -= self.alpha * (error * self.f1w[i, :] - self.beta * self.f2w[:, j])

    def cost(self):
        return np.sum((self.construct_matrix() - self.matrix) ** 2)

    def predict(self, i, j):
        return self.f1w[i, :] @ self.f2w[:, j] + self.global_bias + self.f1b[i] + self.f2b[j]

    def construct_matrix(self):
        return self.f1w @ self.f2w + self.global_bias + self.f1b[:, np.newaxis] + self.f2b[np.newaxis, :]

    def train(self, epsilon):
        self.shuffle()
        while True:
            cost = self.cost()
            print(cost)
            if cost < epsilon:
                break
            self.stochastic_gradient_descent()


def AlwaysTrue(x):
    return True


def NonNegative(x):
    return x >= 0


def set_random_indices(matrix, n, value):
    indices = np.random.choice(matrix.size, n, replace=False)  # n distinct flat positions
    i = indices // matrix.shape[1]
    j = indices % matrix.shape[1]
    matrix[i, j] = value


def test(matrix, n_unknowns=5, learning_rate=0.0001, regularization_factor=0.01, epsilon=1e-2):
    copy = np.copy(matrix)
    set_random_indices(copy, n_unknowns, -1)

    # Note that the error won't go beyond some
    # point if the regularization factor is not 0
    m = MatrixFactorization(copy, 2, NonNegative, learning_rate, regularization_factor)
    m.train(epsilon)

    print('\nFactor 1:')
    print('W: \n', m.f1w)
    print('B: \n', m.f1b)
    print('\nFactor 2:')
    print('W: \n', m.f2w)
    print('B: \n', m.f2b)
    print('\nGlobal Bias:', m.global_bias)
    print('\nMatrix:')
    print(matrix)
    print('\nPrediction:')
    prediction = m.construct_matrix()
    print(prediction)
    print('\nDifference:')
    print(prediction - matrix)


if __name__ == '__main__':
    matrix = np.array([[3, 1, 1, 3, 1],
                       [1, 2, 4, 1, 3],
                       [3, 1, 1, 3, 1],
                       [4, 3, 5, 4, 4]])
    # test(matrix, 0)
    test(matrix, 2)

After learning about Collaborative Filtering and Content-based Filtering from the course, I stumbled upon a technique called “Matrix Factorization”.

Using stochastic gradient descent, I approximate latent feature vectors for both the users and the items (or whatever the two factors of the matrix are).

I have a test function that sets some random elements in the matrix to -1, which is equivalent to having the value not set (to be predicted).

For a 4 * 5 matrix, it works fine when there are no unknowns, but if I set only 2 unknowns, it does not converge. I’ve tried a lot of different hyperparameters, but it doesn’t seem to work.

I’m wondering, is it a bug in the code that is causing this? Why is it not converging?

            if cost < epsilon:
                break

Hello @OsamaAhmad, perhaps you want to change the above lines from checking the value of the cost to checking the change in cost?

Sure, but what I meant is that the error doesn’t get anywhere close to 0 (or epsilon = 1e-2).
Maybe the word “converge” isn’t the best word to use here.

Yes, usually we speak about the change when it comes to converging.
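
For example, a change-based stopping condition could look roughly like this (just a sketch; tol and max_epochs are names I’m making up):

    def train(self, tol=1e-6, max_epochs=10000):
        self.shuffle()
        prev_cost = self.cost()
        for _ in range(max_epochs):
            self.stochastic_gradient_descent()
            cost = self.cost()
            # Stop once the cost stops changing much, instead of waiting for it to reach a fixed value
            if abs(prev_cost - cost) < tol:
                break
            prev_cost = cost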


            self.f1b -= self.alpha * (error - self.beta * self.f1b[i])
            self.f2b -= self.alpha * (error - self.beta * self.f2b[j])

            self.f1w[i, :] -= self.alpha * (error * self.f2w[:, j] - self.beta * self.f1w[i, :])
            self.f2w[:, j] -= self.alpha * (error * self.f1w[i, :] - self.beta * self.f2w[:, j])

Why minus self.beta and not plus?

Thank you for spotting it, that’s a mistake

Also, in these two lines, are f1w and f2w mixed up? Or should the error be multiplied by the feature values instead of the feature weights?

The item at position (i, j) in the matrix = dot(f1w[i, :], f2w[:, j]) + biases
The derivative with respect to f1w[i, :] = error * f2w[:, j] + …
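
To make that concrete, the update for one sample with the regularization sign fixed would look roughly like this (just a sketch; copying the old row so that both updates use the same values is an extra precaution on my part):

            old_f1w_i = self.f1w[i, :].copy()  # keep the pre-update row for the second update
            self.f1w[i, :] -= self.alpha * (error * self.f2w[:, j] + self.beta * self.f1w[i, :])
            self.f2w[:, j] -= self.alpha * (error * old_f1w_i + self.beta * self.f2w[:, j])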

How does it go now? Is it converging?

No noticeable difference :confused:

What if you disable the regularization?

Also, I don’t quite understand the purpose of NonNegative or is_set there. Would anything bad happen if one of the matrix’s elements is negative? Can we remove this stuff?

I tried it, no difference either :slightly_frowning_face:

This is used to filter out the unset values.
As far as I know, there is no way of “unsetting” a value in a matrix.
So I put a constraint: in this example, negative values are not allowed, so -1 is an invalid value and is therefore considered not set.

The is_set function is responsible for determining whether a value is a valid value or not. If it’s valid, the number is put in a list called samples, from which the stochastic gradient descent picks training examples.
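
For illustration, another way to mark unset entries (assuming the matrix holds floats) would be NaN together with np.isnan, for example:

import numpy as np

ratings = np.array([[3.0, np.nan, 1.0],
                    [np.nan, 2.0, 4.0]])  # NaN marks an unset rating

def is_set(x):
    return not np.isnan(x)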

Minor question: I am not familiar with the term “unset values”; in what context is it used? I have doubts because if you are just factorizing a matrix, it shouldn’t matter whether an element of the matrix is negative or not. Is having the “unset check” absolutely necessary at this stage?

Major question: should your cost also NOT account for “invalid” elements?


If we have a matrix representing users’ ratings of movies, for example, not all movies will be rated by all users. In fact, most of the matrix will be unset, and only a small subset of it will have numbers from 0 to 5.

To predict the ratings for the movies that are not yet rated (unset), the following is done:

  • Each user i is represented with an embedding vector capturing some latent features (vu_i)
  • Each item j is represented with an embedding vector capturing some latent features (vm_j)
  • dot(vu_i, vm_j) should equal the rating of user i for movie j (or a good approximation)

Since not all values are set, we only look at the numbers that are set, and train the model to compute vectors that give a correct answer for the set ratings.

Using these embedding vectors, we can then predict the rating of user k for movie l by computing dot(vu_k, vm_l).

Having a global bias, a bias for each user, and a bias for each movie should speed up the training process.
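
As a toy example with 2 latent features (the numbers here are made up):

import numpy as np

vu_i = np.array([0.5, 0.2])  # latent features for user i
vm_j = np.array([2.0, 1.0])  # latent features for movie j
global_bias, user_bias_i, movie_bias_j = 2.5, 0.2, -0.4

predicted_rating = vu_i @ vm_j + global_bias + user_bias_i + movie_bias_j  # 1.2 + 2.3 = 3.5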

What about this? Your training focuses only on valid elements, so the cost should do the same, right? It’s not fair to include something your gradient descent doesn’t touch at all.
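
In code, restricting the cost to the set entries could look roughly like this (a sketch reusing your samples list):

    def cost(self):
        reconstructed = self.construct_matrix()
        # Only the entries that are actually set contribute to the cost
        return sum((reconstructed[i, j] - r) ** 2 for i, j, r in self.samples)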


Oh… Thanks! This is the cause of the problem.

It’s an interesting way to “impute” missing ratings with matrix factorization. I believe there is much more work ahead:

  1. I just ran it and observed that the cost goes down at first and then up when the matrix has a negative element. The fact that it went up can indicate (please confirm this) that as you train more, the imputed rating starts to deviate more from your initially desired rating. This may or may not be a problem:

    1. This may not be a problem because your matrix is a pretty random one and thus pattern-less, so maybe we should not have a huge expectation at all in the first place.

    2. If it is a problem, you may want to change the convergence condition from checking the cost to checking the change in cost, so that training stops before the cost starts going back up.

  2. Your cost takes the sum of the squared errors instead of their mean. Taking the mean is better because the mean shouldn’t be affected much by the size of your matrix (see the sketch after this list).

  3. Check whether your imputed ratings are within the 0-5 range or not. Perhaps you want to add some constraints to your gradient descent to make sure that happens, but obviously this will complicate things.

  4. If you just want to do matrix factorization, then you may use any matrix to test your algorithm, but I would not add the “valid check”. If you want to do rating imputation, I would not use just any matrix to test my algorithm; instead I would probably use some real-world data for that, because your imputation won’t make sense if your data is just any pattern-less matrix. I think you would agree that if your matrix is just any matrix, the imputed rating can also be just any value and there is no point in checking whether the imputed rating is good or not.
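
A rough sketch of points 2 and 3 (assuming m is a trained MatrixFactorization instance):

import numpy as np

reconstructed = m.construct_matrix()

# 2: take the mean of the squared errors over the known entries instead of the sum
mse = np.mean([(reconstructed[i, j] - r) ** 2 for i, j, r in m.samples])

# 3: clip the imputed ratings into the 0-5 range as a simple post-processing step
prediction = np.clip(reconstructed, 0, 5)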


Thank you for your effort!

The matrix is not random though, I got it from this tutorial on matrix factorization. If I were to use a random matrix, I would have generated one with a bigger size. Note here that the first and third rows are the same, which suggests that users 1 and 3 have exactly the same taste (if we assume that this is a rating matrix). Also, notice that row 4 = row 1 + row 2.

1 - I don’t know why that happens, to be honest. The matrix is not random; the only explanation I can think of right now is that the learning rate is big, which makes the gradient descent overshoot.

2 - Thank you, I missed this.

3 - This doesn’t assume anything, it’s just a class for factorizing matrices, but sure, other constraints have to be added if this were used for ratings. In fact, other methods take a weighted sum of the ratings of similar users, then divide it by the sum of the weights, to ensure that the result won’t be bigger than 5.

4 - It’s actually taken from a tutorial.

Well… it may not be perfectly random, but it may not be perfectly informative either.

You need to assess the quality of your dataset, for example: how was the data collected? Are there any errors in the data? How representative is your data?

These are questions I don’t think you will get an answer to from the tutorial, and frankly, a matrix this small can’t be very informative. So when I say “random”, I don’t just mean that you don’t know the source of the data, but much more than that, including the questions above.