# Matrix Factorization code not converging

``````python
import numpy as np


class MatrixFactorization:

    def __init__(self, matrix, features_count, is_set_function, learning_rate=1e-3, regularization_factor=2e-2):
        self.matrix = matrix
        self.features_count = features_count
        self.alpha = learning_rate
        self.beta = regularization_factor
        self.is_set = is_set_function

        self.global_bias = self.compute_global_bias()
        self.f1w = np.random.rand(matrix.shape[0], features_count) * self.global_bias  # factor 1 weight
        self.f1b = np.zeros((matrix.shape[0]))  # factor 1 bias
        self.f2w = np.random.rand(features_count, matrix.shape[1]) * self.global_bias  # factor 2 weight
        self.f2b = np.zeros((matrix.shape[1]))  # factor 2 bias
        self.samples = self.compute_samples()

    def compute_global_bias(self):
        n = 0
        s = 0
        for row in self.matrix:
            for x in row:
                if self.is_set(x):
                    n += 1
                    s += x
        return s / n

    def compute_samples(self):
        samples = []
        for i in range(self.matrix.shape[0]):
            for j in range(self.matrix.shape[1]):
                if self.is_set(self.matrix[i, j]):
                    samples.append((i, j, self.matrix[i, j]))
        return samples

    def shuffle(self):
        np.random.shuffle(self.samples)

        for i, j, p in self.samples:
            prediction = self.predict(i, j)
            error = prediction - self.matrix[i, j]

            # Updating the biases, the global bias is a constant.
            self.f1b -= self.alpha * (error - self.beta * self.f1b[i])
            self.f2b -= self.alpha * (error - self.beta * self.f2b[j])

            self.f1w[i, :] -= self.alpha * (error * self.f2w[:, j] - self.beta * self.f1w[i, :])
            self.f2w[:, j] -= self.alpha * (error * self.f1w[i, :] - self.beta * self.f2w[:, j])

    def cost(self):
        return np.sum((self.construct_matrix() - self.matrix) ** 2)

    def predict(self, i, j):
        return self.f1w[i, :] @ self.f2w[:, j] + self.global_bias + self.f1b[i] + self.f2b[j]

    def construct_matrix(self):
        return self.f1w @ self.f2w + self.global_bias + self.f1b[:, np.newaxis] + self.f2b[np.newaxis, :]

    def train(self, epsilon):
        while True:
            self.shuffle()
            cost = self.cost()
            print(cost)
            if cost < epsilon:
                break


def AlwaysTrue(x):
    return True


def NonNegative(x):
    return x >= 0


def set_random_indices(matrix, n, value):
    indices = np.random.choice(np.arange(0, matrix.size, 1), n)
    i = indices // matrix.shape[1]
    j = indices % matrix.shape[0]
    matrix[i, j] = value


def test(matrix, n_unknowns=5, learning_rate=0.0001, regularization_factor=0.01, epsilon=1e-2):
    copy = np.copy(matrix)
    set_random_indices(copy, n_unknowns, -1)

    # Note that the error won't go beyond some
    # point if the regularization factor is not 0
    m = MatrixFactorization(copy, 2, NonNegative, learning_rate, regularization_factor)
    m.train(epsilon)

    print('\nFactor 1:')
    print('W: \n', m.f1w)
    print('B: \n', m.f1b)
    print('\nFactor 2:')
    print('W: \n', m.f2w)
    print('B: \n', m.f2b)
    print('\nGlobal Bias:', m.global_bias)
    print('\nMatrix:')
    print(matrix)
    print('\nPrediction:')
    prediction = m.construct_matrix()
    print(prediction)
    print('\nDifference:')
    print(prediction - matrix)


if __name__ == '__main__':
    matrix = np.array([[3, 1, 1, 3, 1],
                       [1, 2, 4, 1, 3],
                       [3, 1, 1, 3, 1],
                       [4, 3, 5, 4, 4]])
    # test(matrix, 0)
    test(matrix, 2)
``````

After learning about Collaborative Filtering and Content-based Filtering from the course, I stumbled upon a technique called "Matrix Factorization".

Using stochastic gradient descent, I approximate latent feature vectors for both the users and the items (or whatever the two factors of the matrix are).

I have a test function that sets some random elements in the matrix to `-1`, which is equivalent to having the value not set (to be predicted).

For a 4 * 5 matrix, it works fine when there are no unknowns, but if I set only 2 unknowns, it does not converge. I've tried a lot of different hyperparameters, but it doesn't seem to work.

I'm wondering, is it a bug in the code that is causing this? Why is it not converging?

``````python
            if cost < epsilon:
                break
``````

Hello @OsamaAhmad, perhaps you want to change the above lines from checking the value of the cost to checking the change in the cost?
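For instance, a stopping rule based on the change in cost might look like this (just a sketch; `train_until_stable` is a made-up helper that reuses the `shuffle` and `cost` methods of the posted class):

```python
def train_until_stable(model, tol=1e-6, max_epochs=10_000):
    # Stop when the cost stops *changing* between epochs,
    # rather than when the cost itself drops below a threshold.
    prev_cost = float('inf')
    cost = prev_cost
    for _ in range(max_epochs):
        model.shuffle()  # in the posted code this also runs one SGD pass
        cost = model.cost()
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
    return cost
```

That way training stops once it has plateaued, even if the plateau is far above `epsilon`.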

Sure, but what I meant is that the error doesn't get anywhere close to 0 (or epsilon = 1e-2).
Maybe "converge" isn't the best word to use here.

Yes, usually we speak about the change when it comes to convergence.

``````python
self.f1b -= self.alpha * (error - self.beta * self.f1b[i])
self.f2b -= self.alpha * (error - self.beta * self.f2b[j])

self.f1w[i, :] -= self.alpha * (error * self.f2w[:, j] - self.beta * self.f1w[i, :])
self.f2w[:, j] -= self.alpha * (error * self.f1w[i, :] - self.beta * self.f2w[:, j])
``````

Why minus `self.beta` and not plus?

Thank you for spotting it, that's a mistake.

Also, in these two lines, are `f1w` and `f2w` mixed up? Or should the errors be multiplied by feature values instead of feature weights?

Item at position (i, j) in the matrix = dot(f1w[i, :], f2w[:, j]) + biases
Derivative with respect to f1w = error * f2w + …
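To double-check the sign, here is a small standalone finite-difference sketch (toy random factors, not the posted class): for the regularized squared error of one sample, the gradient with respect to `f1w[i, :]` comes out as `error * f2w[:, j] + beta * f1w[i, :]`, with a plus on the regularization term:

```python
import numpy as np

rng = np.random.default_rng(0)
f1w = rng.random((4, 2))   # toy factor-1 weights
f2w = rng.random((2, 5))   # toy factor-2 weights
beta = 0.02                # regularization factor
i, j, rating = 1, 3, 3.0   # one training sample

def sample_loss(w):
    # squared error for one sample plus L2 penalty on the touched row
    err = w[i, :] @ f2w[:, j] - rating
    return 0.5 * err ** 2 + 0.5 * beta * np.sum(w[i, :] ** 2)

error = f1w[i, :] @ f2w[:, j] - rating
analytic = error * f2w[:, j] + beta * f1w[i, :]  # note: plus beta, not minus

eps = 1e-6
numeric = np.zeros_like(analytic)
for k in range(len(numeric)):
    bumped = f1w.copy()
    bumped[i, k] += eps
    numeric[k] = (sample_loss(bumped) - sample_loss(f1w)) / eps

assert np.allclose(analytic, numeric, atol=1e-4)
```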

How does it go now, converging?

No noticeable difference

What if you disable the regularization?

Also, I don't quite understand the purpose of `NonNegative` or `is_set` there. Would it do anything bad if one of the matrix's elements is negative? Can we remove this stuff?

I tried it, no difference as well

This is used to filter out the unset values.
As far as I know, there is no way of "unsetting" a value in a matrix.
So I put a constraint: in this example, negative values are not allowed, thus -1 is an invalid value and is considered not set.

The `is_set` function is responsible for determining whether a value is valid or not. If it's valid, the number is put in a list called `samples`, from which the stochastic gradient descent picks training examples.
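A tiny standalone illustration of that filtering step (`non_negative` here is a stand-in for the `is_set` argument):

```python
import numpy as np

matrix = np.array([[3., -1., 1.],
                   [-1., 2., 4.]])

def non_negative(x):
    return x >= 0

# keep only the set (valid) entries as (row, col, value) training samples
samples = [(i, j, matrix[i, j])
           for i in range(matrix.shape[0])
           for j in range(matrix.shape[1])
           if non_negative(matrix[i, j])]

print(samples)  # the two -1 cells are treated as "not set" and skipped
```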

Minor question: I am not familiar with the term "unset values"; in what context is it used? I have doubts because, if you are just factorizing a matrix, it shouldn't matter whether an element of the matrix is negative or not. Is having the "unset check" absolutely necessary at this stage?

Major question: should your cost also NOT account for "invalid" elements?


If we have a matrix representing ratings of users and movies, for example, not all movies will be rated by all users. In fact, most of the matrix will be unset, and only a small subset of it will have numbers from 0 to 5.

To predict the ratings for the movies that are not yet rated (unset), the following is done:

• Each user i is represented with an embedding vector capturing some latent features `(vu_i)`
• Each item j is represented with an embedding vector capturing some latent features `(vm_j)`
• dot(vu_i, vm_j) should equal the rating of user i for movie j (or a good approximation)

Since not all values are set, we only look at the numbers that are set, and train the model to compute vectors that give a correct answer for the set ratings.

Using these embedding vectors, we can then predict the rating of user k for movie l by computing dot(vu_k, vm_l).

Having a global bias, a user bias for each user, and a movie bias for each movie should speed up the training process.
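In code, one such prediction with all three biases might look like this (made-up numbers, purely illustrative):

```python
import numpy as np

vu_k = np.array([0.8, 0.1])  # user k's latent feature vector
vm_l = np.array([1.2, 2.0])  # movie l's latent feature vector
global_bias = 2.5            # mean of all set ratings
user_bias_k = 0.3            # user k rates a bit above average
movie_bias_l = -0.2          # movie l is rated a bit below average

# dot product of the embeddings plus the global, user, and movie biases
predicted_rating = vu_k @ vm_l + global_bias + user_bias_k + movie_bias_l
print(predicted_rating)
```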


Oh… Thanks! This is the cause of the problem.

It's an interesting way to "impute" missing ratings with matrix factorization. I believe there is much more work ahead:

1. I just ran it, and observed that the cost goes down at first and then up when the matrix has a negative element. The fact that it went up can indicate (please confirm this) that as you train more, the imputed ratings start to deviate more from your initially desired ratings. This may or may not be a problem:

   1. This may not be a problem, because your `matrix` is a pretty random one and thus pattern-less, so maybe we should not have huge expectations in the first place.

   2. If this is a problem, you may want to change the convergence condition from checking the cost to checking the cost change, so that the cost can't go up.

2. Your cost takes the sum of the squared errors of all elements instead of taking their mean. Taking the mean is better, because the mean shouldn't be affected much by the size of your matrix.

3. Check whether your imputed ratings are within the range 0-5 or not. Perhaps you want to add some constraints to your gradient descent to make sure that happens, but obviously this will complicate things.

4. If you just want to do matrix factorization, then you may use any matrix to test your algorithm, but I would not add the "valid check". If you want to do rating imputation, I would not test the algorithm on just any matrix; instead I would probably use some real-world data, because your imputation won't make sense if your data is just a pattern-less matrix. I think you would agree that if your matrix is just any matrix, the imputed ratings can also be just any values, and there is no point in checking whether they are good or not.
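For the cost in particular, a masked version that skips the unset entries and averages over the rest might look like this (a standalone sketch, keeping the thread's "-1 means unset" convention):

```python
import numpy as np

def masked_mean_cost(reconstruction, matrix, is_set=lambda x: x >= 0):
    # mean squared error over the set entries only, so the unset
    # (-1) cells never pull the factorization toward -1
    mask = np.vectorize(is_set)(matrix)
    errors = (reconstruction - matrix)[mask]
    return np.mean(errors ** 2)

matrix = np.array([[3., -1.],
                   [1., 2.]])
recon = np.array([[3., 5.],
                  [1., 2.]])  # wrong only at the unset (-1) cell

print(masked_mean_cost(recon, matrix))  # 0.0: the unset cell is ignored
```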


The matrix is not random tho, I got it from this tutorial on matrix factorization. If I were to use a random matrix, I would've generated one, with a bigger size. Note here that the first and third rows are the same, which suggests that users 1 and 3 have the exact same taste (if we're going to assume this is a rating matrix); also notice that row 4 = row 1 + row 2.

1 - I don't know why that happens tbh; the matrix is not random. The only explanation I can think of right now is that the learning rate is too big, which makes gradient descent overshoot.

2 - Thank you, I missed this.

3 - This doesn't assume anything; it's just a class for factorizing matrices. But sure, other constraints would have to be added if this were a rating matrix. In fact, other methods take a weighted sum of the ratings of similar users, then divide it by the sum of weights, to ensure that the result won't be bigger than 5.

4 - It's actually taken from a tutorial.
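The weighted-sum idea from point 3 can be sketched as follows (illustrative similarities and ratings, not from the thread):

```python
import numpy as np

# similarity of the target user to three neighbours, and those
# neighbours' ratings of the movie in question
similarities = np.array([0.9, 0.5, 0.1])
ratings = np.array([5.0, 3.0, 1.0])

# a weighted average stays between min(ratings) and max(ratings),
# so the prediction can never leave the 0-5 rating scale
predicted = similarities @ ratings / similarities.sum()
print(predicted)
```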

Well… it may not be perfectly random, but it may not be perfectly informative either.

You need to assess the quality of your dataset: for example, how was the data collected? Are there any errors in the data? How representative is it?

These are questions I don't think you will get an answer to from the tutorial, and frankly, 5 data rows can't be very informative. So when I say "random", I don't just mean that you don't know the source of the data, but way more than that, including the questions above.