Hi, I have a problem with my implementation of collaborative filtering on the ml-latest-small dataset (MovieLens | GroupLens). I split the data into train and test sets and tried the algorithm from C3_W2_Collaborative_RecSys_Assignment. The results I checked for one user and for the test set are far from decent.
I’m completely stuck. Did I split the dataset incorrectly, or should I change something in the implementation? I am sending the notebook I was working on as an attachment.
recomender_test.ipynb (186.7 KB)
I found a problem with splitting the dataset. The correct approach to this problem is:
([RecSys] Implementation on Variants of SVD-Based Recommender System | by Tom Lin | Towards Data Science)
I am sending the revised version as an attachment. The code seems to work, but I’m still not sure if this is the correct approach to the problem.
If anyone would like to review it, I would greatly appreciate criticism.
recomender_test.ipynb (142.4 KB)
Hello @paweldro,
Great try!
The full dataset has 25M ratings of 62k movies + 162k users. You included only 101k ratings of 9.7k movies + 610 users.
In other words, the densities are 25M/62k/162k ~ 0.002 for the full set and 101k/9.7k/610 ~ 0.02 for yours.
So, you are using a denser part of the data. That’s good!
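The density comparison above can be checked with a few lines of Python. This is only a sketch: the counts are the rounded figures quoted in the post, so the results are approximate, and the helper name `density` is made up for illustration.

```python
# Density = number of ratings / (number of movies * number of users).
# The counts are the rounded figures quoted above, so results are approximate.
def density(num_ratings, num_movies, num_users):
    return num_ratings / (num_movies * num_users)

full_density = density(25e6, 62e3, 162e3)    # full 25M-rating dataset
small_density = density(101e3, 9.7e3, 610)   # the smaller subset used here

print(f"full:  {full_density:.4f}")
print(f"small: {small_density:.4f}")
print(f"ratio: {small_density / full_density:.1f}x denser")
```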
Suggestions:

0. Before doing (1), please print the values of your current Ymean_train and Ymean_test, and share them in your next reply.

1. In cell #38, use Ymean_train to normalize Y_test. There shouldn’t be a separate Ymean_test.

2. In cell #31, set a seed for reproducibility.

3. In the last two cells, show the values of their count.

4. Can we skip cell #24 for now? It seems irrelevant, and I lose track of the density because of it.

5. To be reviewer-friendly, I suggest removing unnecessary cells such as (2, 3, 4, 5, 6, 7, 12, 13, 22, 23, 27). This can save some memory too.

6. Again, to be reviewer-friendly, add code showing how others can extract your dataset from the full dataset.
Cheers,
Raymond
 add the following to the end:
print(
'Portion of users not exist in the training set: '
f'{(R_train.sum(axis=0) == 0).mean()}'
)
 If you skip cell #24, you will need to define num_feature yourself. It is a hyperparameter that sets the degrees of freedom of your model. If it is too large, the model can overfit the training data, which appears to be happening judging from your train/test losses. I suggest applying (1) & (2) with a self-defined num_feature = 1128, which is your current value, and then logging the MAEs. After that, try a few different num_feature values such as 50, 100, 500, 2000 and see how the MAEs change.
A table like the one below will be very helpful for the discussion. Make sure the results are reproducible.

| num_feature | lambda | epochs | train MAE | test MAE |
|---|---|---|---|---|
| 1128 | 1 | 2000 | … | … |
| … | 1 | 2000 | … | … |
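To fill in the MAE columns, the error should be averaged only over entries that were actually rated. A minimal sketch, assuming `P` holds denormalized predictions with the same (movies, users) shape as `Y`; the function name `masked_mae` and the toy arrays are made up for illustration and are not taken from the notebook.

```python
import numpy as np

def masked_mae(P, Y, R):
    """Mean absolute error over the entries where R == 1."""
    mask = R.astype(bool)
    return np.abs(P[mask] - Y[mask]).mean()

# Toy example: 2 movies x 3 users; 0 in R marks "not rated".
Y = np.array([[4.0, 0.0, 3.0],
              [0.0, 5.0, 1.0]])
R = np.array([[1, 0, 1],
              [0, 1, 1]])
P = np.array([[3.5, 2.0, 3.0],
              [4.0, 4.0, 2.0]])

mae = masked_mae(P, Y, R)
print(mae)
```

Calling the same function once with the training mask and once with the test mask gives the two MAE columns for a row of the table.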
Hello rmwkwok!
Thank you very much for the review! I am just now working on corrections.
 Value of your current Ymean_train and Ymean_test:
The values are 8785x1 arrays. Should I send them as csv?
They differ from each other due to the previous separation of Y_df into Y_test and Y_train. The following code snippet hides some user ratings for Y_train and moves them to Y_test:

Y_train = Y_df.to_numpy()
R_train = R_df.to_numpy()
X_train = X_df.to_numpy()
Y_test = np.zeros(Y_train.shape)
R_test = np.zeros(R_train.shape)
X_test = X_df.to_numpy()
for (x, y), value in np.ndenumerate(R_train):
    if R_train[x, y] == 1:
        r = random.random()
        if r >= 0.8:
            # Move this rating from the training set to the test set.
            Y_test[x, y] = Y_train[x, y]
            R_train[x, y] = 0
            Y_train[x, y] = 0
            R_test[x, y] = 1
Is this correct?
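For comparison, the same hold-out idea can be written without the Python loop. This is only a sketch under assumptions: `split_ratings` is a hypothetical helper, it uses NumPy's `default_rng` instead of the `random` module from the snippet above, and it returns new arrays rather than modifying `Y_train` and `R_train` in place.

```python
import numpy as np

def split_ratings(Y, R, test_fraction=0.2, seed=1234):
    """Move roughly `test_fraction` of the rated entries into a test set."""
    rng = np.random.default_rng(seed)
    rated = R == 1
    # Draw one uniform number per cell; only rated cells can go to the test set.
    to_test = rated & (rng.random(R.shape) < test_fraction)
    R_test = to_test.astype(R.dtype)
    R_train = (rated & ~to_test).astype(R.dtype)
    return Y * R_train, R_train, Y * R_test, R_test

# Toy data: 6 movies x 5 users with random ratings on a 0.5-step scale.
rng = np.random.default_rng(0)
R = (rng.random((6, 5)) < 0.6).astype(float)
Y = rng.integers(1, 11, size=(6, 5)) * 0.5 * R

Y_train, R_train, Y_test, R_test = split_ratings(Y, R)
print(int(R_train.sum()), int(R_test.sum()))
```

Every rating lands in exactly one of the two sets, which is the property the loop above also guarantees. A random split like this can still leave a user or movie with no training ratings, so a check on the column and row sums of `R_train` remains useful.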
 I apologize for the confusing name when reading the csv data (ml-25m). The imported files are the unedited csv from the dataset below (ml-latest-small):
Thanks again for the review. I’m taking the time to correct the suggested parts and will be sure to upload the revised code soon.
I have corrected the indicated things.

0. and 1. I will skip for now; in the post above I shared the values of Ymean_train and Ymean_test.

2. I added a seed for random. Now the results are repeatable.

3. I showed these values under the y_train and y_test split (cell #17). Now I calculate the MAE for all num_features values (cell #23), but the count values are the same in each case.

4. These operations were completely unnecessary. Removed.

5. I removed all unnecessary stuff. I hope it will be more pleasant to read now.

6. I added a link to the smaller dataset I used. It has not been edited:
https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Is this outcome correct?

I have tested this for 6 num_feature values: 1, 10, 50, 500, 1000, 2000
The results surprised me and I think I am doing something wrong.
MAE for the test set is not much different for num_feature = 1 and num_feature = 2000. I guess it shouldn’t look like that.
Is it related to the 0. and 1. points?
I also added some charts:
Do I understand correctly that the best number of features is 10? It seems so to me, because the gap between train and test error (the overfitting) is smallest for this value.
I would be very pleased if you would look at the applied corrections and results. I hope it’s a bit better.
The revised code is attached.
recomender_fix.ipynb (261.7 KB)
I recommend you start with lambda = 0, adjust the learning rate and number of iterations, to establish whether you have overfitting of the training set, and then hold all other factors constant while varying lambda.
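The procedure described above can be sketched on toy data. Everything below is an assumption made for illustration: the data is synthetic, the helper names (`train`, `masked_mae`) are hypothetical, and plain gradient descent stands in for the Adam optimizer used in the notebook. The point is only the experimental pattern: fix the seed, the learning rate, and the iteration count, then vary lambda alone.

```python
import numpy as np

def train(Y, R, lambda_, num_features=3, lr=0.005, iterations=300, seed=1234):
    """Plain gradient descent on the regularized squared-error cost."""
    rng = np.random.default_rng(seed)           # same start for every lambda
    num_movies, num_users = Y.shape
    X = 0.1 * rng.normal(size=(num_movies, num_features))
    W = 0.1 * rng.normal(size=(num_features, num_users))
    b = np.zeros((1, num_users))
    for _ in range(iterations):
        err = (X @ W + b - Y) * R               # errors on rated entries only
        grad_X = err @ W.T + lambda_ * X
        grad_W = X.T @ err + lambda_ * W
        grad_b = err.sum(axis=0, keepdims=True)
        X, W, b = X - lr * grad_X, W - lr * grad_W, b - lr * grad_b
    return X, W, b

def masked_mae(P, Y, R):
    return np.sum(np.abs(P - Y) * R) / np.sum(R)

# Synthetic "normalized ratings": 12 movies x 8 users, about 70% rated,
# with about 20% of the rated entries held out for testing.
rng = np.random.default_rng(0)
Y_all = rng.normal(size=(12, 8))
rated = rng.random((12, 8)) < 0.7
held_out = rated & (rng.random((12, 8)) < 0.2)
R_train = (rated & ~held_out).astype(float)
R_test = held_out.astype(float)

results = []
for lambda_ in [0, 0.1, 1, 10]:                 # everything else held constant
    X, W, b = train(Y_all * R_train, R_train, lambda_)
    P = X @ W + b
    results.append((lambda_,
                    masked_mae(P, Y_all, R_train),
                    masked_mae(P, Y_all, R_test)))

for lambda_, train_mae, test_mae in results:
    print(f"lambda={lambda_:<5} train MAE={train_mae:.3f} test MAE={test_mae:.3f}")
```

Comparing the train and test columns at lambda = 0 shows whether the model overfits before any regularization is added; the remaining rows show how the gap changes as lambda grows.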
My response first:
That’s fine. You may remove the print in subsequent notebooks, thanks.
I think it’s fine!
No problem at all! So this small dataset has a higher density, which is good!
Thanks. You may remove these prints as well.
Yes! It is a check for you and me to see. Now we know that every user is represented at least once in the training set.
You may remove this check too.
This post is about making things faster. All of your subsequent experiments will benefit from that.
However, it is fine that you skip this.
Below are some places I would change. They have different levels of impact on the speed. The general idea is to embrace the TensorFlow way, use tf.float32, and remove unnecessary operations.

Decorate the cost function with @tf.function:

@tf.function
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    ...

Initialize W with shape (num, num_users) so we do not need to do tf.transpose every time. This saves an operation. Changes are required in the definition of W, in cofi_cost_func_v, and in the code for making predictions.

Use only TensorFlow objects for training, and use tf.float32:
# Here we convert numpy arrays into tensors, and use tf.float32. We will use these tensors for the training.
Ynorm_train_tf = tf.convert_to_tensor(Ynorm_train, dtype=tf.float32)
Ynorm_test_tf = tf.convert_to_tensor(Ynorm_test, dtype=tf.float32)
R_train_tf = tf.convert_to_tensor(R_train, dtype=tf.float32)
R_test_tf = tf.convert_to_tensor(R_test, dtype=tf.float32)
...
W = tf.Variable(tf.random.normal((num, num_users),dtype=tf.float32), name='W')
X = tf.Variable(tf.random.normal((num_movies, num),dtype=tf.float32), name='X')
b = tf.Variable(tf.random.normal((1, num_users), dtype=tf.float32), name='b')
Calculate the test cost only when needed:

if iter % 20 == 0:
    cost_value_test = cofi_cost_func_v(X, W, b, Ynorm_test_tf, R_test_tf, lambda_)
...
Declare the training step as a tf.function:

for num in num_features:
    ...
    # Instantiate an optimizer.
    optimizer = keras.optimizers.Adam(learning_rate=1e-1)
    train_l = []
    test_l = []

    # Define this `train_step` function inside the for loop to avoid an error. There is a neater way, but let's live with it for now.
    @tf.function
    def train_step(X, W, b, Ynorm_train_tf, R_train_tf, lambda_, optimizer):
        # Use TensorFlow’s GradientTape
        # to record the operations used to compute the cost
        with tf.GradientTape() as tape:
            # Compute the cost (forward pass included in cost)
            cost_value = cofi_cost_func_v(X, W, b, Ynorm_train_tf, R_train_tf, lambda_)
        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss
        grads = tape.gradient(cost_value, [X, W, b])
        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients(zip(grads, [X, W, b]))
        return cost_value

    # Loop over iterations using the `train_step` defined per each `num_features` round
    for iter in range(iterations):
        cost_value = train_step(X, W, b, Ynorm_train_tf, R_train_tf, lambda_, optimizer)
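The change of layout for W suggested above can be illustrated with a small shape check. This is a NumPy sketch rather than TensorFlow code, and it assumes the original layout stored one row per user, i.e. shape (num_users, num); the names `W_old` and `W_new` are made up for the comparison.

```python
import numpy as np

rng = np.random.default_rng(0)
num_movies, num_users, num = 4, 3, 2

X = rng.normal(size=(num_movies, num))
b = rng.normal(size=(1, num_users))

W_old = rng.normal(size=(num_users, num))   # assumed original layout
W_new = W_old.T                             # suggested layout: (num, num_users)

pred_old = X @ W_old.T + b   # needs a transpose on every evaluation
pred_new = X @ W_new + b     # same predictions, no transpose

print(pred_new.shape)
print(np.allclose(pred_old, pred_new))
```

Both expressions produce a (num_movies, num_users) matrix of predictions, so only the places that multiply by W need to change.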
Feel the difference and the feasibility of doing more experiments!
You feel the difference once, and you will be motivated to spend time on this kind of speed optimization for the rest of your life
Here comes the main dish, part 1!
Let’s not look at the results yet, because we first need to do it right.
First, I can see why you would have skipped (1), because if I were you, I would have wanted to look at the effect of one change at a time. Training speed might also be a problem. (If you have other reasons, let me know, just so we can share how we think about things.)
I say we need to apply (1) no matter what, for two reasons:

If you replace the test set with real-world data, you will see that we cannot have a separate mean for it, because we do not have the ratings; the ratings are for us to predict.

The model is trained to predict the ratings on a scale set by Ymean_train, so whatever ratings the model predicts have to be "denormalized" by Ymean_train. This idea was also taught in courses 1 & 2.
So, this post is about asking you to apply (1).
Main dish part 2

With Ymean_test (which, as explained, is something we are not supposed to have), there is a chance that the test score will be better than it should be.
The range of the ratings is from 0.5 to 5 in steps of 0.5. In other words, an MAE of 0.5 to 1.0 means an error of one to two steps. This is not bad!
Some final remarks for the last few posts

Thanks for tidying up the notebook! As we build on top of the current code, things can grow too large, so it would be wonderful if the organization of the code keeps being taken care of.

I agree with Tom on more experiments around lambda.

I think with what we have covered so far, your notebook will have achieved what it was meant to.
Cheers,
Raymond
@paweldro Dear Pawel,
Also I am not sure if it is referenced in that course but you might find a reading of the related papers for the BellKor solution to the Netflix prize to be useful:
I mean, apart from the outsized contribution of SVD (Singular Value Decomposition), what is most interesting to me is that a bit of feature engineering turns out to be useful, and in the end they used an ensemble, so not just ‘one’ method.
Thank you very much! I have implemented your tips and am very impressed. Previously the calculations took me 15-20 minutes, now the same takes about two minutes. I will definitely use these tips in future implementations.
Now I fully understand why it is necessary. Thanks a lot. Corrected!
def normalizeRatings(Y, R):
    """
    Preprocess data by subtracting mean rating for every movie (every row).
    Only include real ratings R(i,j)=1.
    [Ynorm, Ymean] = normalizeRatings(Y, R) normalizes Y so that each movie
    has a rating of 0 on average. Unrated movies then have a mean rating (0).
    Returns the mean rating in Ymean.
    """
    # The small constant avoids division by zero for movies with no ratings.
    Ymean = (np.sum(Y*R, axis=1)/(np.sum(R, axis=1)+1e-12)).reshape(-1, 1)
    Ynorm = Y - np.multiply(Ymean, R)
    return(Ynorm, Ymean)

Ynorm_train, Ymean_train = normalizeRatings(Y_train, R_train)
# Mask with R_test so that only the held-out ratings are shifted by the training mean.
Ynorm_test = Y_test - np.multiply(Ymean_train, R_test)
And for denormalization:
ptrain = p + Ymean_train
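A small worked example may make the normalization and denormalization steps easier to verify by hand. It is a sketch on made-up 2 x 3 arrays, and it assumes the test ratings are masked with their own indicator matrix (R_test); the helper `normalize_with_mean` and the all-zero stand-in for the model output `p` are illustrative only.

```python
import numpy as np

def normalize_with_mean(Y, R, Ymean):
    """Subtract a given per-movie mean from the rated entries only."""
    return Y - Ymean * R

# 2 movies x 3 users; each rating is in either the train or the test mask.
Y_train = np.array([[4.0, 2.0, 0.0],
                    [0.0, 5.0, 3.0]])
R_train = np.array([[1, 1, 0],
                    [0, 1, 1]])
Y_test = np.array([[0.0, 0.0, 5.0],
                   [1.0, 0.0, 0.0]])
R_test = np.array([[0, 0, 1],
                   [1, 0, 0]])

# Per-movie mean computed from the training ratings only.
Ymean_train = (np.sum(Y_train * R_train, axis=1)
               / (np.sum(R_train, axis=1) + 1e-12)).reshape(-1, 1)

Ynorm_train = normalize_with_mean(Y_train, R_train, Ymean_train)
Ynorm_test = normalize_with_mean(Y_test, R_test, Ymean_train)

# Predictions live on the normalized scale; add the training mean back.
p = np.zeros_like(Y_train)      # stand-in for the model output
p_denorm = p + Ymean_train

print(Ymean_train.ravel())
print(Ynorm_test)
```

Here the training means are 3 and 4, so the held-out rating of 5 for the first movie becomes 2 on the normalized scale, and a normalized prediction of 0 is reported as the movie's training mean.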
I will share tests from the revised version of the notebook:
Now the results seem more reasonable. I am very grateful for your help, thanks!
I attach the current version of the notebook, maybe it will be useful to someone.
recomender_fix_2.ipynb (256.5 KB)
Now, following TMosh’s advice, I will work on lambda.
@paweldro I have to admit I haven’t dug into your code, though you seem to be progressing. Out of curiosity, though, I do have to ask what exactly you mean by ‘number of features’ and how, exactly, you are magically increasing those (?)
Perhaps you mean some other term? In an unsophisticated sort of way, I like to think of ‘features’ as the points on which the data ‘pivots’, and in the end you only have so many of those (unless you’re running something like a full FFT and just want to totally overfit the whole thing).
Just curious.
Hello Nevermnd!
It wasn’t mentioned, but I’m pleased to learn about it.
Thank you for the papers! Feature engineering is a topic I would like to explore more. I have also heard that many people who win competitions on Kaggle use ensemble.
Let me explain! If I’m making a mistake, I’d be glad to hear about it.
I use this cost function:
The parameters X, W and b are initialized as arrays:
for num in num_features:
    # Set Initial Parameters (W, X), use tf.Variable to track these variables
    tf.random.set_seed(1234) # for consistent results
    W = tf.Variable(tf.random.normal((num, num_users),dtype=tf.float32), name='W')
    X = tf.Variable(tf.random.normal((num_movies, num),dtype=tf.float32), name='X')
    b = tf.Variable(tf.random.normal((1, num_users), dtype=tf.float32), name='b')
Where num_features is:
num_features = [1, 10, 50, 500, 1000, 2000]
The for loop trains six models for all values (1, 10, 50, 500, 1000, 2000). The results for all six models are then displayed.
Now I’m changing the approach and I’m going to choose a fixed size X (num_movies = 9724, num_features = 1000) and adjust the lambda parameter:
num_features = 1000
lambda_ = [0, 0.01, 0.1, 1, 10, 100]
for lamb in lambda_:
    # Set Initial Parameters (W, X), use tf.Variable to track these variables
    tf.random.set_seed(1234) # for consistent results
    W = tf.Variable(tf.random.normal((num_features, num_users),dtype=tf.float32), name='W')
    X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float32), name='X')
    b = tf.Variable(tf.random.normal((1, num_users), dtype=tf.float32), name='b')
Finally, cost functions for 6 models with different lambda parameters will be displayed. Below I send what it looks like at this point, but it still needs some work from what I can see:
I’m concerned about the shape of the lambda = 0 curve.
Have you verified the initial Adam optimizer learning rate?