Collaborative Filtering - problem with implementation on raw dataset

Hi, I have a problem implementing collaborative filtering on the ml-latest-small dataset (MovieLens | GroupLens). I split the data into train and test sets and tried the algorithm from C3_W2_Collaborative_RecSys_Assignment. The results I checked for one user and for the test set are far from decent.
I’m completely stuck. Did I split the dataset incorrectly, or should I change something in the implementation? I am sending the notebook I was working on as an attachment.
recomender_test.ipynb (186.7 KB)


I found a problem with splitting the dataset. The correct approach to this problem is:

([RecSys] Implementation on Variants of SVD-Based Recommender System | by Tom Lin | Towards Data Science)

I am sending the revised version as an attachment. The code seems to work, but I’m still not sure if this is the correct approach to the problem.

If anyone would like to review it, I would greatly appreciate criticism. :slight_smile:

recomender_test.ipynb (142.4 KB)


Hello @paweldro,

Great try!

The full dataset has 25M ratings of 62k movies + 162k users. You included only 101k ratings of 9.7k movies + 610 users.

In other words, the densities are 25M/62k/162k ~ 0.002 for the full set and 101k/9.7k/610 ~ 0.02 for yours.

So, you are using a denser part of the data. That’s good!
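
To make the arithmetic explicit, here is a small sketch that recomputes both densities from the rounded counts quoted above. The helper name `density` is only for illustration and is not part of the notebook.

```python
def density(num_ratings, num_movies, num_users):
    """Fraction of the movie-by-user rating matrix that is actually filled in."""
    return num_ratings / (num_movies * num_users)

# Rounded counts quoted in this post.
full = density(25_000_000, 62_000, 162_000)   # full 25M dataset
small = density(101_000, 9_700, 610)          # ml-latest-small

print(f"full dataset density:  {full:.4f}")
print(f"small dataset density: {small:.4f}")
print(f"ratio small / full:    {small / full:.1f}")
```

The ratio printed at the end shows how many times denser the small set is than the full one.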


  0. Before doing (1), please print the values of your current Ymean_train and Ymean_test, and share them in your next reply.

  1. In cell #38, use Ymean_train to normalize Y_test. There shouldn’t be a separate Ymean_test.

  2. In cell #31, set a seed for reproducibility.

  3. In the last two cells, show the values of their count.

  4. Can we not do cell #24 now? It seems irrelevant and I lose track of the density because of that.

  5. To be reviewer-friendly, I suggest removing unnecessary cells such as (2, 3, 4, 5, 6, 7, 12, 13, 22, 23, 27). This can save some memory too.

  6. Again, to be reviewer-friendly, add code showing how others can extract your dataset from the full dataset.


  7. Add the following to the end:
    'Portion of users not exist in the training set: '
    f'{(R_train.sum(axis=0) == 0).mean()}'
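
As a self-contained illustration of that check, here is a sketch that assumes R_train is a 0/1 array with one row per movie and one column per user. The tiny matrix is made up.

```python
import numpy as np

# Hypothetical 4-movie x 3-user indicator matrix.
# 1 means "this rating exists and was kept in the training set".
R_train = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [1, 0, 0],
    [0, 0, 1],
])

# Summing over axis 0 counts the training ratings per user (per column).
ratings_per_user = R_train.sum(axis=0)

# Share of users who have no rating at all in the training set.
portion_missing = (ratings_per_user == 0).mean()

print("Portion of users with no rating in the training set:", portion_missing)
```

A value of 0 means every user is represented at least once in training. Anything above 0 means the model has nothing to learn from for those users.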
  8. If you skip cell #24, you will need to define num_feature yourself. It is a hyperparameter for the degrees of freedom of your model. If it is too large, you can overfit the model to the training data, which appears to be the case judging from your train/test losses. I suggest you apply (1) & (2) with a self-defined num_feature = 1128, which is your current value. Then log the MAEs. After that, try a few different num_feature values such as 50, 100, 500, 2000 and see how the MAEs change.
    A table like the one below will be very helpful for the discussion. Make sure the results are reproducible.
| num_feature | lambda | epochs | train MAE | test MAE |
|---|---|---|---|---|
| 1128 | 1 | 2000 | | |
| | 1 | 2000 | | |

Hello rmwkwok!
Thank you very much for the review! I am just now working on corrections.

  0. Values of your current Ymean_train and Ymean_test:

    The values are 8785x1 arrays. Should I send them as csv?
    They differ from each other because Y_df was previously separated into Y_test and Y_train. The following code snippet hides some user ratings in Y_train and moves them to Y_test:
Y_train = Y_df.to_numpy()
R_train = R_df.to_numpy()
X_train = X_df.to_numpy()

Y_test = np.zeros(Y_train.shape)
R_test = np.zeros(R_train.shape)
X_test = X_df.to_numpy()

for (x,y), value in np.ndenumerate(R_train):
    if R_train[x, y] == 1:
        r = random.random()
        if r >= 0.8:
            Y_test[x, y] = Y_train[x,y]
            R_train[x,y] = 0
            Y_train[x,y] = 0
            R_test[x,y] = 1

Is this correct?
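
For comparison, here is a vectorized sketch of the same idea: roughly 20% of the observed ratings are moved to the test set, using a seeded NumPy generator instead of a Python loop. The function name `split_ratings`, its parameters, and the toy data are illustrative assumptions, not code from the notebook.

```python
import numpy as np

def split_ratings(Y, R, test_fraction=0.2, seed=1234):
    """Move a random fraction of the observed ratings into a test set."""
    rng = np.random.default_rng(seed)
    # One uniform draw per cell; only cells holding a rating can go to test.
    to_test = (R == 1) & (rng.random(R.shape) < test_fraction)

    R_test = to_test.astype(R.dtype)
    R_train = R * (1 - R_test)      # hide the moved ratings from training
    Y_test = Y * R_test
    Y_train = Y * R_train
    return Y_train, R_train, Y_test, R_test

# Toy data (made up): 50 movies, 20 users, ratings from 0.5 to 5.0.
rng = np.random.default_rng(0)
R = (rng.random((50, 20)) < 0.3).astype(int)
Y = R * rng.integers(1, 11, size=R.shape) * 0.5

Y_train, R_train, Y_test, R_test = split_ratings(Y, R)
print("train ratings:", R_train.sum(), "test ratings:", R_test.sum())
```

Every observed rating ends up in exactly one of the two sets, so R_train + R_test reproduces R.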

  6. I apologize for the confusing name when reading the csv data (ml-25m). The imported files are the unedited csv from the dataset below (ml-latest-small):

Thanks again for the review. I’m taking the time to correct the suggested parts and will be sure to upload the revised code soon.


I have corrected the indicated things :slight_smile:

  0. and 1. I will skip these for now; in the post above I shared the values of Ymean_train and Ymean_test.
  2. I added a seed for random. Now the results are repeatable.

  3. I have shown these values under the y_train and y_test split (cell #17). Now I calculate the MAE for all num_features values (cell #23), but the count values are the same in each case.

Screenshot from 2024-05-07 01-20-46

  4. These operations were completely unnecessary. Removed :slight_smile:

  5. I removed all unnecessary stuff. I hope it will be more pleasant to read now. :slight_smile:

  6. I added a link to the smaller dataset I used. It has not been edited:

  7. Is this outcome correct?
    Screenshot from 2024-05-07 01-36-08

  8. I have tested this for 6 num_feature values: 1, 10, 50, 500, 1000, 2000

Screenshot from 2024-05-07 01-41-11

The results surprised me and I think I am doing something wrong.
MAE for the test set is not much different for num_feature = 1 and num_feature = 2000. I guess it shouldn’t look like that.
Is it related to points 0 and 1?

I also added some charts:

Do I understand correctly that the best number of features is 10? It seems so to me because the overfitting is lowest for this value.

I would be very pleased if you would look at the applied corrections and results. I hope it’s a bit better. :smiley:
The revised code is attached.
recomender_fix.ipynb (261.7 KB)


I recommend you start with lambda = 0, adjust the learning rate and number of iterations, to establish whether you have overfitting of the training set, and then hold all other factors constant while varying lambda.
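
A minimal sketch of that procedure, using plain NumPy gradient descent on made-up data rather than the notebook's TensorFlow/Adam setup. The helper names, the learning rate, the iteration count, and the lambda grid are all assumptions chosen for the toy problem. Only lambda changes between runs.

```python
import numpy as np

def cost_and_grads(X, W, b, Y, R, lambda_):
    """Regularized squared-error cost over rated entries, with its gradients."""
    err = (X @ W + b - Y) * R                      # zero wherever R == 0
    cost = 0.5 * np.sum(err ** 2) + 0.5 * lambda_ * (np.sum(X ** 2) + np.sum(W ** 2))
    grad_X = err @ W.T + lambda_ * X
    grad_W = X.T @ err + lambda_ * W
    grad_b = err.sum(axis=0, keepdims=True)
    return cost, grad_X, grad_W, grad_b

def masked_mae(X, W, b, Y, R):
    """Mean absolute error over rated entries only."""
    return np.sum(np.abs(X @ W + b - Y) * R) / np.sum(R)

def train(Y, R, num_features, lambda_, lr, iterations, seed=1234):
    """Plain gradient descent from a fixed seed, so runs differ only in lambda_."""
    rng = np.random.default_rng(seed)
    num_movies, num_users = Y.shape
    X = rng.normal(scale=0.1, size=(num_movies, num_features))
    W = rng.normal(scale=0.1, size=(num_features, num_users))
    b = np.zeros((1, num_users))
    for _ in range(iterations):
        _, gX, gW, gb = cost_and_grads(X, W, b, Y, R, lambda_)
        X -= lr * gX
        W -= lr * gW
        b -= lr * gb
    return X, W, b

# Made-up data: a noisy rank-2 matrix, about half observed, 80/20 split.
rng = np.random.default_rng(0)
Y_full = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20)) + 0.1 * rng.normal(size=(30, 20))
observed = rng.random(Y_full.shape) < 0.5
in_test = observed & (rng.random(Y_full.shape) < 0.2)
R_train = (observed & ~in_test).astype(float)
R_test = in_test.astype(float)

results = []
for lambda_ in [0.0, 0.1, 1.0, 10.0]:
    X, W, b = train(Y_full * R_train, R_train, num_features=3,
                    lambda_=lambda_, lr=0.003, iterations=1500)
    results.append({
        "lambda": lambda_,
        "train_mae": masked_mae(X, W, b, Y_full, R_train),
        "test_mae": masked_mae(X, W, b, Y_full, R_test),
    })

for row in results:
    print(row)
```

Comparing the train and test columns at lambda = 0 shows whether there is a gap to close. The remaining rows show how that gap responds to regularization alone.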


My response first:

That’s fine. You may remove the print in subsequent notebooks, thanks :wink:

I think it’s fine!

No problem at all! So this small dataset has a higher density, which is good!

Thanks. You may remove these prints as well.

Yes! It is a check for you and me to see. Now we know that every user was represented at least once in the training set.

You may remove this check too.


This post is about making things faster. All of your subsequent experiments will benefit from that.

However, it is fine that you skip this.

Below are some places I would change. They have different levels of impact on the speed. The general idea is to embrace the tensorflow way, use tf.float32, and remove unnecessary operations.

  1. decorate with @tf.function
@tf.function
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    ...  # function body unchanged
  2. Initialize W with shape (num, num_users) so we do not need to call tf.transpose every time. This saves an operation. Changes are required in the definition of W, in cofi_cost_func_v, and in the code for making predictions.

  3. Use only tensorflow objects for training, and use tf.float32

# Here we convert numpy arrays into tensors, and use tf.float32. We will use these tensors for the training.
Ynorm_train_tf = tf.convert_to_tensor(Ynorm_train, dtype=tf.float32)
Ynorm_test_tf = tf.convert_to_tensor(Ynorm_test, dtype=tf.float32)
R_train_tf = tf.convert_to_tensor(R_train, dtype=tf.float32)
R_test_tf = tf.convert_to_tensor(R_test, dtype=tf.float32)


    W = tf.Variable(tf.random.normal((num, num_users),dtype=tf.float32),  name='W')
    X = tf.Variable(tf.random.normal((num_movies, num),dtype=tf.float32),  name='X')
    b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float32),  name='b')
  4. calculate test cost only when needed
        if iter % 20 == 0:
            cost_value_test = cofi_cost_func_v(X, W, b, Ynorm_test_tf, R_test_tf, lambda_)
  5. declare training step as a tf.function
for num in num_features:

    # Instantiate an optimizer.
    optimizer = keras.optimizers.Adam(learning_rate=1e-1)

    train_l = []
    test_l = []

    # Define this `train_step` function inside the for loop to avoid error. There is a neater way, but let's live with it for now.
    @tf.function
    def train_step(X, W, b, Ynorm_train_tf, R_train_tf, lambda_, optimizer):
        # Use TensorFlow’s GradientTape
        # to record the operations used to compute the cost 
        with tf.GradientTape() as tape:
            # Compute the cost (forward pass included in cost)
            cost_value = cofi_cost_func_v(X, W, b, Ynorm_train_tf, R_train_tf, lambda_)
        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss
        grads = tape.gradient( cost_value, [X,W,b] )
        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients( zip(grads, [X,W,b]) )
        return cost_value
    # Loop over iterations using the `train_step` defined per each `num_features` round
    for iter in range(iterations):

        cost_value = train_step(X, W, b, Ynorm_train_tf, R_train_tf, lambda_, optimizer)

Feel the difference and the feasibility of doing more experiments! :wink: :wink:

You feel the difference once, and you will be motivated to spend time on this kind of speed optimization for the rest of your life :wink: :wink: :wink: :wink:


Here comes the main dish - part 1!

Let’s not look at the result now because we first need to do it right :wink:

First, I can see why you would have skipped (1), because if I were you, I would have wanted to look at the effect of one change at a time. Training speed might also have been a problem. (If you have other reasons, let me know - just to share how we think about things :wink: ).

I say we need to apply (1) no matter what, for two reasons:

  1. if you replace the test set with real-world data, you will see that we cannot have a separate mean for it, because we do not have the ratings - the ratings are for us to predict.

  2. the model is trained to predict ratings on the scale set by Ymean_train, so whatever ratings the model predicts have to be "denormalized" with Ymean_train. This idea was also taught in courses 1 & 2. :wink:

So, this post is about asking you to apply (1).


Main dish part 2

  1. with Ymean_test (which, as explained, is something we are not supposed to have), there is a chance that the test score will look better than it should.

  2. the range of the ratings is from 0.5 to 5 with an increment step of 0.5. In other words, a MAE of the level 0.5 to 1.0 means an error of one to two steps. This is not bad!
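
To tie the two points together, here is a small sketch of how the test MAE could be computed: predictions are denormalized with the training mean and the error is averaged over the held-out ratings only. The function name `masked_mae` and all numbers are made up for illustration.

```python
import numpy as np

def masked_mae(pred_norm, Ymean_train, Y_true, R):
    """MAE over rated entries only, after adding the training mean back."""
    pred = pred_norm + Ymean_train            # denormalize with the *training* mean
    return np.sum(np.abs(pred - Y_true) * R) / np.sum(R)

# Toy example: 2 movies x 2 users, one held-out rating per movie.
Y_test = np.array([[4.0, 0.0],
                   [0.0, 2.5]])
R_test = np.array([[1, 0],
                   [0, 1]])
Ymean_train = np.array([[3.5],
                        [3.0]])               # per-movie mean from the training set
pred_norm = np.array([[0.0, 0.2],
                      [0.1, 0.0]])            # model output on the normalized scale

mae = masked_mae(pred_norm, Ymean_train, Y_test, R_test)
print("test MAE:", mae)
```

In this toy case both held-out predictions are off by 0.5, which is exactly one rating step.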

Some final remarks for the last few posts

  1. Thanks for tidying up the notebook! As we build on top of the current code, things can grow too large… if the organization of the code and everything else is kept in good order, that would be wonderful.

  2. I agree with Tom on more experiments around lambda.

  3. I think with what we have covered so far, your notebook has achieved what it was meant to.



@paweldro Dear Pawel,

Also I am not sure if it is referenced in that course but you might find a reading of the related papers for the BellKor solution to the Netflix prize to be useful:

I mean, apart from the outsized contribution of SVD (Singular Value Decomposition), what is most interesting to me is that a bit of feature engineering turned out to be useful, and in the end they used an ensemble, so not just ‘one’ method.


Thank you very much! I have implemented your tips and am very impressed. Previously the calculations took me 15-20 minutes, now the same takes about two minutes. I will definitely use these tips in future implementations.


Now I fully understand why it is necessary. Thanks a lot. Corrected!

import numpy as np

def normalizeRatings(Y, R):
    """
    Preprocess data by subtracting mean rating for every movie (every row).
    Only include real ratings R(i,j)=1.
    [Ynorm, Ymean] = normalizeRatings(Y, R) normalizes Y so that each movie
    has a rating of 0 on average. Unrated movies then have a mean rating (0).
    Returns the mean rating in Ymean.
    """
    Ymean = (np.sum(Y*R,axis=1)/(np.sum(R, axis=1)+1e-12)).reshape(-1,1)
    Ynorm = Y - np.multiply(Ymean, R)
    return(Ynorm, Ymean)
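Ynorm_train, Ymean_train = normalizeRatings(Y_train, R_train)
# Normalize the test ratings with the training mean, masked by R_test so that
# only the held-out ratings are shifted (masking by R_train would leave them raw).
Ynorm_test = Y_test - np.multiply(Ymean_train, R_test)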

And for denormalization:

ptrain = p + Ymean_train

I will share tests from the revised version of the notebook:
Screenshot from 2024-05-07 21-54-07

Now the results seem more reasonable. I am very grateful for your help, thanks!
I attach the current version of the notebook, maybe it will be useful to someone.
recomender_fix_2.ipynb (256.5 KB)

Now, following TMosh’s advice, I will work on lambda :slight_smile:


@paweldro I have to admit I haven’t dug into your code, though you seem to be progressing. Out of curiosity, I do have to ask what exactly you mean by ‘number of features’ and how, exactly, you are magically increasing those (?)

Perhaps you mean some other term ? Or in an unsophisticated sort of way, I like to think of ‘features’ as the points on which the data ‘pivots’, and in the end, you only have so many of those (Unless you’re running like a full FFT and just want to totally overfit the whole thing).

Just curious.


Hello Nevermnd!

It wasn’t mentioned, but I’m pleased to learn about it.
Thank you for the papers! Feature engineering is a topic I would like to explore more. I have also heard that many people who win competitions on Kaggle use ensemble.


Let me explain! If I’m making a mistake, I’d be glad to hear about it.

I use this cost function:
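
(The formula image did not carry over. Assuming it is the regularized cost from the course assignment, it would read as follows in the course's notation, where $r(i,j)=1$ marks that user $j$ rated movie $i$, and $k$ runs over the features.)

```latex
J = \frac{1}{2} \sum_{(i,j)\,:\,r(i,j)=1}
      \left( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)} \right)^{2}
  + \frac{\lambda}{2} \sum_{j} \sum_{k} \left( w^{(j)}_{k} \right)^{2}
  + \frac{\lambda}{2} \sum_{i} \sum_{k} \left( x^{(i)}_{k} \right)^{2}
```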

The parameters X, W and b are initialized as arrays:

for num in num_features:
    # Set Initial Parameters (W, X), use tf.Variable to track these variables
    tf.random.set_seed(1234) # for consistent results
    W = tf.Variable(tf.random.normal((num,  num_users),dtype=tf.float32),  name='W')
    X = tf.Variable(tf.random.normal((num_movies, num),dtype=tf.float32),  name='X')
    b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float32),  name='b')

Where num_features is:

num_features = [1, 10, 50, 500, 1000, 2000]

The for loop trains six models for all values (1, 10, 50, 500, 1000, 2000). The results for all six models are then displayed.
Now I’m changing the approach and I’m going to choose a fixed size X (num_movies = 9724, num_features = 1000) and adjust the lambda parameter:

num_features = 1000
lambda_ = [0, 0.01, 0.1, 1, 10, 100]

for lamb in lambda_:
    # Set Initial Parameters (W, X), use tf.Variable to track these variables
    tf.random.set_seed(1234) # for consistent results
    W = tf.Variable(tf.random.normal((num_features,  num_users),dtype=tf.float32),  name='W')
    X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float32),  name='X')
    b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float32),  name='b')

Finally, cost functions for 6 models with different lambda parameters will be displayed. Below I send what it looks like at this point, but it still needs some work from what I can see:


I’m concerned about the shape of the lambda = 0 curve.
Have you verified the initial Adam optimizer learning rate?