Collaborative Filtering - problem with implementation on raw dataset

paweldro · May 5, 2024, 3:57pm

Hi, I have a problem implementing collaborative filtering on the ml-latest-small dataset (MovieLens | GroupLens). I split the data into train and test sets and tried the algorithm from C3_W2_Collaborative_RecSys_Assignment. The results I checked for one user and for the test set are far from decent.
I’m completely stuck. Did I split the dataset incorrectly, or should I change something in the implementation? I am sending the notebook I was working on as an attachment.
recomender_test.ipynb (186.7 KB)

paweldro · May 5, 2024, 6:22pm

I found a problem with splitting the dataset. The correct approach to this problem is:

([RecSys] Implementation on Variants of SVD-Based Recommender System | by Tom Lin | Towards Data Science)

I am sending the revised version as an attachment. The code seems to work, but I’m still not sure if this is the correct approach to the problem.

If anyone would like to review it, I would greatly appreciate criticism.

recomender_test.ipynb (142.4 KB)

rmwkwok · May 6, 2024, 7:11am

Hello @paweldro,

Great try!

The full dataset has 25M ratings of 62k movies + 162k users. You included only 101k ratings of 9.7k movies + 610 users.

In other words, the densities are 25M/62k/162k ~ 0.002 for the full set and 101k/9.7k/610 ~ 0.02 for yours.

So, you are using a denser part of the data. That’s good!
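
The density arithmetic above can be checked with a few lines. This is only a sketch using the rounded figures quoted in this post; the dictionary and function names are made up for illustration.

```python
# Rounded figures quoted in the post above.
full = {"ratings": 25_000_000, "movies": 62_000, "users": 162_000}
small = {"ratings": 101_000, "movies": 9_700, "users": 610}


def density(d):
    """Fraction of the movie x user matrix that actually holds a rating."""
    return d["ratings"] / (d["movies"] * d["users"])


density_full = density(full)
density_small = density(small)
print(f"full:  {density_full:.4f}")
print(f"small: {density_small:.4f}")
```

The small set comes out several times denser than the full one, which is the point being made here.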

Suggestions:

0. Before doing (1), please print the values of your current Ymean_train and Ymean_test, and share them in your next reply.
1. In cell #38, use Ymean_train to normalize Y_test. There shouldn’t be a separate Ymean_test.
2. In cell #31, set a seed for reproducibility.
3. In the last two cells, show the values of their count.
4. Can we skip cell #24 for now? It seems irrelevant, and I lose track of the density because of it.
5. To be reviewer-friendly, I suggest removing unnecessary cells such as (2, 3, 4, 5, 6, 7, 12, 13, 22, 23, 27). This can save some memory too.
6. Again, to be reviewer-friendly, add code showing how others can extract your dataset from the full dataset.

Cheers,
Raymond

rmwkwok · May 6, 2024, 7:25am

add the following to the end:

print(
    'Portion of users not exist in the training set: '
    f'{(R_train.sum(axis=0) == 0).mean()}'
)
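
To see what this check measures, here is a toy example. It assumes the same layout as in the notebook (rows are movies, columns are users); `R_toy` is a made-up mask, not data from the thread.

```python
import numpy as np

# Toy mask: 2 movies (rows) x 3 users (columns); 1 means "rated".
R_toy = np.array([
    [1, 0, 0],
    [1, 0, 1],
])

# Summing over axis 0 counts the ratings each user has in the set.
ratings_per_user = R_toy.sum(axis=0)

# Users with zero ratings never appear in training; take their share.
portion_missing = (ratings_per_user == 0).mean()
print(portion_missing)
```

Here one of the three users has no ratings at all, so the printed portion is one third. A value of 0 means every user is represented at least once.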

rmwkwok · May 6, 2024, 7:30am

If you skip cell #24, you will need to define num_feature yourself. It is a hyperparameter that controls the degrees of freedom of your model. If it is too large, the model can overfit the training data, which is what your train/test losses suggest. I suggest you apply (1) & (2) with a self-defined num_feature = 1128, which is your current value, and log the MAEs. After that, try a few different values of num_feature, such as 50, 100, 500, 2000, and see how the MAEs change.
A table like the one below will be very helpful for the discussion. Make sure the results are reproducible.

num_feature	lambda	epochs	train MAE	test MAE
1128	1	2000	…	…
…	1	2000	…	…
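
The thread does not show how the MAE columns are computed. A minimal sketch, assuming the MAE is averaged only over entries that actually have a rating (R == 1); `masked_mae` and the toy arrays are hypothetical and not taken from the notebook:

```python
import numpy as np


def masked_mae(Y_true, Y_pred, R):
    """Mean absolute error over the entries where R == 1 only."""
    mask = R.astype(bool)
    return np.abs(Y_true[mask] - Y_pred[mask]).mean()


# Toy data: only the two diagonal entries are real ratings.
Y_true = np.array([[4.0, 0.0], [0.0, 2.5]])
Y_pred = np.array([[3.5, 1.0], [2.0, 3.5]])
R = np.array([[1, 0], [0, 1]])

mae = masked_mae(Y_true, Y_pred, R)  # (|4.0 - 3.5| + |2.5 - 3.5|) / 2
print(mae)
```

Calling the same function with (Y_train, R_train) and (Y_test, R_test) would fill the two MAE columns, as long as the predictions are on the same scale as the ratings.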

paweldro · May 6, 2024, 6:11pm

Hello rmwkwok!
Thank you very much for the review! I am just now working on corrections.

Value of your current Ymean_train and Ymean_test:

Screenshot from 2024-05-06 19-21-40

The values are 8785x1 arrays. Should I send them as csv?
They differ from each other because Y_df was previously separated into Y_train and Y_test. The following code snippet hides some user ratings in Y_train and moves them to Y_test:

Y_train = Y_df.to_numpy()
R_train = R_df.to_numpy()
X_train = X_df.to_numpy()

Y_test = np.zeros(Y_train.shape)
R_test = np.zeros(R_train.shape)
X_test = X_df.to_numpy()

for (x,y), value in np.ndenumerate(R_train):
    if R_train[x, y] == 1:
        r = random.random()
        
        if r >= 0.8:
            Y_test[x, y] = Y_train[x,y]
            R_train[x,y] = 0
            Y_train[x,y] = 0
            R_test[x,y] = 1

Is this correct?
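
For comparison, the same kind of 80/20 hold-out can be sketched without the Python loop. This is only an illustration under assumptions: `split_ratings`, its parameters, and the toy data are hypothetical, and NumPy's `default_rng` is used in place of `random.random()` so that the draw is seeded and vectorized.

```python
import numpy as np


def split_ratings(Y, R, test_fraction=0.2, seed=1234):
    """Move roughly `test_fraction` of the rated entries into a test set."""
    rng = np.random.default_rng(seed)
    Y = np.asarray(Y, dtype=float)
    R = np.asarray(R)
    # One uniform draw per cell; only cells that hold a rating can be moved.
    to_test = (R == 1) & (rng.random(R.shape) < test_fraction)
    R_test = to_test.astype(R.dtype)
    R_train = np.where(to_test, 0, R)
    Y_test = np.where(to_test, Y, 0.0)
    Y_train = np.where(to_test, 0.0, Y)
    return Y_train, R_train, Y_test, R_test


# Toy data: 50 movies x 20 users, about 30% rated, ratings in 0.5 steps.
toy_rng = np.random.default_rng(0)
R = (toy_rng.random((50, 20)) < 0.3).astype(int)
Y = R * toy_rng.integers(1, 11, size=R.shape) * 0.5

Y_train, R_train, Y_test, R_test = split_ratings(Y, R)
print(R_train.sum(), R_test.sum())
```

Because the function returns new arrays, the original Y and R are left untouched, and fixing `seed` makes the split repeatable.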

I apologize for the confusing name when reading the csv data (ml-25m). The imported files are the unedited csv from the dataset below (ml-latest-small):

Thanks again for the review. I’m taking the time to correct the suggested parts and will be sure to upload the revised code soon.

paweldro · May 7, 2024, 12:07am

I have corrected the indicated things:

0. and 1. I will skip for now; in the post above I shared the values of Ymean_train and Ymean_test.

2. I added a seed for random. Now the results are repeatable.
3. I have shown these values under the y_train and y_test split (cell #17). Now I calculate MAE for all num_features values (cell #23), but the count values are the same in each case.

Screenshot from 2024-05-07 01-20-46

4. These operations were completely unnecessary. Removed.
5. I removed all unnecessary stuff. I hope it will be more pleasant to read now.
6. I added a link to the smaller dataset I used. It has not been edited:
https://files.grouplens.org/datasets/movielens/ml-latest-small.zip
Is this outcome correct?
I have tested this for 6 num_feature values: 1, 10, 50, 500, 1000, 2000

Screenshot from 2024-05-07 01-41-11

The results surprised me and I think I am doing something wrong.
The MAE for the test set is not much different for num_feature = 1 and num_feature = 2000. I guess it shouldn’t look like that.
Is it related to points 0 and 1?

I also added some charts:

Do I understand correctly that the best number of features is 10? It seems so to me because overfitting is lowest for this value.

I would be very pleased if you would look at the applied corrections and results. I hope it’s a bit better.
The revised code is attached.
recomender_fix.ipynb (261.7 KB)

TMosh · May 7, 2024, 12:25am

I recommend you start with lambda = 0, adjust the learning rate and number of iterations, to establish whether you have overfitting of the training set, and then hold all other factors constant while varying lambda.

rmwkwok · May 7, 2024, 5:36am

My response first:

That’s fine. You may remove the print in subsequent notebooks, thanks.

I think it’s fine!

No problem at all! So this small dataset has a higher density, which is good!

Thanks. You may remove these prints as well.

Yes! It is a check for you and me to see. Now we know that every user was represented at least once in the training set.

You may remove this check too.

rmwkwok · May 7, 2024, 5:51am

This post is about making things faster. All of your subsequent experiments will benefit from that.

However, it is fine that you skip this.

Below are some places I would change. They have different levels of impact on the speed. The general idea is to embrace the TensorFlow way, use tf.float32, and remove unnecessary operations.

decorate with @tf.function

@tf.function
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
   ...

Initialize W with shape (num, num_users) so we do not need to call tf.transpose every time. This saves an operation. It requires changes in the definition of W, in cofi_cost_func_v, and in the code for making predictions.
Use only TensorFlow objects for training, and use tf.float32

# Here we convert numpy arrays into tensors, and use tf.float32. We will use these tensors for the training.
Ynorm_train_tf = tf.convert_to_tensor(Ynorm_train, dtype=tf.float32)
Ynorm_test_tf = tf.convert_to_tensor(Ynorm_test, dtype=tf.float32)
R_train_tf = tf.convert_to_tensor(R_train, dtype=tf.float32)
R_test_tf = tf.convert_to_tensor(R_test, dtype=tf.float32)

...


    W = tf.Variable(tf.random.normal((num, num_users),dtype=tf.float32),  name='W')
    X = tf.Variable(tf.random.normal((num_movies, num),dtype=tf.float32),  name='X')
    b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float32),  name='b')

calculate test cost only when needed

        if iter % 20 == 0:
            cost_value_test = cofi_cost_func_v(X, W, b, Ynorm_test_tf, R_test_tf, lambda_)
            ...

declare training step as a tf.function

for num in num_features:
    
    ...

    # Instantiate an optimizer.
    optimizer = keras.optimizers.Adam(learning_rate=1e-1)


    train_l = []
    test_l = []

    # Define this `train_step` function inside the for loop to avoid error. There is a neater way, but let's live with it for now.
    @tf.function
    def train_step(X, W, b, Ynorm_train_tf, R_train_tf, lambda_, optimizer):
        # Use TensorFlow’s GradientTape
        # to record the operations used to compute the cost 
        with tf.GradientTape() as tape:
    
            # Compute the cost (forward pass included in cost)
            cost_value = cofi_cost_func_v(X, W, b, Ynorm_train_tf, R_train_tf, lambda_)
            
        # Use the gradient tape to automatically retrieve
        # the gradients of the trainable variables with respect to the loss
        grads = tape.gradient( cost_value, [X,W,b] )
    
        # Run one step of gradient descent by updating
        # the value of the variables to minimize the loss.
        optimizer.apply_gradients( zip(grads, [X,W,b]) )
    
        return cost_value
    
    # Loop over iterations using the `train_step` defined per each `num_features` round
    for iter in range(iterations):

        cost_value = train_step(X, W, b, Ynorm_train_tf, R_train_tf, lambda_, optimizer)

Feel the difference and the feasibility of doing more experiments!

Once you feel the difference, you will be motivated to spend time on this kind of speed optimization for the rest of your life
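
To make the W layout change concrete, here is a plain NumPy sketch of a regularized squared-error cost with W stored as (num_features, num_users). It is not the notebook's TensorFlow function; `cofi_cost_np` is a hypothetical name, and the form of the cost is assumed to follow the vectorized version from the course assignment.

```python
import numpy as np


def cofi_cost_np(X, W, b, Y, R, lambda_):
    """Regularized cost; X is (movies, features), W is (features, users)."""
    # With W stored as (features, users), X @ W needs no transpose.
    err = (X @ W + b - Y) * R  # R masks out the unrated entries
    reg = (lambda_ / 2) * (np.sum(X ** 2) + np.sum(W ** 2))
    return 0.5 * np.sum(err ** 2) + reg


# Toy check: 2 movies, 2 users, 2 features.
X = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.zeros((1, 2))
Y = np.array([[2.0, 5.0], [5.0, 3.0]])
R = np.array([[1, 0], [0, 1]])

cost = cofi_cost_np(X, W, b, Y, R, lambda_=2.0)
print(cost)
```

The same change applies to predictions: with this layout they are `X @ W + b` rather than `X @ W.T + b`.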

rmwkwok · May 7, 2024, 6:03am

Here comes the main dish - part 1!

Let’s not look at the result now because we first need to do it right

First, I can see why you would have skipped (1): if I were you, I would also have wanted to see the effect of one change at a time. Training speed might have been a problem too. (If you have other reasons, let me know, just to share how we think about things.)

I say we need to apply (1) no matter what, for two reasons:

If you replace the test set with real-world data, you will see that we cannot have a separate mean for it, because we do not have the ratings; the ratings are what we have to predict.
The model is trained to predict ratings on the scale set by Ymean_train, so whatever ratings the model predicts have to be "denormalized" with Ymean_train. This idea was also taught in courses 1 & 2.

So, this post is about asking you to apply (1).
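
A small sketch of what applying (1) means in practice, using made-up toy arrays rather than anything from the notebook: the per-movie mean comes from the training ratings only, and the same mean is used to normalize the test ratings and to denormalize predictions.

```python
import numpy as np

# Toy data: 2 movies x 3 users, with train and test ratings kept disjoint.
Y_train = np.array([[5.0, 3.0, 0.0],
                    [0.0, 2.0, 4.0]])
R_train = np.array([[1, 1, 0],
                    [0, 1, 1]])
Y_test = np.array([[0.0, 0.0, 4.5],
                   [1.0, 0.0, 0.0]])
R_test = np.array([[0, 0, 1],
                   [1, 0, 0]])

# Per-movie mean from TRAINING ratings only.
rated = R_train.sum(axis=1, keepdims=True)
Ymean_train = (Y_train * R_train).sum(axis=1, keepdims=True) / (rated + 1e-12)

# Normalize the test ratings with the training mean, only where they exist.
Ynorm_test = Y_test - Ymean_train * R_test

# Stand-in for model output on the normalized scale (all zeros here).
p_norm = np.zeros_like(Y_test)
# Denormalize with the same training mean.
p = p_norm + Ymean_train
print(Ymean_train.ravel(), p[0, 2], p[1, 0])
```

With all-zero normalized predictions the model simply predicts each movie's training mean, which is a useful baseline to compare the test MAE against.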

rmwkwok · May 7, 2024, 6:33am

Main dish part 2

With Ymean_test (which, as explained, is something we are not supposed to have), there is a chance that the test score will look better than it should.
The ratings range from 0.5 to 5 in increments of 0.5. In other words, an MAE of 0.5 to 1.0 means an error of one to two steps. This is not bad!

Some final remarks for the last few posts

Thanks for tidying up the notebook! As we build on top of the current code, things can grow too large… It would be wonderful if the organization of the code and everything else stays well maintained.
I agree with Tom on more experiments around lambda.
I think with what we have covered so far, your notebook has achieved what it was meant to.

Cheers,
Raymond

Nevermnd · May 7, 2024, 7:29am

@paweldro Dear Pawel,

Also, I am not sure if it is referenced in that course, but you might find it useful to read the related papers on the BellKor solution to the Netflix Prize:

I mean, apart from the outsized contribution of SVD (Singular Value Decomposition), what is most interesting to me is that a bit of feature engineering turns out to be useful, and that in the end they used an ensemble, so not just ‘one’ method.

paweldro · May 7, 2024, 7:28pm

Thank you very much! I have implemented your tips and am very impressed. Previously the calculations took me 15-20 minutes, now the same takes about two minutes. I will definitely use these tips in future implementations.

paweldro · May 7, 2024, 7:39pm

Now I fully understand why it is necessary. Thanks a lot. Corrected!

import numpy as np

def normalizeRatings(Y, R):
    """
    Preprocess data by subtracting mean rating for every movie (every row).
    Only include real ratings R(i,j)=1.
    [Ynorm, Ymean] = normalizeRatings(Y, R) normalizes Y so that each movie
    has a rating of 0 on average. Unrated movies then have a mean rating (0).
    Returns the mean rating in Ymean.
    """
    Ymean = (np.sum(Y*R, axis=1) / (np.sum(R, axis=1) + 1e-12)).reshape(-1, 1)
    Ynorm = Y - np.multiply(Ymean, R)
    return Ynorm, Ymean

Ynorm_train, Ymean_train = normalizeRatings(Y_train, R_train)
# Subtract the training mean only where a test rating exists (R_test, not R_train).
Ynorm_test = Y_test - np.multiply(Ymean_train, R_test)

And for denormalization:

ptrain = p + Ymean_train

paweldro · May 7, 2024, 8:04pm

I will share tests from the revised version of the notebook:
Screenshot from 2024-05-07 21-54-07

Now the results seem more reasonable. I am very grateful for your help, thanks!
I attach the current version of the notebook, maybe it will be useful to someone.
recomender_fix_2.ipynb (256.5 KB)

Now, following TMosh’s advice, I will work on lambda

Nevermnd · May 7, 2024, 8:19pm

@paweldro I have to admit I haven’t dug into your code, though you seem to be progressing. Out of curiosity, though, I do have to ask what exactly you mean by ‘number of features’ and how exactly you are magically increasing those (?)

Perhaps you mean some other term ? Or in an unsophisticated sort of way, I like to think of ‘features’ as the points on which the data ‘pivots’, and in the end, you only have so many of those (Unless you’re running like a full FFT and just want to totally overfit the whole thing).

Just curious.

paweldro · May 7, 2024, 8:24pm

Hello Nevermnd!

It wasn’t mentioned, but I’m pleased to learn about it.
Thank you for the papers! Feature engineering is a topic I would like to explore more. I have also heard that many people who win competitions on Kaggle use ensemble.

paweldro · May 7, 2024, 9:31pm

Let me explain! If I’m making a mistake I’d be glad to hear about it.

I use this cost function:
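
For reference, the regularized collaborative filtering cost as taught in the course has the form below. This assumes the notebook keeps the course's formulation unchanged, with $\mathbf{x}^{(i)}$ the feature vector of movie $i$, $\mathbf{w}^{(j)}$ and $b^{(j)}$ the parameters of user $j$, and $r(i,j)=1$ where a rating exists:

```latex
J(\mathbf{x}, \mathbf{w}, b) =
  \frac{1}{2} \sum_{(i,j):\, r(i,j)=1}
    \left( \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} - y^{(i,j)} \right)^2
  + \frac{\lambda}{2} \sum_{j} \sum_{k} \left( w^{(j)}_k \right)^2
  + \frac{\lambda}{2} \sum_{i} \sum_{k} \left( x^{(i)}_k \right)^2
```

The number of features is the length of each $\mathbf{x}^{(i)}$ and $\mathbf{w}^{(j)}$, that is, the range of the index $k$.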

The parameters X, W and b are initialized as arrays:

for num in num_features:
    # Set Initial Parameters (W, X), use tf.Variable to track these variables
    tf.random.set_seed(1234) # for consistent results
    W = tf.Variable(tf.random.normal((num,  num_users),dtype=tf.float32),  name='W')
    
    X = tf.Variable(tf.random.normal((num_movies, num),dtype=tf.float32),  name='X')
    
    b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float32),  name='b')

Where num_features is:

num_features = [1, 10, 50, 500, 1000, 2000]

The for loop trains six models for all values (1, 10, 50, 500, 1000, 2000). The results for all six models are then displayed.
Now I’m changing the approach and I’m going to choose a fixed size X (num_movies = 9724, num_features = 1000) and adjust the lambda parameter:

num_features = 1000
lambda_ = [0, 0.01, 0.1, 1, 10, 100]

for lamb in lambda_:
    # Set Initial Parameters (W, X), use tf.Variable to track these variables
    tf.random.set_seed(1234) # for consistent results
    W = tf.Variable(tf.random.normal((num_features,  num_users),dtype=tf.float32),  name='W')
    
    X = tf.Variable(tf.random.normal((num_movies, num_features),dtype=tf.float32),  name='X')
    
    b = tf.Variable(tf.random.normal((1,          num_users),   dtype=tf.float32),  name='b')

Finally, the cost curves for the six models with different lambda values will be displayed. Below is what it looks like at this point; from what I can see, it still needs some work:

TMosh · May 7, 2024, 9:40pm

I’m concerned about the shape of the lambda = 0 curve.
Have you verified the initial Adam optimizer learning rate?
