Also, I recommend you fix the vertical scale so it is the same in all your plots. This makes comparisons much easier.

Hello @paweldro and @Nevermnd, some people also call the “number of features” the “embedding size”, “number of latent features”, or “number of factors”. We are actually assuming one vector for each user and another vector for each movie. The size is an adjustable hyperparameter - the larger the size, the longer the vector and the richer the representation.

However, being “richer” is not necessarily a good thing, because it is computationally more expensive and, more importantly, it means more trainable parameters, which makes the model more likely to overfit the training data.
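As a rough numpy sketch of this idea (the sizes below are made up for illustration, not from the assignment):

```python
import numpy as np

# Hypothetical sizes for illustration only
num_users, num_items, num_features = 3, 4, 2  # num_features = "embedding size"

rng = np.random.default_rng(0)
X = rng.normal(size=(num_items, num_features))  # one vector per item
W = rng.normal(size=(num_features, num_users))  # one vector per user
b = np.zeros((1, num_users))                    # per-user bias

# Predicted rating for every (item, user) pair: dot product of the two vectors
predictions = X @ W + b
print(predictions.shape)  # (num_items, num_users)
```

Increasing `num_features` lengthens both vectors, which is exactly where the extra trainable parameters (and the overfitting risk) come from.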

Cheers,

Raymond

Hello @paweldro,

It is amazing to see that you are making so much progress, and I am glad to hear that you felt the speed difference!

Are you satisfied? Still want more? If so, I can suggest another two changes - they are about how you normalize the ratings.

Currently, we are doing this -
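My best reconstruction of the formula in the lecture's notation (assuming $\mu_i$ is the mean of item $i$'s observed ratings, added back after mean normalization):

$$\hat{y}^{(i,j)} = \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} + \mu_i$$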

I hope the addition of the last term does not look strange to you!

(PS: I prefer to call it an item instead of a movie, so you know the translation.)

## Change 1

Having said the last term is a fixed one, we can actually let the model learn it, in other words -
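A reconstruction of the idea, using $b_i$ for the now-trainable per-item bias that replaces the fixed $\mu_i$:

$$\hat{y}^{(i,j)} = \mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} + b_i$$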

Consequently, no mean normalization (and de-normalization) will be required!

## Change 2

Observing that the valid ratings are in the range from 0.5 to 5, this is a perfect case for us to use a sigmoid!
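A reconstruction of what this looks like - the same linear model wrapped in a sigmoid $\sigma$ and rescaled into the 0.5-to-5 range:

$$\hat{y}^{(i,j)} = 4.5\,\sigma\!\left(\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)} + b^{(j)} + b_i\right) + 0.5$$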

Note that the use of the sigmoid does not mean we are converting this into a logistic regression problem - we are NOT! We still keep the same squared error loss function; we only wrap the model output with a sigmoid.

You may notice that there are two ways to implement the \times 4.5 + 0.5 part - either put it into the base model, or add a normalization (and denormalization) step.
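A quick pure-Python sanity check (my own sketch, not the assignment's code) that the \times 4.5 + 0.5 wrapper maps any real-valued output into the valid rating range:

```python
import math

def scaled_sigmoid(z):
    # sigmoid squashes z into (0, 1); *4.5 + 0.5 rescales that into (0.5, 5)
    return 4.5 / (1 + math.exp(-z)) + 0.5

print(scaled_sigmoid(-20))  # just above 0.5
print(scaled_sigmoid(0))    # 2.75, the midpoint of the range
print(scaled_sigmoid(20))   # just below 5
```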

## Change 1+2 OR Change 1/2 ?

You may implement both or either of them. However, without change 1, the formula in change 2 may need to be rethought.

Change 1 is likely to affect the performance, while change 2 affects both the performance and the speed. As for whether each effect is good or bad, and how good or bad it is, it will be up to you to decide whether to find out.

If you just implement 1+2 (which is definitely NOT a bad choice), you will then only know the combined effect and not the individual ones.

## Discussion

There is a reason for Andrew to have kept that term (in change 1) fixed instead of trainable. If you still remember the lecture videos well, it might be a good game to make some guesses during model training. Also, are the two versions (fixed & trained) the same?

The use of the sigmoid actually relieves the model of the burden of possible errors from going beyond the allowable range, which is 0.5 to 5. You have a good chance of saving a lot of iterations because of that!! (*Saving time again!*)

If you add a normalization step for change 2, you will be subtracting 0.5 from `Y`, leaving some negative elements in the final `Y` matrix. Another game - why isn’t that a problem?

Saving time is super important! Performance boost while saving time is super super wonderful!

Cheers,

Raymond

PS: I left out the R matrix in all of the above formulas because it is only needed for the training process. I mean, R is still important!
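To illustrate R's role in training, here is a toy numpy sketch of my own (the matrices are made up):

```python
import numpy as np

# Y holds the ratings; R[i, j] = 1 if user j rated item i, else 0
Y = np.array([[4.0, 0.0],
              [0.0, 2.5]])
R = np.array([[1, 0],
              [0, 1]])

pred = np.full_like(Y, 3.0)            # a dummy prediction matrix
squared_err = ((pred - Y) * R) ** 2    # R masks out the unrated entries
print(squared_err.sum())               # (3-4)^2 + (3-2.5)^2 = 1.25
```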

@paweldro, I didn’t show what code changes are needed for the above two changes, but here is the deal - if you implement them, then I will also make a notebook!

Cheers!

Hello TMosh,

Thank you for this advice, now the results are definitely more readable.

I tested different learning rate variants (num_feature = 1000). It is hard for me to determine a single learning rate value that works for all variants of the lambda parameter.

If I choose:

```
optimizer = keras.optimizers.Adam(0.001)
```

I’m getting:

For smaller learning rate values:

```
optimizer = keras.optimizers.Adam(0.0001)
```

I’m getting:

Here we can see that it does not bring good results.

I also tested SGD (learning rate = 0.0001 gives me the best results):

```
optimizer = keras.optimizers.SGD(0.0001)
```

For which the results look like this:

But here I did not achieve good results for lambda = 1.

During testing, I came to the conclusion that num_feature = 1000 is too high for a dataset of this size (I get the best results for lambda > 10). A high value of this parameter also negatively affects the speed of the algorithm.

I decided to test num_feature = 10 as well:

I was also curious what the results would look like for num_feature = 1:

Here I had to change the learning rate:

```
optimizer = keras.optimizers.Adam(0.0003)
```

This raised some doubt. It occurred to me that, to solve this problem, the best value of num_feature might be 1. Shouldn’t I get visibly better results for larger values of num_feature?

Hi @rmwkwok,

Thank you for the clarification. I had a little problem with the proper naming of this parameter. It definitely seems clearer to me now.

Thank you kindly!

I am more than satisfied. The difference in speed is beyond description. I definitely want more

I am taking on the implementation of the proposed changes. As soon as I manage to do it I will share the results!

If you change the number of features, you’ll probably also need to change the number of iterations.

Sorry, there’s too much information in these recent posts for me to comprehend quickly.

I intentionally chose too large a number of iterations (20000) so that it would be possible to test all variants. Should I increase it?

It’s no problem at all. I think I included a little too much information in the post

You certainly seem to be twisting all of the correct knobs. It’s difficult to know when to stop.

Hi @rmwkwok,

I have been working on implementing the changes you mentioned, but along the way I ran into a problem with Change 1. I will briefly describe what actions I took and what the results look like:

**Change 0:**

I decided to use the best parameters from the earlier model as the initial settings, so that I have a reference point:

```
num_features = 10
```

```
optimizer = keras.optimizers.Adam(0.1)
```

```
iterations = 200
lambda_ = [14]
```

**Change 1:**

I removed the mean normalization and added the appropriate trainable parameter to the cost function:

```
@tf.function
def cofi_cost_func_v(X, W, b, bn, Y, R, lambda_):
    j = (tf.linalg.matmul(X, W) + b + bn - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J
```

And here I have to stop for a moment because something strange is happening:

I got better results, but the loss for the test set is lower than the loss for the training set. Something seems wrong here, but adding the parameter was the only change. Should I be looking for a bug in the code?

Of course, I am attaching a notebook with the change so far.

recomender_fix_3_change1.ipynb (88.6 KB)

Hello @paweldro,

Sorry for getting back late, and as always, I am so happy to see your progress! There is a lot ahead! Now that you are implementing it, I had better get mine ready too!

We need to make sure it is bug-free, but a lower test loss curve does not necessarily mean there is a problem. Think of it this way: you are seeing the average test loss, but inside, some test samples are going to do better than others, right? Therefore, which one is higher depends on the trained model and the difference in sample distributions between your train and test sets. What we don’t want to see from the curves is a sign of overfitting/underfitting, which was covered by Andrew in his lectures.

Cheers,

Raymond

Pawel @paweldro,

Since you have done change 1, let me suggest putting change 2 aside for now, and adding the following terms as “change 3”:

Note that the more changes you add, the more we go away from the lecture, which is exciting, isn’t it?

- A new \mu term.
- With the new \mu term taking care of the overall bias, we can safely add the two b's to the list of regularized terms. So we regularize all of X, W, and the b’s with the same lambda.

Some people call the group of the last 3 terms the “baseline model” in CF, for it represents the minimal interaction between the user-based weights and the item-based weights. Then we add the “latent factors” that contribute more interactions between users and items.
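Putting change 3 together (a reconstruction; $\mu$ is the new global term, $b^{(j)}$ and $b_i$ the user and item biases):

$$\hat{y}^{(i,j)} = \underbrace{\mu + b^{(j)} + b_i}_{\text{baseline}} + \underbrace{\mathbf{w}^{(j)} \cdot \mathbf{x}^{(i)}}_{\text{latent factors}}$$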

I recommend change 3 over change 2, if you have paused at change 1.

Hope you will be seeing more improvements!

Cheers,

Raymond

Hello @paweldro,

Regarding your observation about the test curve, you might do a CV (cross-validation) to see if the observation is consistent across different train/test sets. A quicker check than implementing CV is to re-run your notebook with a different random seed (the one responsible for the train/test split), and repeat this five times. Also, are you still using the “sum of squared errors”? If so, the loss is proportional to the dataset size, and don’t you have a larger training set?
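A minimal sketch of the quicker check (the helper name is hypothetical; your notebook's split code will differ):

```python
import random

def split_ratings(pairs, test_frac=0.2, seed=0):
    # Shuffle the (user, item, rating) tuples with the given seed, then split
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

# Toy data; repeat with several seeds to see whether the
# train/test loss gap is consistent across splits
pairs = [(u, i, 3.0) for u in range(10) for i in range(10)]
for seed in range(5):
    train, test = split_ratings(pairs, seed=seed)
    print(seed, len(train), len(test))
```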

Pawel, it is right to check if there is a bug, and it is right to pause and think about the test curve, but since machine learning is also an art, what I can do is make suggestions, and the rest will be up to you.

I will most likely be back on Friday with my notebook (from scratch). Since it is boring to just have what you already have, I am going to also add the “implicit feedback” and “neighbourhood” components. Together with the “baseline” and “latent factors” components, they form a collaborative filtering algorithm that I believe is a pretty entry-level thing for anyone on this path. Since my notebook is unlikely to explain them, you may read this 2008 paper or google to find out more. If you have time, it is a very good idea to implement any of those components first, because then you will be in a better position to understand what my notebook does.

I have enjoyed this thread and I am interested in making my notebook, so I will make it anyway. I know I have suggested a lot, and I would just like to tell you that it is **always** your choice whether to do what I suggest, when to do it, and whether to give me an update. I respect your choice!!

Cheers!

Raymond

PS: we have been doing batch training here, with the whole dataset in memory during training. Maybe it is a good idea for my notebook to do it mini-batch-wise, so it can work with the larger 25M dataset that I thought you were using at first?

Hello @rmwkwok,

Sorry for the late reply, I had a few busy days

Thank you for confirming that this does not necessarily mean a bug in the code.

I checked everything carefully and found nothing that looks like a potential bug.

I tested and the results are consistent for different random seeds.

I hadn’t thought of that, but now it seems quite logical to me. Thanks for all the explanations and advice! I will keep them in mind for future implementations.

@rmwkwok

I chose to test all the changes you suggested so I could do a little comparison!

I will present the changes and results below. Under each change I will also insert the code, maybe it will be useful to someone in the future.

**Change 0:**

```
@tf.function
def cofi_cost_func_v(X, W, b, Y, R, lambda_):
    j = (tf.linalg.matmul(X, W) + b - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J
```

I decided to show these initial parameters again for continuity. They will be the same for each subsequent change so the comparison stays meaningful.

```
num_features = 10
```

```
optimizer = keras.optimizers.Adam(0.1)
```

```
iterations = 200
lambda_ = [14]
```

In addition, I measured the training **time in seconds: 4.81**

Code: recomender_fix_3_nochange.ipynb (87.2 KB)

**Change 1:**

```
@tf.function
def cofi_cost_func_v(X, W, b, bn, Y, R, lambda_):
    j = (tf.linalg.matmul(X, W) + b + bn - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J
```

**Time: 5.28**

Code: recomender_fix_3_change1.ipynb (87.4 KB)

**Change 1+2:**

```
@tf.function
def cofi_cost_func_v(X, W, b, bn, Y, R, lambda_):
    f = tf.linalg.matmul(X, W) + b + bn
    j = (tf.math.sigmoid(f)*4.5 + 0.5 - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2))
    return J
```

**Time: 5.11**

Code: recomender_fix_3_change1+2.ipynb (86.6 KB)

**Change 3:**

```
@tf.function
def cofi_cost_func_v(X, W, b, bn, miu, Y, R, lambda_):
    j = (tf.linalg.matmul(X, W) + b + bn + miu - Y)*R
    J = 0.5 * tf.reduce_sum(j**2) + (lambda_/2) * (tf.reduce_sum(X**2) + tf.reduce_sum(W**2) + tf.reduce_sum(b**2) + tf.reduce_sum(bn**2))
    return J
```

**Time: 4.92**

Code: recomender_fix_3_change1+3.ipynb (88.9 KB)

**Summary:**

I realize that measuring time with this number of iterations may not be the best idea, but I wanted to show that despite the addition of the new trainable parameters (1, 1+2), the time has increased only slightly (this may matter, of course, with a larger dataset or more iterations), and you can see the improvement in MAE.

An interesting twist can be seen when the sigmoid function is applied (change 1+2) → fewer iterations are needed for convergence!

It seems to me that change 3 was the most effective: after a couple of attempts the time seems to be close to change 0, and the MAE is the best of all so far.

In addition, it retains the property of change 1+2 → convergence occurs faster.

Thank you very much for proposing these changes. I learned a lot and had a very good time applying them. It was great fun!

I hope this summary makes sense and maybe in the future it will be useful to someone making their own attempt to implement this algorithm.

I look forward to your notebook; it will be a good experience to see the work of someone more experienced on a similar topic. I’m curious about your results!

Thank you for the paper; I will read it before looking into your notebook so I have a better understanding of the topic.

As for the implementation of these components, I will try next week because I have a busy weekend ahead of me.

That’s a very good idea

Best regards,

Paweł

Hello Pawel @paweldro!

Yea… The train/test ratio is 4, right? So if we divide the train curve by 4, it would then be beneath the test one.

Interesting!!

Time = the running time to finish 200 epochs? This time measures the computational cost.

Another “time” to look at is how many iterations are needed to converge, right? The difference between the least and the most (4 vs 2) is quite big, but it is unclear whether the current hyperparameter settings already deliver the best convergence speed for the 4 models; I won’t expect a huge difference here.

Yes. And because the sigmoid does not allow predictions below 0.5 or above 5, you won’t get that error. Therefore, to compare the sigmoid’s MAE with the others’ more fairly, it is better to clip the others’ predictions to within 0.5 and 5.
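For example (a numpy sketch with made-up predictions):

```python
import numpy as np

# Hypothetical raw predictions from a model without the sigmoid wrapper
preds = np.array([-0.3, 2.7, 6.1, 4.9])

# Clip into the valid rating range before computing MAE, so the comparison
# with the sigmoid model (whose outputs already lie in range) is fairer
clipped = np.clip(preds, 0.5, 5.0)
print(clipped)  # [0.5 2.7 5.  4.9]
```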

I am currently on it! Actually I had been tied up too, but now I am on it. I hope it won’t be too long a wait. I will make it so that we can process the larger dataset with limited memory, so it will be overkill for our smaller dataset.

I can tell you that I am having fun, too!

Cheers,

Raymond

Hello, Pawel @paweldro,

How are you? It was a very fruitful month for me, and I hope it was the same for you!

I was finally able to wrap up a repo and share it with you! I took the chance to experiment with a few coding-style things and clear a few items on my to-do list, including plotting training curves like a pivot table, using dataclasses everywhere, making a simple Pipeline tool for fun, writing some Cython, and more…

I hope you will find something new there!

Cheers,

Raymond

PS: I have only tried the small movielens dataset but not yet the full one, as I am going to need my CPU time for something else, again…