Question about r(i,j) in a binary-label recommender system

I have a question about the r(i,j) matrix used in the course.
The r(i,j) matrix is defined to be 1 where the user rated the movie and 0 where the user did not rate the movie.
This is easy to understand in a rating system (e.g., ratings from 0.5 to 5), but how should the r(i,j) matrix be used in the binary-label system introduced in the 4th video of week 2?
In a binary-label system there are only two options, for example click (y=1) and no click (y=0). Should we then define r(i,j) to be the same as y, or should we not use the r(i,j) matrix in a binary-label system at all?

Hello @lxd_1986001,

Below is a screenshot from the Course 3 Week 2 Video 4 which shows that r(i, j) is also used.

As you said, r(i, j) is different from y(i, j): the former records “rated or not” and the latter the “rated value”. If, in your case, you only have “clicked or not”, then r(i, j) = 1. Note that the purpose of r(i, j) is to screen off invalid user-movie pairs from the cost function so that the model isn’t optimized for those pairs, whereas y(i, j) is the label. They therefore serve different purposes.
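To make the screening role of r(i, j) concrete, here is a minimal NumPy sketch of a binary-label cost that is summed only over pairs with r(i, j) = 1. The function name `masked_binary_cost`, the toy shapes, and the random data are assumptions for illustration and are not taken from the course code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def masked_binary_cost(X, W, b, Y, R, eps=1e-12):
    """Binary cross-entropy summed only over pairs with R == 1.

    X: (num_items, k) item features, W: (num_users, k) user parameters,
    b: (1, num_users) user biases, Y and R: (num_items, num_users).
    """
    P = sigmoid(X @ W.T + b)            # predicted probability for every pair
    P = np.clip(P, eps, 1.0 - eps)      # keep log() finite
    loss = -(Y * np.log(P) + (1 - Y) * np.log(1 - P))
    return float(np.sum(loss * R))      # pairs with R == 0 contribute nothing

# Hypothetical toy data.
rng = np.random.default_rng(0)
num_items, num_users, k = 4, 3, 2
X = rng.normal(size=(num_items, k))
W = rng.normal(size=(num_users, k))
b = np.zeros((1, num_users))
Y = rng.integers(0, 2, size=(num_items, num_users)).astype(float)

R_all = np.ones_like(Y)                 # every pair has a valid label
R_partial = R_all.copy()
R_partial[0, 0] = 0.0                   # pretend pair (0, 0) has no valid label

Y_flipped = Y.copy()
Y_flipped[0, 0] = 1.0 - Y[0, 0]         # change the label of the screened-off pair

cost_partial = masked_binary_cost(X, W, b, Y, R_partial)
cost_partial_flipped = masked_binary_cost(X, W, b, Y_flipped, R_partial)
cost_all = masked_binary_cost(X, W, b, Y, R_all)
cost_all_flipped = masked_binary_cost(X, W, b, Y_flipped, R_all)
```

With `R_partial`, flipping the label at the screened-off pair leaves the cost unchanged. With `R_all`, the same flip changes the cost, because every pair is treated as a valid label.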


Thank you @rmwkwok.
Now I understand that we should still keep r(i,j) in a binary-label system, while being careful about its purpose.
Thank you!

Hello @lxd_1986001,

I must not have had a clear mind last night. If you only have “clicked or not”, then all r(i,j) are one. Therefore, you can choose to keep it for consistency, or ignore it for simplicity. I will edit my previous answer accordingly.


Thank you @rmwkwok for the clarification. I think this makes sense.

Thank you!

You are welcome @lxd_1986001!

Hi Raymond (@rmwkwok),

Thank you for your explanation here. A follow-up question:

If our dataset is sparse (75% did not have an interaction) and we are dealing with a binary-label situation as mentioned above, setting all values of r(i,j) = 1 means that we compute costs for the entire dataset, including the 75% who did not have an interaction. In such a situation, would it be OK to still train on the entire dataset, or do you recommend removing the 75% who did not have an interaction from training? I am asking because I see (from blog posts, etc.) that people commonly exclude users who had few or no interactions from the training data.


It depends on your model’s formulation. If we follow the assignment’s formulation, then we can’t remove them. If you would like to discuss the feasibility of removing them, then it’s better to base the discussion on the content of one of those blog posts.

A possible alternative approach that never considers the cases where r(i,j)=0 is, for example, to have an embedding layer for users and an embedding layer for items, and to build the model so that we only pick positive pairs (i.e. r(i,j)=1), take the dot product of their embeddings (plus a bias), and minimize the difference between that result and the true rating.
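As a rough illustration of that alternative, the NumPy sketch below computes a squared-error cost from an explicit list of observed pairs and checks that it matches the dense, R-masked version. The names `pairwise_cost`, `user_emb`, `item_emb`, and `user_bias`, the per-user bias, and the toy data are all assumptions for illustration, and plain arrays stand in for trainable embedding layers.

```python
import numpy as np

def pairwise_cost(user_emb, item_emb, user_bias, user_idx, item_idx, ratings):
    """Squared-error cost over an explicit list of observed (item, user) pairs."""
    u = user_emb[user_idx]              # (num_pairs, k): one user row per pair
    v = item_emb[item_idx]              # (num_pairs, k): one item row per pair
    pred = np.sum(u * v, axis=1) + user_bias[user_idx]  # one dot product per pair
    return 0.5 * float(np.sum((pred - ratings) ** 2))

# Hypothetical toy data: roughly 25% of the pairs are observed.
rng = np.random.default_rng(1)
num_items, num_users, k = 5, 4, 3
item_emb = rng.normal(size=(num_items, k))
user_emb = rng.normal(size=(num_users, k))
user_bias = rng.normal(size=num_users)

R = (rng.random((num_items, num_users)) < 0.25).astype(float)
Y = rng.integers(1, 6, size=(num_items, num_users)).astype(float) * R

# Keep only the pairs with r(i,j) = 1; the others are never touched.
item_idx, user_idx = np.nonzero(R)
ratings = Y[item_idx, user_idx]

sparse_cost = pairwise_cost(user_emb, item_emb, user_bias, user_idx, item_idx, ratings)

# Dense reference: compute every pair, then screen off with R.
dense_cost = 0.5 * float(np.sum(((item_emb @ user_emb.T + user_bias - Y) * R) ** 2))
```

Both costs agree, but the pairwise version only computes as many dot products as there are observed pairs, which is why this formulation can drop the r(i,j)=0 cases entirely.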


@rmwkwok, thank you for your swift reply.

  1. Is there a reason why we can’t remove those without ratings/interactions from the assignment (other than for the purpose of consistency in grading)?

  2. What is the general rule of thumb when it comes to including or excluding the 75% who did not have interactions in the case of binary labels? I’ve tried both approaches on the dataset I have at work, and I found that including the 75% worked much better than excluding them. I suspect this was due to the mean-normalization step, which recommended products that are popular on average to those 75% of users. However, I just wanted to know your opinion on this from a broader context.

Best regards,


  1. The assignment uses cofi_cost_func_v in the training process, which computes part of the cost like this: (tf.linalg.matmul(X, tf.transpose(W)) + b - Y)*R. The matrix multiplication has no choice but to compare all possible pairs of users and movies. Therefore we can’t remove them; we can only use R to screen them off after we have computed them.

  2. Those 75% don’t provide any information in the assignment’s problem; they only make the computation more expensive. There is no general rule of thumb for including them. To include them, I think we need a justification. The justification behind the assignment, I think, is that it makes the code simpler.
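The expression in point 1 can be mimicked in NumPy to see why all pairs are computed before R is applied. This is only a sketch: `dense_masked_error` and the toy shapes are hypothetical, and `X @ W.T` stands in for `tf.linalg.matmul(X, tf.transpose(W))`.

```python
import numpy as np

def dense_masked_error(X, W, b, Y, R):
    """NumPy analogue of (matmul(X, W^T) + b - Y) * R for all movie-user pairs."""
    return (X @ W.T + b - Y) * R

# Hypothetical toy data: 4 movies, 3 users, only 3 observed pairs.
rng = np.random.default_rng(2)
num_movies, num_users, k = 4, 3, 2
X = rng.normal(size=(num_movies, k))
W = rng.normal(size=(num_users, k))
b = rng.normal(size=(1, num_users))

R = np.zeros((num_movies, num_users))
R[0, 0] = R[1, 2] = R[3, 1] = 1.0
Y = rng.integers(1, 6, size=(num_movies, num_users)).astype(float) * R

E = dense_masked_error(X, W, b, Y, R)
num_computed = E.size                   # every pair goes through the matmul
num_used = int(R.sum())                 # only these pairs survive the mask

# Whatever value sits in Y where R == 0 is irrelevant to the result.
Y_other = np.where(R == 1, Y, 99.0)
E_other = dense_masked_error(X, W, b, Y_other, R)
```

Here 12 predictions are computed but only 3 contribute to the error, and replacing the unobserved entries of Y with an arbitrary value does not change the masked result.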

What do you assign to y(i,j) for those with r(i,j)=0? You don’t really need to tell me everything because it may be confidential. However, you may want to think about the rationale behind your assignment of those y(i,j), because those reasons can be your justification for including them.


For example, if no interaction (i.e. r(i,j)=0) does imply something in your business process, then you may want to include some of them. Is this considered a general rule of thumb :thinking: ?

Thank you @rmwkwok for the clarification