Collaborative filtering training dataset question

Intuitively, collaborative filtering recommends items to users based on their common interactions with items. For example, if user A rated movies X and Y high and user B rated movie X high, then user B might also rate movie Y high. My question is: if the training dataset contains a lot of users who rated only one movie (let’s say 70% of the dataset), does that create noise in the model? Does the model benefit from those training points? Or are we better off removing some or all of those rows from the training set?

I don’t have a simple yes/no answer to your questions. Here are some points for consideration:

  1. It depends on whether your training set reflects the real-world situation. If, in the real world, 70% of people watched only that one movie, then I think your dataset is fine. Otherwise, it may be a sign that your sampling process was biased and you need to correct that.

  2. Whether that movie is advantageous. For this, you need to compare a model trained with that movie and a model trained without it.

Raymond

Thanks for the reply, Raymond.
Maybe I wasn’t clear in my question. I mean that a lot of users have only one interaction; in our example, each of them rated only one movie (which could be any movie in the set). So in the ratings matrix R, the column for such a user is all zeros except for row j, which corresponds to the one movie they rated.
Sorry if my question is still not clear.
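
To make the situation above concrete, here is a minimal sketch of how one might measure the share of single-interaction users. The `ratings` triples and the helper name `single_rating_share` are hypothetical and only serve as an illustration; a real dataset would be loaded from wherever the ratings are stored.

```python
from collections import Counter

# Hypothetical toy data: (user_id, movie_id, rating) triples.
ratings = [
    ("u1", "m1", 5), ("u1", "m2", 4),
    ("u2", "m1", 4),
    ("u3", "m3", 2),
    ("u4", "m2", 5), ("u4", "m3", 1), ("u4", "m1", 3),
    ("u5", "m2", 4),
]

def single_rating_share(triples):
    """Return the fraction of users who rated exactly one movie."""
    # Count how many ratings each user contributed.
    counts = Counter(user for user, _, _ in triples)
    singles = sum(1 for n in counts.values() if n == 1)
    return singles / len(counts)

share = single_rating_share(ratings)
print(f"{share:.0%} of users rated exactly one movie")
```

In this toy data, three of the five users (u2, u3, u5) have a single rating, so the share is 0.6.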

I think you have asked the questions clearly in your first and second posts, but my answer doesn’t change:

  1. Check whether your data sampling process is biased; if so, correct it.
  2. Experiment to find out whether the movie is advantageous or not.

My answers are action items :wink:

@Faraz_Zahabian, let’s think about it this way: even if I follow the flow of thought in your first post and remove some rows from the training set, how do I know how many rows I should remove? Right? My suggestion number 1 will first make sure we know what the world is really like (versus a training set, which is only a part of the world). That is extremely important because it will also guide us on what to “add”, not just what to “remove”. We can’t keep removing things.

My suggestion number 2 is the most reliable way to find out what the best thing to do is.
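
One way to run the experiment from suggestion number 2 is sketched below. It is only an illustration under stated assumptions: the toy `train` and `test` triples are made up, and the plain SGD matrix factorization (`train_mf`) is a stand-in for whatever collaborative filtering model is actually used. The idea is to train once on all rows and once without the single-rating users, then compare the error on the same held-out ratings.

```python
import math
import random
from collections import Counter

def train_mf(triples, k=2, epochs=200, lr=0.02, reg=0.05, seed=0):
    """Fit a plain matrix factorization with SGD; returns (mean, P, Q)."""
    rng = random.Random(seed)
    mean = sum(r for _, _, r in triples) / len(triples)
    # Small random factor vectors for every user and movie seen in training.
    P = {u: [rng.gauss(0, 0.1) for _ in range(k)] for u, _, _ in triples}
    Q = {m: [rng.gauss(0, 0.1) for _ in range(k)] for _, m, _ in triples}
    data = list(triples)
    for _ in range(epochs):
        rng.shuffle(data)
        for u, m, r in data:
            pu, qm = P[u], Q[m]
            err = r - (mean + sum(a * b for a, b in zip(pu, qm)))
            for f in range(k):
                pf, qf = pu[f], qm[f]
                pu[f] += lr * (err * qf - reg * pf)
                qm[f] += lr * (err * pf - reg * qf)
    return mean, P, Q

def rmse(model, triples):
    """RMSE on triples; unseen users or movies fall back to the global mean."""
    mean, P, Q = model
    total = 0.0
    for u, m, r in triples:
        pred = mean
        if u in P and m in Q:
            pred += sum(a * b for a, b in zip(P[u], Q[m]))
        total += (r - pred) ** 2
    return math.sqrt(total / len(triples))

def drop_single_rating_users(triples):
    """Keep only rows from users with more than one rating."""
    counts = Counter(u for u, _, _ in triples)
    return [t for t in triples if counts[t[0]] > 1]

# Hypothetical toy data: u4..u7 each rated a single movie.
train = [
    ("u1", "m1", 5), ("u1", "m2", 4),
    ("u2", "m1", 4), ("u2", "m3", 2),
    ("u3", "m2", 5), ("u3", "m3", 1),
    ("u4", "m1", 5), ("u5", "m2", 4), ("u6", "m3", 2), ("u7", "m1", 4),
]
# Held-out ratings, identical for both models so the comparison is fair.
test = [("u1", "m3", 2), ("u2", "m2", 4), ("u3", "m1", 5)]

rmse_full = rmse(train_mf(train), test)
rmse_pruned = rmse(train_mf(drop_single_rating_users(train)), test)
print(f"RMSE with all rows:             {rmse_full:.3f}")
print(f"RMSE without single-rating rows: {rmse_pruned:.3f}")
```

A dataset this small cannot settle the question either way; the point is the shape of the comparison. Note that in a factorization model like this one, a single-rating user still updates the factor vector of the movie they rated, so those rows are not automatically inert. On real data, the same comparison would ideally be repeated over several splits and seeds before deciding what to keep.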

Faraz, I can only suggest you action items.

Cheers,
Raymond

fair enough :smiley:
Thanks