Content-Based Filtering - Training Data

Hello everyone. I was hoping that someone could provide a bit of clarity on the training datasets in the content-based filtering lab. The lab indicates that "the reduced dataset has n_u = 397 users, n_m = 847 movies and 25521 ratings." However, when I look at the user matrix (e.g. user_train) and movie item matrix (item_train), I noticed that the dimensions of those matrices are 50884 x 17.

The column dimension (m = 17) obviously corresponds to the user and item features. But why does the row dimension (n=50884) of these matrices not correspond to the number of ratings (25521)?


Hello Dan,

Quoted from the assignment:

Some ratings are repeated to boost the number of training examples of underrepresented genre's.

If you de-duplicate the data, you get 25521 unique rows.
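For anyone who wants to check this themselves, here is a minimal sketch of the de-duplication idea. It uses small made-up arrays rather than the lab's data, and the names `user_rows` and `item_rows` are hypothetical stand-ins for user_train and item_train. It assumes that a training example is the pairing of a user row with the item row at the same index, so the two are stacked side by side before counting unique rows.

```python
import numpy as np

# Toy stand-ins for the lab's arrays (hypothetical values, not the real data).
user_rows = np.array([[1, 4.0], [1, 4.0], [2, 3.5], [2, 3.5], [2, 3.5]])
item_rows = np.array([[10, 1.0], [10, 1.0], [11, 0.0], [11, 0.0], [12, 0.0]])

# Row i of each array describes the same rating, so join them column-wise.
paired = np.hstack([user_rows, item_rows])

# np.unique with axis=0 collapses identical rows into one.
unique_pairs = np.unique(paired, axis=0)

n_total = paired.shape[0]
n_unique = unique_pairs.shape[0]
print(n_total, n_unique)  # 5 3
```

Applied to the lab's arrays, the same comparison of total rows against unique rows is what would show the gap between 50884 and 25521.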


PS: Your question reminds me of a possible cause behind a previous question from another learner, thank you! :wink:


Thank you! Very helpful.

Just a quick follow-up question... Is this approach of boosting the number of training examples of underrepresented genres a good practice in general? I ask because I am thinking about a similar approach at work for mapping customers to different ad content. We definitely have situations where there are "categories" of ads that are underrepresented.


Hello Dan! @danielhopkins80

How are you doing? I am sorry that I must have overlooked this thread. How is your ad model going? Without knowing how your model works and how your dataset behaves, I can't make a more specific suggestion; I can only say that it's worth giving it a try and then evaluating the result. If you would like to, we can continue the discussion privately.

On the other hand, it's also worth letting those underrepresented but large categories gain more exposure, so that the model can actually learn something from them. In reinforcement learning, we don't always exploit the best move that our model suggests; sometimes we take a random move to explore what we can gain from it. What I am saying is that a balanced strategy is important. Reinforcement learning has to deal with a world full of unknowns, with no prior knowledge and not even data, and your case could be similar.
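To make the explore/exploit idea concrete, here is a small sketch of an epsilon-greedy choice, one common way to mix the two. The function name `epsilon_greedy`, the example scores, and the value of epsilon are all illustrative assumptions, not anything from the lab or from your ad system.

```python
import numpy as np

def epsilon_greedy(scores, epsilon, rng):
    """Pick an index: random with probability epsilon, otherwise the best score."""
    if rng.random() < epsilon:
        # Explore: choose any category uniformly at random.
        return int(rng.integers(len(scores)))
    # Exploit: choose the category the model currently rates highest.
    return int(np.argmax(scores))

rng = np.random.default_rng(0)
# Hypothetical model scores for three ad categories.
scores = np.array([0.1, 0.9, 0.3])

# With epsilon = 0.1, roughly one choice in ten is exploratory.
choices = [epsilon_greedy(scores, 0.1, rng) for _ in range(1000)]
```

A larger epsilon gives underrepresented categories more exposure, at the cost of showing the model's current best guess less often.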

I am tagging your username here, and hopefully you will be notified by a system email.


Unbalanced data can bias the model towards the over-represented classes. So, if you are dealing with unbalanced data, it is often a good idea to rebalance it. There are many techniques for balancing data; data augmentation is often used.
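As one simple illustration of rebalancing, here is a sketch of random oversampling: rows of the smaller classes are repeated until every class matches the largest one. This is not necessarily how the lab built its dataset; the function name `oversample_to_balance` and the toy data are assumptions made for the example.

```python
import numpy as np

def oversample_to_balance(X, labels, seed=0):
    """Repeat rows of smaller classes (with replacement) up to the largest class size."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    keep = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Draw extra copies only for classes below the target size.
        extra = rng.choice(idx, size=target - count, replace=True)
        keep.append(np.concatenate([idx, extra]))
    keep = np.concatenate(keep)
    return X[keep], labels[keep]

# Toy data: four "drama" rows and two "western" rows (hypothetical).
X = np.arange(12).reshape(6, 2)
labels = np.array(["drama"] * 4 + ["western"] * 2)

X_bal, labels_bal = oversample_to_balance(X, labels)
print(X_bal.shape)  # (8, 2)
```

If you try this, oversample only the training split. Duplicating rows before splitting can put copies of the same example in both the training and the test set, which makes the evaluation look better than it is.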