Content-Based Filtering - Training Data

danielhopkins80 · November 3, 2022, 4:25am

Hello everyone. I was hoping someone could clarify the training datasets in the content-based filtering lab. The lab indicates that "the reduced dataset has n_u = 397 users; n_m = 847 movies and 25521 ratings." However, when I look at the user matrix (user_train) and the movie item matrix (item_train), I see that both have dimensions 50884 x 17.

The column dimension (m = 17) obviously corresponds to the user and item features. But why does the row dimension (n=50884) of these matrices not correspond to the number of ratings (25521)?

Thanks,
Dan

rmwkwok · November 3, 2022, 5:20am

Hello Dan,

Quoted from the assignment:

Some ratings are repeated to boost the number of training examples of underrepresented genre’s.

If you de-duplicate the data, you get 25521 unique rows.
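One way to check this yourself is to count distinct rows. The snippet below is only a sketch on toy arrays, not the lab's code: it assumes that a training example is identified by its user row paired with its item row, and the names count_unique_examples, user_toy and item_toy are made up for illustration.

```python
import numpy as np

def count_unique_examples(user_rows, item_rows):
    """Count distinct (user row, item row) pairs.

    Assumption: row i of both arrays describes the same rating,
    so the two arrays are joined side by side before de-duplicating.
    """
    paired = np.hstack([user_rows, item_rows])
    return np.unique(paired, axis=0).shape[0]

# Toy stand-ins for user_train and item_train (hypothetical values).
# Row 1 repeats row 0, mimicking a rating that was duplicated.
user_toy = np.array([[1, 4.0], [1, 4.0], [2, 3.5], [1, 4.0]])
item_toy = np.array([[10, 0.0], [10, 0.0], [10, 0.0], [11, 1.0]])

n_total = user_toy.shape[0]
n_unique = count_unique_examples(user_toy, item_toy)
print(f"{n_total} rows, {n_unique} unique")
```

Applied to the lab's arrays, the same idea should show how many of the 50884 rows are repeats, provided the assumption about row pairing holds.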

Raymond

PS: Your question reminds me of a possible cause behind a previous question from another learner, thank you!

danielhopkins80 · November 3, 2022, 2:17pm

Thank you! Very helpful.

Just a quick follow-up question … Is this approach of boosting the number of training examples for underrepresented genres good practice in general? I ask because I am thinking about a similar approach at work for mapping customers to different ad content. We definitely have situations where some “categories” of ads are underrepresented.

Dan

rmwkwok · December 27, 2022, 2:47am

Hello Dan! @danielhopkins80

How are you doing? I am sorry, I must have overlooked this thread. How is your ad model going? Without knowing how your model works and how your dataset behaves, I can't recommend one approach over another. I can only say that it's worth trying and then evaluating the result. If you would like, we can discuss this further in private.

On the other hand, it's also worth giving those underrepresented but big categories more exposure, so that the model can actually learn something from them. In reinforcement learning, we don't always exploit the best move that our model suggests; sometimes we take a random move to explore what we can gain from it. What I am saying is that a balanced strategy is important. Reinforcement learning has to deal with a world full of unknowns, without prior knowledge and sometimes without any data, and your case could be similar.
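The explore/exploit trade-off mentioned above is often illustrated with an epsilon-greedy rule. The following is a minimal sketch, not a recommendation for any particular ad system: the function choose_ad, the category names and the estimated values are all hypothetical.

```python
import random

def choose_ad(estimated_value, epsilon, rng):
    """Epsilon-greedy choice over ad categories.

    With probability epsilon pick a category uniformly at random (explore);
    otherwise pick the category with the highest estimate (exploit).
    """
    if rng.random() < epsilon:
        return rng.choice(sorted(estimated_value))
    return max(estimated_value, key=estimated_value.get)

# Hypothetical click-rate estimates per ad category.
estimates = {"sports": 0.12, "travel": 0.08, "finance": 0.03}
rng = random.Random(0)

# epsilon = 0 never explores, so it always returns the current best.
greedy_pick = choose_ad(estimates, epsilon=0.0, rng=rng)

# epsilon = 0.2 mostly exploits but still shows the other categories.
picks = [choose_ad(estimates, epsilon=0.2, rng=rng) for _ in range(1000)]
```

A larger epsilon gives underrepresented categories more exposure at the cost of showing the current best less often.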

I am tagging your username here and hopefully you will get notified by a system email.

Cheers,
Raymond

shanup · December 27, 2022, 8:59pm

Unbalanced data can bias the model toward the over-represented classes. So, if you are dealing with unbalanced data, it is usually a good idea to rebalance it. There are many techniques for balancing data - data augmentation is often used.
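As a rough illustration of one such technique, the sketch below repeats rows of the smaller classes (random oversampling with replacement) until every class matches the largest one. It is an assumption-laden toy, not the lab's method: oversample_minority, the labels and the feature values are invented for the example.

```python
import numpy as np

def oversample_minority(features, labels, seed=0):
    """Repeat rows of smaller classes until all classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    selected = []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(labels == cls)
        # Draw the missing rows with replacement from this class only.
        extra = rng.choice(idx, size=target - count, replace=True)
        selected.append(np.concatenate([idx, extra]))
    order = np.concatenate(selected)
    return features[order], labels[order]

# Toy data: 6 "action" rows and 2 "documentary" rows (hypothetical).
toy_labels = np.array(["action"] * 6 + ["documentary"] * 2)
toy_features = np.arange(16).reshape(8, 2)

balanced_X, balanced_y = oversample_minority(toy_features, toy_labels)
balanced_counts = dict(zip(*np.unique(balanced_y, return_counts=True)))
```

If you try this, oversample only the training split after splitting the data, so that repeated rows do not leak into the validation or test sets.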
