Hello everyone. I was hoping that someone could provide a bit of clarity on the training datasets in the content-based filtering lab. The lab indicates that "the reduced dataset has n_u = 397 users, n_m = 847 movies and 25521 ratings." However, when I look at the user matrix (e.g. user_train) and the movie item matrix (item_train), I notice that the dimensions of those matrices are 50884 x 17.
The column dimension (m = 17) obviously corresponds to the user and item features. But why does the row dimension (n=50884) of these matrices not correspond to the number of ratings (25521)?
Just a quick follow-up question… Is this approach of boosting the number of training examples of underrepresented genres a good practice in general? I ask because I am thinking about a similar approach at work for mapping customers to different ad content. We definitely have situations where there are "categories" of ads that are underrepresented.
How are you doing? I am sorry, I must have overlooked this thread. How is your ad model going? Without knowing how your model works and how your dataset behaves, I can't say which suggestion would be better, and can only comment that it's worth giving it a try and then evaluating the result. If you would like to, we can continue the discussion privately.
On the other hand, it's also worth letting those underrepresented but big categories gain more exposure, so that the model can actually learn something from them. In reinforcement learning, we don't always exploit the best move that our model gives out; sometimes we take a random move to explore what we can gain from it. What I am saying is that a balanced strategy is important. Reinforcement learning has to deal with a world full of unknowns, without any prior knowledge and sometimes without any data, and your case could be similar.
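To make the explore/exploit idea a bit more concrete, here is a minimal sketch of an epsilon-greedy choice, which is one common way to mix the two. The `scores` list, the `epsilon` value, and the function name are all made up for illustration; I don't know how your ad model actually scores categories.

```python
import random

def epsilon_greedy(estimated_values, epsilon, rng):
    """Pick an index: explore at random with probability epsilon,
    otherwise exploit the index with the highest estimated value."""
    if rng.random() < epsilon:
        # Explore: any option, including currently low-scoring ones.
        return rng.randrange(len(estimated_values))
    # Exploit: the option the model currently believes is best.
    return max(range(len(estimated_values)), key=lambda i: estimated_values[i])

rng = random.Random(0)
scores = [0.10, 0.30, 0.05]  # hypothetical per-category scores (assumption)
picks = [epsilon_greedy(scores, 0.2, rng) for _ in range(1000)]
```

With `epsilon = 0.2`, most picks go to the best-scoring category, but the other categories still get shown occasionally, so the model keeps collecting data about them.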
I am tagging your username here, and hopefully you will get notified by a system email.
Unbalanced data can bias the model towards the over-represented classes. So, if you are dealing with unbalanced data, it is usually a good idea to boost the under-represented classes. There are many techniques to balance the data - data augmentation is often used.
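As a rough illustration of boosting, here is a sketch of plain random oversampling, where rows of the rarer classes are duplicated until every class is as large as the biggest one. This is only one simple balancing technique and not necessarily what the lab does internally; the toy `ratings` rows and the helper name `oversample_minority` are hypothetical.

```python
import random
from collections import Counter

def oversample_minority(rows, label_of, seed=0):
    """Duplicate rows of under-represented labels until every label
    has as many rows as the most common one (random oversampling)."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the target size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Hypothetical toy ratings: (user_id, genre, rating)
ratings = [(1, "Action", 4.0), (2, "Action", 3.5), (3, "Action", 5.0),
           (4, "Action", 2.0), (5, "Documentary", 4.5)]
balanced = oversample_minority(ratings, label_of=lambda r: r[1])
counts = Counter(r[1] for r in balanced)
print(len(ratings), len(balanced), dict(counts))
```

Note that the number of rows grows after balancing, so a boosted training set can end up with more rows than the number of original ratings.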