What does the test data contain if the training data contain all the user ratings?

Based on the above image, I have the following questions:

  1. If the training set includes all the ratings provided by users in the dataset, what does the test set consist of? Additionally, in this scenario, is the test set a subset of the training set, or is it distinct from it?
  2. The image mentions that some ratings are repeated to increase the number of training examples for underrepresented genres. What does the repetition of ratings specifically refer to in this context? Are the user records duplicated, or are the movie records duplicated to address the issue of underrepresentation?

The test set is a set of examples that you did not use during training. These are used to verify how well your system makes predictions on examples it has never seen before.
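As a rough illustration of holding examples back for testing, here is a minimal sketch in plain Python. The `ratings` records and the `split_ratings` helper are hypothetical and are not taken from the lab; the lab may split its data differently.

```python
import random

# Hypothetical (user_id, movie_id, rating) records; the values are made up.
ratings = [
    ("u1", "m1", 4.0), ("u1", "m2", 3.5), ("u1", "m3", 5.0),
    ("u2", "m1", 2.0), ("u2", "m4", 4.5), ("u2", "m5", 3.0),
    ("u3", "m2", 1.5), ("u3", "m3", 4.0), ("u3", "m4", 2.5),
    ("u3", "m5", 5.0),
]

def split_ratings(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and hold out a fraction for testing."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled records become the test set; the rest train.
    return shuffled[n_test:], shuffled[:n_test]

train, test = split_ratings(ratings)
print(len(train), "training examples,", len(test), "test examples")
```

In this sketch no record appears in both lists, so the test examples are ones the model never saw during training.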

Sometimes a data set is “augmented” to make it artificially larger without the cost of collecting more data. For image data, this often consists of resizing, rotating, or mirroring the images.
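The question above mentions repeating ratings for underrepresented genres. One plausible reading is that whole training examples (a user, a movie, and the rating) are repeated, but that is an assumption, not something the lab notes confirm. A minimal sketch under that assumption, with a hypothetical `oversample_by_genre` helper and made-up rows:

```python
from collections import Counter

# Hypothetical training examples: (user_id, movie_id, genre, rating).
examples = [
    ("u1", "m1", "Drama", 4.0),
    ("u2", "m1", "Drama", 3.0),
    ("u3", "m2", "Drama", 5.0),
    ("u1", "m3", "Western", 2.5),
]

def oversample_by_genre(rows):
    """Repeat rows of rarer genres until every genre has as many rows as the largest one."""
    counts = Counter(row[2] for row in rows)
    target = max(counts.values())
    balanced = []
    for genre, count in counts.items():
        genre_rows = [row for row in rows if row[2] == genre]
        # Cycle through this genre's rows, repeating them as needed.
        for i in range(target):
            balanced.append(genre_rows[i % count])
    return balanced

balanced = oversample_by_genre(examples)
print(Counter(row[2] for row in balanced))
```

No new users, movies, or ratings are created here; existing rows simply appear more than once.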

Thanks for your response. The lab notes mention that the training set includes all the ratings made by users in the dataset. However, my understanding is that a user’s ratings should be split between the training and test datasets. How is it that the training set contains all of a user’s ratings?

I’ll review the assignment in more detail and report back later.
