Recommender System NN Question

I'm playing around with different ratings for the 'new user' in cell 34 and noticed something interesting:
If I set only new_documentary = 5.0 (indicating the user only gives high ratings to documentaries), the model returns few to no documentary movies in its predictions. This is surprising to me.
I see that the dataset contains only 13 documentaries; they are the least represented of all the genres. If I instead give the only 5.0 rating to comedies, the predictions look much more like what I'd expect.

I'm trying to apply the diagnostics from the previous course here, so I tried creating a cross-validation split:

from sklearn.model_selection import train_test_split

# 80% train; the remaining 20% is split 50/50 into test and CV sets.
# The same random_state is reused so all three arrays get the same shuffle and stay aligned.
item_train, item_test = train_test_split(item_train, train_size=0.80, shuffle=True, random_state=1)
item_test, item_cv = train_test_split(item_test, test_size=0.50, shuffle=True, random_state=1)

user_train, user_test = train_test_split(user_train, train_size=0.80, shuffle=True, random_state=1)
user_test, user_cv = train_test_split(user_test, test_size=0.50, shuffle=True, random_state=1)

y_train, y_test = train_test_split(y_train, train_size=0.80, shuffle=True, random_state=1)
y_test, y_cv = train_test_split(y_test, test_size=0.50, shuffle=True, random_state=1)
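
A note on the above: an equivalent way is to split all three arrays in one call per step, which guarantees they stay aligned row-wise instead of relying on the shared random_state:

# Passing the arrays together shuffles them with one common permutation.
item_train, item_test, user_train, user_test, y_train, y_test = train_test_split(
    item_train, user_train, y_train, train_size=0.80, shuffle=True, random_state=1)

# Split the held-out 20% in half for the test and CV sets.
item_test, item_cv, user_test, user_cv, y_test, y_cv = train_test_split(
    item_test, user_test, y_test, test_size=0.50, shuffle=True, random_state=1)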

My J_train is 0.0713, J_test is 0.0826, and J_cv is 0.0804. I thought that J_test should generally be less than J_cv?
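
For reference, these can be computed by calling evaluate on each split, roughly like the sketch below (this assumes the two-input Keras model from the lab; I'm glossing over the column slicing the lab does before feeding in the arrays):

# Each call returns the loss on that split (the first value if metrics are also compiled),
# which I'm treating as the cost J.
J_train = model.evaluate([user_train, item_train], y_train, verbose=0)
J_cv    = model.evaluate([user_cv,    item_cv],    y_cv,    verbose=0)
J_test  = model.evaluate([user_test,  item_test],  y_test,  verbose=0)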

Anyhow, my understanding is that these costs indicate high bias, which means we should try to add more features, add polynomial features, or decrease regularization.

Am I getting this right? Any suggestions on getting more documentary recommendations?

I tried doubling the NN size; it didn't help much with the problem, though J_train obviously decreased.

Hello @forrest,

I don't know how we can change the model in order to make good suggestions to that hypothetical user with a 5-star rating for documentaries and 0 stars for everything else. However, I am writing here because I would like to suggest that we temporarily shift our focus from action items to analyzing the problem itself. We have a lot more to do here.

You have definitely made a good point that the number of documentaries is small, and I have verified it myself as well, because we should know about it. OK, so we already know something about the data; then what about the model? What do we know about its effect on the matter?

If we look carefully at the model formulation again, we can see that we are optimizing it to take the dot product (the mathematical dot product) of an existing user and an existing item so that the result is the user's rating of the item. This is certainly an over-simplified statement, but it gives us enough of an idea of how to look at the data again:

So we need to ask ourselves these:

  1. Does our hypothetical user exist in the training dataset?
  2. If they exist, did they rate those 13 documentaries highly?
  3. If they do not exist, did other existing users who are CLOSE to the hypothetical user rate any documentary highly? (And, by the way, how do you think we can find CLOSE users using the model? See the sketch after the next paragraph for one idea.)

We definitely want to verify these, because if neither the answer to my Q2 nor to my Q3 is a YES, then it is unlikely that our recommender can make documentary suggestions to our hypothetical user.
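
As a hint for the CLOSE-users question above, here is one possible sketch. It assumes the trained user tower is available as user_NN, and that scaled_user_train and scaled_new_user_vec hold the scaled feature rows for the training users and the hypothetical user; those three names are my assumptions, not necessarily the notebook's.

import numpy as np

# Map every training user into the model's embedding space with the user tower,
# then measure squared distances to the hypothetical user in that space.
vu_train = user_NN.predict(scaled_user_train)       # shape: (num_users, embedding_dim)
vu_new   = user_NN.predict(scaled_new_user_vec)     # shape: (1, embedding_dim)

sq_dist = np.sum((vu_train - vu_new) ** 2, axis=1)  # squared distance to each training user
closest = np.argsort(sq_dist)[:10]                  # indices of the 10 closest users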

I picked out an existing user who had given some documentaries a 5-star rating, and these recommendations are not surprising, right? Because this is how the model works.

[image: recommendations for that existing user]

Again, I don't know how to improve the model for our hypothetical user, but if you have some ideas and some experiment results to share, we are happy to discuss them!

Cheers,
Raymond

Hey Raymond, thanks for taking the time to help out! I think I’ll need to dig a little deeper and ask some clarifying questions. Please bear with me as I try and test out some of what I’ve learned so far. This has been a really helpful exercise for applying the lessons.

Let me start with the baseline performance. I would think that human baseline performance would be fairly straightforward in the case of recommending movies to a user who rates only one genre 5.0 and all others 0.0: I would just suggest only documentaries to that user. Obviously, with a more mixed set of ratings, that baseline performance would be lower.
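
For concreteness, my mental model of that baseline is something like the sketch below. It assumes a pandas DataFrame of movies with "genres" and "rating_ave" columns; those names are placeholders of mine, not necessarily the notebook's.

import pandas as pd

def baseline_recommend(movies: pd.DataFrame, genre: str, n: int = 10) -> pd.DataFrame:
    """Recommend the n highest-rated movies tagged with the user's single liked genre."""
    candidates = movies[movies["genres"].str.contains(genre, case=False)]
    return candidates.sort_values("rating_ave", ascending=False).head(n)

# e.g. for the hypothetical documentary-only user:
# baseline_recommend(movies_df, "Documentary")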

I wasn’t able to find any users that only have 5.0 ratings for documentaries with all 0.0 ratings for the other genres so Q1 is a no.

Also, coming from the movie data side, some movies are described as only documentaries and nothing else, but many of the documentaries also belong to some other genre. I imagine that would influence the model, since these docs are 'mixed' with other genres that might be better represented in the data (e.g., comedy, drama).

I do have an idea for a feature that could potentially make the model perform better:
For both users and movies, compute a 'purity' score, i.e., users whose ratings skew heavily toward one genre would have a higher purity score (documentary purists). A similar score could be computed for movies; probably not too many movies have more than 3 genres, but a bunch have 2, etc.
The idea here being that the dot product of a high-purity user who likes documentaries and a movie that is purely a documentary would minimize the cost for those examples.
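
Here is a rough sketch of the purity idea, assuming user_genre_ratings is a per-user array of average ratings per genre (the name is a placeholder of mine). The same function could be applied to each movie's one-hot genre row to get a movie-side purity.

import numpy as np

def purity(genre_ratings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Fraction of each row's rating 'mass' concentrated in its top genre.
    Close to 1.0 means a purist; close to 1/num_genres means spread evenly."""
    totals = genre_ratings.sum(axis=1) + eps
    return genre_ratings.max(axis=1) / totals

# Example: a documentary purist vs. a user who rates every genre 3.0
users = np.array([[0.0, 0.0, 5.0, 0.0],
                  [3.0, 3.0, 3.0, 3.0]])
print(purity(users))   # -> approximately [1.0, 0.25]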

Am I way off base here?

Hello @forrest,

I hope you wouldn't mind me being a bit critical of your comments.

This is very reasonable. If we forget about the model in the assignment, and if you asked me to build something like that, then I definitely would not consider the assignment's model, because the baseline approach is completely different from the assignment's approach.

Of course, I am not saying that you brought the baseline up because you thought this was how the assignment's model works; I am saying it this way because I want to make sure we are 100% clear that they are two different approaches.

If I were to build something for the baseline approach, I wouldn't even need the rating scores; the genre information alone would be enough. Consider a user with 5 stars for documentaries and 0 for everything else: in the baseline approach, we would definitely recommend documentaries. However, if that user had rated one and only one documentary, and rated it very badly, a trained model following the assignment's approach would be very unlikely to recommend any documentary. You see, the baseline is about genre, and the latter is about the user's previous interactions with the movies. They are different.

Those genres represent a movie. They set one movie apart from another, but they do not directly determine which movie is liked by which kind of user, regardless of whether we have a 5 in a user's documentary category and a 5 in a movie's or not. The previous rating values determine that. The higher the rating, the more likely a user of this kind will like a movie of that kind.

When you say "ratings skew heavily", did you mean that "the number of ratings skews heavily"? If so, then I think that mode of thinking belongs to the "baseline" approach, but it would not help us understand the assignment's approach, if that is the main goal of our discussion now.

I would say, let’s first be clear about the difference between the “baseline” approach and the “assignment” approach. Do you have any new thoughts after this?

Raymond

Not at all, thank you for helping out. Let me restate my idea of the baseline in a different way, hopefully testing my understanding of the assignment model. I want to understand why the assignment model might not perform so well on the hypothetical case, and how we might improve it.

Let me imagine the "human brain" algorithm for recommending you a movie, given your average genre ratings and your rating scores. If your average ratings are very high for 1 genre, I would give you n recommendations only from that genre; for 2 genres, 50% from each, and so forth. I'm ignoring for now the selection of which movies from each genre; I'll think on that in a bit.

At the upper bound, you like only one genre, and the cost J of my human algorithm is essentially based on how much you like it (and how many high ratings you've given it). For lack of a better term, the signal-to-noise ratio here is strong.

The scenario where you don't really like anything is the 'lower bound' of my human algorithm. I would pick recommendations for you from the users who are most similar to you (who also don't really like anything). The signal-to-noise ratio is still high, but the signal is very weak. Actually, the case of you giving ratings of 5.0 across the board for every movie would also perform really poorly with my human algorithm; it would be too hard to give you things you actually like. Here the signal is very strong, but the signal-to-noise ratio is low.

Which recommendations I give you is based on the dot product of similar users and similar movies (the assignment model). This is the same as the "human algorithm", where I would look at, say, the top 10 similar users and pick from their top 10 picks, aggregated somehow (the model already does this well). If I understand correctly, the cost of the assignment model will be lowest when my ratings closely match the ratings of some number of other users (the more closely, the lower the cost), AND we've rated the same movies similarly.

Here's my attempt at formalizing the above thoughts. A user's ratings can also be characterized by their distribution. The signal-to-noise ratio I described above, for lack of a better statistical term, corresponds to a unimodal distribution. The best rating distribution for the human algorithm is one with a high signal-to-noise ratio. The assignment model, however, doesn't necessarily care about the distribution of a user's ratings. As long as your ratings are similar to other users', it's happy to make a prediction.

So how could we make the model perform better for this hypothetical, weak-rating case? If it can perform well on the weak-signal case (only rating a single doc 1.0 and everything else 0.0), it can probably perform well on a strong-signal case (only rating a single doc 5.0 and everything else 0.0). If we were to collect a lot more user ratings, we're more likely to get more weak- and strong-signal cases, but they will likely still make up a small fraction of the data, and the model won't learn them very well. That was my idea behind using an engineered feature which captures the signal-to-noise ratio, or distribution, of the user's ratings. Given two users with different overall ratings but similar distributions:
[image: ratings_distribution]

Now even if we have a weak signal case, if I understand correctly, the model would have a lower cost if an engineered feature could extract this “signal” from the preferences of other users.
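
Here's a sketch of what I mean by extracting the "shape" of the ratings, assuming per-user genre rating vectors (the names are placeholders of mine):

import numpy as np

def rating_shape(genre_ratings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """L1-normalize each user's genre ratings so the feature encodes the shape
    of the distribution rather than its magnitude."""
    return genre_ratings / (genre_ratings.sum(axis=1, keepdims=True) + eps)

# A weak-signal user (one doc rated 1.0) and a strong-signal user (one doc rated 5.0)
weak   = np.array([[0.0, 0.0, 1.0, 0.0]])
strong = np.array([[0.0, 0.0, 5.0, 0.0]])
print(rating_shape(weak))    # -> [[0. 0. 1. 0.]]
print(rating_shape(strong))  # -> [[0. 0. 1. 0.]]  (same shape, different magnitude)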

In general, when the data set is highly skewed, creating a validation set is problematic. There aren’t enough Documentary films in this data set, so splitting off some for validation just makes the training problem worse.

Hello @forrest,

Thanks for this follow-up. It helped clear up something I might have misunderstood last time.

That is the human algorithm.

For the assignment's algorithm, if I only ever rated movies in 1 genre (and rated them highly), then, by definition, all movies I have rated belong to that and only that genre. In this case, the recommender should also recommend movies of ONLY that genre to me, NOT movies of mixed genres. (Note that, as you have confirmed, in the assignment's dataset no user matches "5 in documentary, 0 in everything else". Given no precedents, it's hard to predict what the recommender would recommend.)

Would this behavior be different from the human algorithm?

I think the signal-to-noise (SNR) analogy is very interesting. I can picture it this way:

I think the cases of broad interest / no interest / specific "uninterest" are all bad in terms of SNR.

It is difficult to predict what the assignment's algorithm is going to recommend by just looking at the SNR. It might be easy for the high-SNR case, but definitely not for the low-SNR ones.

Yes, that sounds like the idea behind the assignment's algorithm. The assignment's algorithm is able to compare the similarity between user and user, user and movie, or movie and movie. So your idea of going from me, to a similar user, to their similar movies, to my recommendations is pretty much the assignment's way.

When we talk about cost, we are comparing the truth and the prediction. For completeness, I would change your sentence a bit into:

the cost of the assignment model's recommendations for me will be lowest when my ratings closely match the ratings of some number of other users (the more closely, the lower the cost), AND we've rated the same movies similarly.

I hope I have not altered too much of what you originally wanted to express, otherwise, please let me know.

We happened to think in the same way :handshake:

The model actually converts the distribution of ratings over the different genres into a distribution over some abstract scales in some abstract dimensions. It also maps both user and movie onto the same set of scales and dimensions for similarity comparison.
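
Roughly, this is what the model does; below is a simplified sketch of the two-tower idea, with placeholder feature counts and layer sizes rather than the lab's exact numbers.

import tensorflow as tf

num_user_features, num_item_features, num_outputs = 14, 16, 32   # placeholders

# Each tower maps its raw features into the same abstract space.
user_NN = tf.keras.Sequential([tf.keras.layers.Dense(128, activation='relu'),
                               tf.keras.layers.Dense(num_outputs)])
item_NN = tf.keras.Sequential([tf.keras.layers.Dense(128, activation='relu'),
                               tf.keras.layers.Dense(num_outputs)])

input_user = tf.keras.layers.Input(shape=(num_user_features,))
input_item = tf.keras.layers.Input(shape=(num_item_features,))
vu = tf.linalg.l2_normalize(user_NN(input_user), axis=1)   # user vector in the abstract space
vm = tf.linalg.l2_normalize(item_NN(input_item), axis=1)   # movie vector in the same space

# The predicted rating is the dot product of the two vectors.
output = tf.keras.layers.Dot(axes=1)([vu, vm])
model = tf.keras.Model([input_user, input_item], output)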

However, I agree that it will still be happy to make a prediction. For example, if another user and I are both users with little interest in any movie, then the recommender could end up recommending to me a movie that was highly rated by that user. It is just similarity; it makes no judgement about whether basing the recommendation on that user is a good move or not.

I think the biggest problem here is that the training set didn't have any user like the hypothetical case. If it had, then the assignment's model would have been able to recommend those users' movies, which would likely be documentaries. Do you agree? If so, would this be a lead?

The assignment's model should NOT recommend documentaries to users who "only rate a single doc 1.0 and everything else 0.0", but it should recommend documentaries to users who "only rate a single doc 5.0 and everything else 0.0".

It depends on what samples you want to collect. Following my previous potential lead, if we lack users who "only rate a single doc 5.0 and everything else 0.0", then should we just collect more users of this type?

I assume your x-axis is genre. Again, as I said before, it is pretty hard to discuss the model's response based on SNR. One reason is that, as I said previously:

The model actually converts the distribution of ratings over the different genres into a distribution over some abstract scales in some abstract dimensions. It also maps both user and movie onto the same set of scales and dimensions for similarity comparison.

We humans want to make comparisons using genres, but the assignment's model uses the abstract dimensions. We don't know what your plots will be converted into in those abstract dimensions. It is therefore very difficult to discuss the model's improvement from the graph.

@forrest,

In order not to get lost in a lengthy and branched-out discussion, I want to emphasize just one of my previous points:

It all comes down to one idea: precedent cases, yes or no, have or have not.
In the assignment's training set, the answer is "have not".

In my previous reply, I showed that the recommender was able to pull relevant recommendations for a type of user that exists in the dataset:

Should we consider adding users based on what we want the model to do better?

Cheers,
Raymond

@rmwkwok Thanks again for the discussion. It's been very helpful for my understanding of the content.

This is true if that user is in the training set, but the idea is to make it work for a new user. I have reproduced the predictions again:

import numpy as np

# Hypothetical new user: average rating 5.0, with only the documentary genre rated highly.
new_rating_ave = 5.0
new_action = 0.0
new_adventure = 0.0
new_animation = 0.0
new_childrens = 0.0
new_comedy = 0.0
new_crime = 0.0
new_documentary = 5.0
new_drama = 0.0
new_fantasy = 0.0
new_horror = 0.0
new_mystery = 0.0
new_romance = 0.0
new_scifi = 0.0
new_thriller = 0.0
new_rating_count = 50

# Assemble the new user's feature vector (new_user_id is defined earlier in the notebook).
user_vec = np.array([[new_user_id, new_rating_count, new_rating_ave,
                      new_action, new_adventure, new_animation, new_childrens,
                      new_comedy, new_crime, new_documentary,
                      new_drama, new_fantasy, new_horror, new_mystery,
                      new_romance, new_scifi, new_thriller]])
Out[30]:

y_p   movie id   rating ave   title                                           genres
0.8   168252     4.3          Logan (2017)                                    Action Sci-Fi
0.8   6502       4.0          28 Days Later (2002)                            Action Horror Sci-Fi
0.8   111759     4.0          Edge of Tomorrow (2014)                         Action Sci-Fi
0.8   79132      4.1          Inception (2010)                                Action Crime Drama Mystery Sci-Fi Thriller
0.8   7361       4.2          Eternal Sunshine of the Spotless Mind (2004)    Drama Romance Sci-Fi
0.8   58559      4.2          Dark Knight, The (2008)                         Action Crime Drama
0.8   54995      3.8          Planet Terror (2007)                            Action Horror Sci-Fi
0.8   27660      3.7          Animatrix, The (2003)                           Action Animation Drama Sci-Fi
0.8   55721      4.3          Elite Squad (Tropa de Elite) (2007)             Action Crime Drama Thriller
0.8   60684      4.0          Watchmen (2009)                                 Action Drama Mystery Sci-Fi Thriller
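
(For completeness, the gist of how the table above is produced is roughly the following. The scaler names and item_vecs are my shorthand for the objects the notebook sets up, and I'm glossing over the id-column slicing the notebook does before predicting.)

import numpy as np

# Repeat the single user vector once per candidate movie, scale both sides,
# predict a rating for every (user, movie) pair, and sort from highest to lowest.
user_vecs = np.tile(user_vec, (len(item_vecs), 1))
y_p = model.predict([scalerUser.transform(user_vecs),
                     scalerItem.transform(item_vecs)])
y_p = scalerTarget.inverse_transform(y_p)       # back to the original rating scale
order = np.argsort(-y_p.reshape(-1))            # highest predicted rating first
top_items = item_vecs[order][:10]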

Working well on users in the dataset (or ones like those in the dataset) but not well for new users means we have a problem with generalization, right?

I understand that as we scale the data and normalize the inputs, the plots won’t “look” to the NN like they look to us. Still, can’t we use an engineered feature that will influence the model’s performance? That’s basically the final question in my line of questioning. I tried experimenting with a few engineered features, but only succeeded in lowering the model performance :slight_smile:
Suppose that new training examples never actually contain the ratings that would encourage our model to perform the way we want it to – feature engineering is a way to fix this, right?

We can say it is a generalization problem, but "generalization problem" is a very large umbrella that can cover a lot of different causes which require very different solutions. As I said, a likely cause of such a generalization problem is that our training set might not have any users that are close to the hypothetical user. I have demonstrated the link between that lack of training data and the problem.
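
One quick way to check that, as a sketch: measure how far the hypothetical user is from the nearest training user in the raw feature space. Here user_train and user_vec are the arrays from your earlier posts; skipping the leading id column is my assumption about their layout.

import numpy as np

# Distance from the hypothetical user to every training user, ignoring the id column.
diffs = user_train[:, 1:].astype(float) - user_vec[:, 1:].astype(float)
distances = np.linalg.norm(diffs, axis=1)

print("closest training user is at distance", distances.min())
# A large minimum distance supports the "no close precedent in the training set" explanation.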

Let’s do some thought experiments to test how generalization works:

  1. We collected food preferences from 10,000 Asians and built a model on that data. Can we use it to recommend food to Europeans?

  2. We collected movie preferences from 10,000 users aged 70-100. Is that dataset our first choice for making predictions for teenagers?

There are common factors between my cases above and our assignment case: (1) they all have some data, (2) we can build models on those data, (3) we will make bad predictions for some hypothetical users, and (4) we may blame those models for being bad at generalizing.

Let’s ask ourselves:

  1. Can we expect an ML model to generalize itself to a new European user based on Asians' data?

  2. Can we expect an ML model to generalize itself to a new teenage user based on the seniors' data?

  3. Can we expect an ML model to generalize itself to a new documentary-only user based on data that does not cover that kind of user, or any kind that is even close?

  4. Lastly, @forrest, suppose we did not reveal the names of the user's and movie's features, meaning that instead of telling you this one is "documentary" and that one is "comedy", and so on, we tell you this one is "12" and that one is "26", just some random, meaningless, coded names, and they do NOT correspond one-to-one between user and movie features.

    • Would you still be wondering why the assignment's predictions aren't satisfying?

    • Would you challenge the model like this: "I have a new hypothetical user with feature '24' set to 5.0 and all other features 0.0. However, it does not recommend me any movie that has feature '126' set to 1. Why is that?"

    • Note that the above questions are very reasonable, because the model never considers the names of the features. The algorithm gives you the same model no matter what names you give to the features. If we want to compare human and machine, the human should not leverage something that the machine does not have. Fair game, right?

    • Now, my answer is that I would not challenge it like that. I don't have a single justification for relating user feature "24" with movie feature "126". You could do it here just because you know they are both documentary. Right? If you don't know that, how are you going to relate user feature "24" and movie feature "126"? Please do think about this. My answer is: by data!

So, here come my arguments.

If the feature names are all coded, data is the only thing even we humans can resort to. Not to mention that our ML model can ONLY EVER find relations in data; the algorithm never reads the feature names. Therefore, if our data never shows any high rating connecting a user with 5.0 in documentary AND a documentary movie, and if our data does not even have any rating records close to what we want to predict for, how can we expect the model to discover that relation?

Humans can discover that, because humans know that "documentary people like documentary movies". We learn this from our social experience. We did not embed such a "social experience rule" into the model, so how can the ML model do something that we humans can?

The machine works under human instructions; did we build that rule into the algorithm?

Another way of asking this question: can you embed that "social experience rule" into the model? It is a real question. Maybe it is possible. :slight_smile:

Can we engineer some features using the dataset of people aged 70-90 so that they must, undoubtedly, dramatically help predict users aged 10-20?

Can we engineer some features using the dataset of people aged 70-90 collected today, so that they must, undoubtedly, dramatically help predict users aged 10-20 born 200 years later?

Where is the limit between "we can" and "we cannot"? How far do we think our data can extrapolate (to a different age segment, to the future, or to different user types)? How much can we expect our limited data to be engineered into explaining something we have never even seen before? Could it be that it is only possibly good for interpolation, and even that is not guaranteed?

Raymond

As I said in the beginning, I don't know how to make that improvement. Our discussion might lead to some clues, whether or not they are in line with your initial idea, and it would be up to you whether you would like to expand on them, formalize them into proposals, and test them out.