Loss function of RNNs

Hi, during Week 1 of Course 5, I started wondering how the loss function is calculated, since there is no single "perfect" generated text; it's subjective. I then learned that the results were human-reviewed. That's why, during the 3rd assignment, I wondered: why are we using humans to review the output when we could use an already existing RNN to judge it? For example, a Python script could extract from Spotify the rating associated with many pieces of music, and those ratings could be used to train a many-to-one RNN "musical judge", which would then evaluate what our many-to-many music-generation model creates. But the idea seems kind of obvious, so are there any reasons why we can't do it, like computational power?
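To make the proposal concrete, here is a minimal sketch of the many-to-one "judge" idea. Everything here is hypothetical: the dimensions, the random (untrained) weights, and the `judge` function are made up for illustration; a real judge would be trained on actual (track, rating) pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: 32-dim note vectors, 64-dim hidden state.
n_x, n_a = 32, 64
Wax = rng.normal(scale=0.1, size=(n_a, n_x))   # input -> hidden weights
Waa = rng.normal(scale=0.1, size=(n_a, n_a))   # hidden -> hidden weights
Wya = rng.normal(scale=0.1, size=(1, n_a))     # hidden -> rating weights
ba = np.zeros((n_a, 1))
by = np.zeros((1, 1))

def judge(sequence):
    """Many-to-one RNN: read the whole sequence, emit one rating in (0, 1)."""
    a = np.zeros((n_a, 1))
    for x_t in sequence:                        # one recurrent step per note
        a = np.tanh(Wax @ x_t + Waa @ a + ba)
    score = Wya @ a + by                        # single output at the end
    return 1.0 / (1.0 + np.exp(-score))        # sigmoid -> predicted rating

# A made-up "piece of music": 20 random note vectors.
piece = [rng.normal(size=(n_x, 1)) for _ in range(20)]
rating = float(judge(piece))
```

The key structural point is that the output is produced only once, after the final time step, which is what makes this many-to-one rather than the many-to-many shape used by the generator.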

This is all "supervised learning", so there need to be labelled data and a loss function in order to drive the training. If you look at the various examples we've seen so far in DLS C5 W1, they all use softmax and cross-entropy loss. That then drives the training based on the labelled training dataset. For cases in which there is a subjective component to the generated result (e.g. the Dinosaur Name case or the Improvising a Jazz Solo case), you can look at or listen to the results (predictions the trained model makes on test data) and then decide whether you think they are pleasing or as good as you had hoped. If not, then you can consider how to improve them, e.g. by including more or different training data or changing some of the hyperparameters (size of hidden state, whether you used GRU or LSTM, which optimization method to use …).
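The softmax plus cross-entropy combination mentioned above can be sketched in a few lines of NumPy. The vocabulary size and logit values below are made up for illustration; in the assignments the logits would come from the RNN's output layer at each time step.

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Toy vocabulary of 4 characters; logits for one time step of an RNN.
logits = np.array([2.0, 0.5, -1.0, 0.1])
probs = softmax(logits)

# Cross-entropy loss for the labelled "correct" next character (index 0):
target = 0
loss = -np.log(probs[target])
```

The loss is low when the model assigns high probability to the labelled next token and grows as that probability shrinks, which is exactly the signal that drives training: over a whole sequence, the per-step losses are summed (or averaged), so the labels are the actual next tokens in the training data, with no human judgment needed during training itself.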

Okay, thanks for the explanation! It was really helpful. I appreciate your support!

Glad it helped! Although, reading your original question again, I didn't really address the last part of it. Your proposal for how to build a "judging system" to evaluate the results sounds potentially interesting, but it has more the flavor of a "recommender" system. That topic isn't really covered in DLS, but I think there is a section on recommender systems in MLS. That's a newer specialization and I haven't taken it yet. This is just intuition and I've never tried anything like what you are suggesting, but my guess is that the information that Spotify users give high ratings to a given music track probably is not fine-grained enough to drive the training of something like our Jazz Improvisation model. Our existing training methodology takes advantage of the power of an RNN to look at and evaluate the actual sequences that occur, so it learns to imitate the training data it is fed. Just saying "88% of people like this song" is pretty low resolution, compared to looking at sequences of notes that some actual musician performed (and which you are saying are worthy of imitation by including them in the training set).
