Week 1 Assignment 3 music_inference_model

I am getting an error on this line:

in music_inference_model(LSTM_cell, densor, Ty)
     38     for t in range(Ty):
     39         # Step 2.A: Perform one step of LSTM_cell. Use "x", not "x0" (≈1 line)
---> 40         a, _, c = LSTM_cell(inputs=x, initial_state=[a, c])
     41
     42         # Step 2.B: Apply Dense layer to the hidden state output of the LSTM_cell (≈1 line)

which eventually complains about expecting more inputs:

ValueError: Layer lstm expects 5 inputs, but it received 3 input tensors. Inputs received: [<tf.Tensor 'input_8:0' shape=(None, 1, 90) dtype=float32>, <tf.Tensor 'a0_7:0' shape=(None, 64) dtype=float32>, <tf.Tensor 'c0_7:0' shape=(None, 64) dtype=float32>]

I’ve used this exact syntax to call lstm_cell in the previous exercise with no issue, so what am I missing here?

Also, why are we sampling using the max prediction? I thought we were supposed to treat the output prediction as a probability vector and sample the next value from it. Don't we always get the same output now?

I suspect the problem is in the step that converts the one-hot vector into the tensor that must be passed to LSTM_cell on the second iteration. So please check the syntax of your RepeatVector call.

If the error still occurs after fixing the RepeatVector syntax, try resetting LSTM_cell by re-running the cell far above that defines it…

LSTM_cell = LSTM(n_a, return_state = True) # Used in Step 2.C

Still getting the same error after re-running the cell defining LSTM_cell. My code is as follows (which I will delete later), as I am out of ideas:

You did not create a one-hot vector correctly.

tf.one_hot(
indices,
depth,
on_value=None,
off_value=None,
axis=None,
dtype=None,
name=None
)

If you set depth = 1, then it may not be called one-hot anymore, since there is only one element. :slight_smile:
You need to set the correct depth, which should be the same as the number of possible input values.
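
For illustration, here is a minimal self-contained sketch of how depth affects the result (the index value 42 is made up; 90 matches n_values from the notebook):

```python
import tensorflow as tf

n_values = 90  # number of possible music values, as in the notebook

# index of the most likely value for one example, e.g. the result of tf.math.argmax(out, axis=-1)
idx = tf.constant([42])

x = tf.one_hot(idx, depth=n_values)    # shape (1, 90): one slot per possible value, a real one-hot vector
too_small = tf.one_hot(idx, depth=1)   # shape (1, 1): only one slot, so it cannot represent value 42 at all
print(x.shape, too_small.shape)        # (1, 90) (1, 1)
```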

As you are aware, please remove your code from the post.

Thanks! That would’ve taken me a long time to debug since I assumed I understood one_hot but apparently not. In general I find the official TF documentation to be a mixed bag. It’s not terribly useful to have “depth of one hot vector” be the description of a parameter called “depth” with no further clarification. It didn’t help either that all examples used “depth=3”.

Can you also address my other question? Why are we sampling using the max element of out? I thought we were supposed to use out as a probability vector and sample the next y_t from it, per the lecture and the dinosaur name assignment.

I think the trick is this:

n_values = 90 # number of music values
reshaper = Reshape((1, n_values)) # Used in Step 2.B of djmodel(), below
LSTM_cell = LSTM(n_a, return_state = True) # Used in Step 2.C
densor = Dense(n_values, activation='softmax') # Used in Step 2.D

Those are defined as global variables.

Then, as you remember, we trained these with djmodel(). (You can see that all of them are passed to djmodel() as parameters.) The label is Y, which is a time-shift of X. With this, the network learns what the next value (note) can be, using the "categorical_crossentropy" loss function.
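
As a rough, self-contained sketch of that X / Y relationship (the data here is random and the notebook's exact preprocessing differs slightly):

```python
import numpy as np

m, Tx, n_values = 60, 30, 90
# random one-hot "music values", shape (m, Tx, n_values); the notebook loads real data instead
X = np.eye(n_values)[np.random.randint(0, n_values, size=(m, Tx))]

# Y is X shifted by one time step: the label at step t is the value at step t+1,
# so minimizing categorical_crossentropy teaches the layers to predict the next value.
Y = np.roll(X, shift=-1, axis=1)   # wraps the first value around to the end; the notebook handles this detail differently
Y = np.swapaxes(Y, 0, 1)           # shape (Tx, m, n_values): one label array per time step
print(X.shape, Y.shape)            # (60, 30, 90) (30, 60, 90)
```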

Then we use these "trained" global variables to build music_inference_model(). You can see that LSTM_cell and densor are passed in here.
This trained LSTM_cell generates the next value (note) based on the input value (note), just as it was trained to do in djmodel. So what we need to do in this inference model is pick one value from the output (with tf.math.argmax). (For the next iteration, we convert it to a one-hot vector and pass it to the next LSTM_cell step as well.)

That’s the flow.
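
As a minimal, self-contained sketch of that flow (fresh, untrained layers are created here so the snippet runs on its own; in the notebook you would of course reuse the trained global LSTM_cell and densor, and Ty is set by the notebook):

```python
import tensorflow as tf
from tensorflow.keras.layers import Input, LSTM, Dense, RepeatVector
from tensorflow.keras.models import Model

n_values, n_a, Ty = 90, 64, 50
LSTM_cell = LSTM(n_a, return_state=True)          # untrained stand-in for the trained global layer
densor = Dense(n_values, activation='softmax')    # untrained stand-in for the trained global layer

x0 = Input(shape=(1, n_values))   # the first value fed to the model
a0 = Input(shape=(n_a,), name='a0')
c0 = Input(shape=(n_a,), name='c0')
x, a, c = x0, a0, c0
outputs = []

for t in range(Ty):
    a, _, c = LSTM_cell(inputs=x, initial_state=[a, c])  # one LSTM step
    out = densor(a)                                      # softmax over the 90 values
    outputs.append(out)
    idx = tf.math.argmax(out, axis=-1)                   # pick the most likely value
    x = tf.one_hot(idx, depth=n_values)                  # convert it back to a one-hot vector
    x = RepeatVector(1)(x)                               # reshape to (None, 1, n_values) for the next step

inference_model = Model(inputs=[x0, a0, c0], outputs=outputs)
```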

I personally think this is not a good design for a programming assignment. Even if a learner successfully trains a model and gets a "trained" LSTM_cell, it can easily be broken during the implementation of music_inference_model(), since it is a global variable. Once we fail in implementing music_inference_model(), even if we then fix the bug correctly, it may still not work, because this global variable is broken. We will keep getting error messages, just like you saw.
So the key point for debugging is the error that you encountered first. That is the real error message. After that you are using a broken LSTM_cell, so you keep seeing error messages that are sometimes unrelated to your real bug. This makes problem determination difficult.

Anyway, the first half of this reply addresses your question, and the second half is my guess as to why your debugging took so long.

Hope this helps.

I forgot to add one thing.
For programming and debugging purposes, I introduced the way to reset LSTM_cell above.
This definitely helps you implement the right code for that cell. But, of course, LSTM_cell is no longer a trained model once you reset it. So, once everything works, it is better to re-run all the cells from the top to get a better result.
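
For reference, this is the kind of reset being referred to: re-running the cell that defines the shared global layers replaces them with fresh, untrained layers (the definitions below mirror the ones quoted earlier in the thread):

```python
from tensorflow.keras.layers import LSTM, Dense, Reshape

# Re-running these definitions replaces the shared global layers with fresh, UNTRAINED ones.
# That clears any state left over from a buggy call, which is useful for debugging,
# but it also throws away the weights learned by djmodel.
n_values = 90
n_a = 64
reshaper = Reshape((1, n_values))                  # used in Step 2.B of djmodel()
LSTM_cell = LSTM(n_a, return_state=True)           # used in Step 2.C
densor = Dense(n_values, activation='softmax')     # used in Step 2.D

# Once music_inference_model() builds without errors, re-run the notebook from the top,
# including the djmodel training cell, so the inference model uses trained weights again.
```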

But, actually, no one knows what the best output is from a jazz point of view, though… :wink:

I appreciate the response and I agree this assignment feels off in terms of design compared to the other assignments throughout the specialization. Seeing that other thread about pedagogy for week 4, it feels like this will be the theme throughout the course.

On the output generation note, what I am getting at is that we could also have generated the next y_t using the random sampling method described in lecture, that is, sampling according to the softmax output probabilities instead of just choosing the element with the max probability. So basically the assignment chose to select the note with the max probability for better "artistic" expression? The final output I got is not remotely close to decent jazz, so I find it strange that the assignment would use this sampling method at the cost of confusing students who are trying to solidify their understanding of the lectures via the assignments.

Hey @abelian_group_chen,
I guess what you have said is indeed the answer.

I went through the entire assignment and couldn't find a single reference as to why they have done it this way. I guess it would be better for learners if this assignment used the same methodology of sampling from a probability distribution instead of taking the argmax and choosing that value.

Another way out could be a simple note that tells the learner about this fact, but since using the most probable value doesn't make any audible difference in the outputs, I would personally prefer the first way out.

What do you think @anon57530071?

Regards,
Elemento

I see no big difference between the two. In this exercise, we want to pick one value. The way to pick one can be using argmax to take the most probable one, or sampling one from the given probability distribution. Even for text generation, I have seen both.
One of the reasons I'm OK with either is that neither is enough on its own. :wink:
For practical use, we may need to care more about the musical flow, like verse/chorus (the analogue of context in sentence generation), not single-note generation. The attention mechanism may work better there.
From a coverage viewpoint, a pattern that uses argmax may be OK, since both the Dinosaurus and Shakespeare exercises sample with random choice from a probability distribution.
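
As a small, self-contained illustration of the two strategies (the probability vector below is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(90))          # stand-in for densor's softmax output over the 90 values

greedy_idx = int(np.argmax(probs))          # this assignment: always take the most probable value
sampled_idx = int(rng.choice(90, p=probs))  # Dinosaurus/Shakespeare style: sample from the distribution
print(greedy_idx, sampled_idx)
```

With the same trained layers and the same initial x0, the argmax path always produces the same sequence, while the sampled path can differ from run to run.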


I should also mention that the quiz from week 1 essentially told us it's wrong to sample using argmax, so I find it confusing (and no doubt it will be for other students) that we are implementing argmax sampling in assignment 3, especially after having just implemented random sampling in assignment 2.

Hey @abelian_group_chen,
I am assuming you are referring to C5 Q1 Quiz Q5. If you take a close look at the question, you will find that it explicitly states "are using it to sample random sentences". That necessitates random sampling according to the probability distribution produced by the softmax, but nowhere in this assignment is it mentioned that we are using the model to predict random sequences.

So, I guess it has been done in this assignment from the point of view of maximum possible coverage, i.e., learners should know the different ways that a trained RNN model can be used to generate sequences, be it text, sound, video, etc.

Regards,
Elemento

I think that's partly the issue with this assignment: there is little context for what it's trying to do. It doesn't say we are sampling random sequences (I thought we were; otherwise, how do I get the sequence at the end of the assignment? My current results sound very different). But it also doesn't say why we are doing argmax sampling. There is zero context, and maybe there will be others like me who find this to contradict the lecture material.

I think @Elemento has covered most of it.
If I remember correctly, the quiz focused on a plain RNN. In that case, if we do not pick a (random) sample from the probability distribution, the result may be a repetitive sequence. In the case of an LSTM, we have both a cell state and a hidden state, so the risk of producing a repetitive sequence may be smaller than with a simple RNN. (Remember that the Dinosaurus exercise uses an RNN, so we need to pay special attention to that point.) That may be the reason why this exercise used argmax.