RNN input doubt

Sir, in an RNN, why do we give the input x1 = 0 vector? Why don’t we give the one-hot encoding of the first word, or some other type of input?

Sir, if an RNN comes under supervised machine learning, then what is the form of the input to an RNN?

Please see this. The ellipsis refers to the number of features per timestep, e.g. [batch_size, timesteps, # features per timestep].
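If it helps, here is a tiny numpy sketch (the sizes and word ids below are made up, purely for illustration) of what an input batch with that shape looks like when the per-timestep features are one-hot word vectors:

```python
import numpy as np

# Made-up sizes: here "# features per timestep" is just the vocabulary size
batch_size, timesteps, vocab_size = 2, 4, 10

# Hypothetical integer word ids for each sequence in the batch
word_ids = np.array([[3, 1, 7, 2],
                     [5, 0, 4, 9]])

# One-hot encode: the result has shape [batch_size, timesteps, vocab_size]
x = np.eye(vocab_size)[word_ids]
print(x.shape)  # (2, 4, 10)
```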

It sounds like you are perhaps mixing up the inputs with the hidden state of an RNN. This is supervised learning, and the inputs depend on what kind of sequences you are dealing with. As you say, they are frequently one-hot vectors representing things like letters in an alphabet, words in a vocabulary, or musical notes in a scale. The labels are whatever you are trying to train the RNN to produce, which could be anything from a sentiment classification to a translation of the input into a different language to a musical composition to …

So the inputs (the x^{<t>} values) are not typically zero; what we typically start at zero is the initial hidden state (the a^{<0>} value).
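To make that distinction concrete, here is a rough numpy sketch (the sizes, word ids, and parameter initialization are placeholders, not course code) of one forward pass of a vanilla RNN in which the one-hot inputs x^{<t>} are non-zero but the initial hidden state a^{<0>} is zero:

```python
import numpy as np

vocab_size, hidden_size, T = 10, 5, 4           # made-up sizes

# One-hot inputs x^<1> ... x^<T> -- these are NOT zero
x = np.eye(vocab_size)[np.array([3, 1, 7, 2])]  # shape (T, vocab_size)

# Placeholder parameters
Wax = np.random.randn(hidden_size, vocab_size) * 0.01
Waa = np.random.randn(hidden_size, hidden_size) * 0.01
ba = np.zeros(hidden_size)

# The hidden state a^<0> is what starts at zero
a = np.zeros(hidden_size)
for t in range(T):
    # Vanilla RNN step: a^<t> = tanh(Waa a^<t-1> + Wax x^<t> + ba)
    a = np.tanh(Waa @ a + Wax @ x[t] + ba)
```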

This is a completely new topic, very different from and in some ways even more complicated than the CNNs we learned about in Course 4. I suggest you listen to Prof Ng’s lectures in Week 1 of C5 again with what I said above in mind and see if things make more sense the second time through.

Sir, in Andrew Sir’s video he said this: “for language model, it’ll be useful to represent the sentences as outputs y rather than as inputs x.” This is confusing me: why aren’t the sentences we got from speech recognition used as the input (x)?

Hey @Ashish_Sharma6,
Can you please share which video you are referring to, along with the timestamps? That will help us provide the context.

Cheers,
Elemento

As Elemento says, it would help for us to see the same portion of the lecture. But if you’re doing speech recognition, then the input is the audio, right? And the output is the text. What you then later decide to do with the output text (e.g. translate it into French) is a completely separate RNN operation. You could have solutions involving several different RNNs to perform parts of a more complex task.

Sir, it is the Week 1 video “Language Model and Sequence Generation”, at time 2:32.

Hey @Ashish_Sharma6,
To add context to Paul Sir’s answer from the video, let me provide a breakdown of the video for you. In this video, Prof Andrew discusses “What are language models, and how to create one?”.

Now, from 0:27 to 2:20, Prof Andrew explains a key idea behind language models using the example of a “speech recognition system”. The key idea explained here is that “language models try to output only likely sentences”. After 2:20, the running example is a “text-generating” language model, so make sure not to confuse a text-generation model with a speech recognition model.

And now from 2:20, Prof Andrew mentioned:

So the basic job of a language model is to input the sentence, which I’m going to write as a sequence y^{<1>}, y^{<2>} up to y^{<T_y>}, and for language model, it’ll be useful to represent the sentences as outputs y rather than as inputs x. But what a language model does is it estimates the probability of that particular sequence of words.

Now, to understand this, we just need to understand what a language model does. The answer is simple: IT GENERATES TEXT. What kind of text, you ask? The kind of text that the model was trained on, I say. Once the model is trained, during inference, we feed it some starting input, and it generates a whole piece of text from it.
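The “estimates the probability of that particular sequence of words” part of the quote can be written out explicitly: the probability of a whole sentence factorizes into a chain of next-word probabilities, and those next-word probabilities are exactly what the model outputs one timestep at a time:

P(y^{<1>}, y^{<2>}, …, y^{<T_y>}) = P(y^{<1>}) · P(y^{<2>} | y^{<1>}) · … · P(y^{<T_y>} | y^{<1>}, …, y^{<T_y − 1>})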

The most common starting input that is fed is a vector of zeros, so let’s say we feed that. Also, let’s assume that the corpus contains text from a rapper’s works, so the model might generate, “Hey yo, it’s your boy, AI_BARISTA!” :joy:
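As a rough sketch of that generation (inference) loop, and assuming a hypothetical `rnn_step(x, a)` helper that returns the next hidden state together with a softmax distribution over the vocabulary, it could look something like this:

```python
import numpy as np

def generate(rnn_step, vocab_size, hidden_size, max_len=20, eos_id=0):
    """Sample a sequence from a trained language model (illustrative sketch only)."""
    x = np.zeros(vocab_size)    # x^<1> is a vector of zeros
    a = np.zeros(hidden_size)   # a^<0> is also zeros
    generated = []
    for _ in range(max_len):
        a, probs = rnn_step(x, a)                        # probs: softmax over the vocab
        word_id = np.random.choice(vocab_size, p=probs)  # sample the next word
        if word_id == eos_id:                            # stop at an end-of-sentence token
            break
        generated.append(word_id)
        x = np.eye(vocab_size)[word_id]                  # sampled word becomes next input
    return generated
```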

Now, what Prof Andrew mentions in this paragraph is how we train such a model. The answer is simple once again: IT IS MADE TO PREDICT WORDS. We take samples from our corpus, for instance “I love to learn about AI”, feed the first few words to the model, say “I love to learn about”, and make the model predict the next word. This way the model learns to predict the most likely words, one after another.
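A minimal illustration of that next-word prediction setup (using the sentence from the example above; the tokenization is just a whitespace split for simplicity):

```python
sample = "I love to learn about AI".split()

# At each timestep the model sees the words so far and must predict the next one
for t in range(1, len(sample)):
    context, target = sample[:t], sample[t]
    print(" ".join(context), "->", target)
# "I" -> "love", "I love" -> "to", ..., "I love to learn about" -> "AI"
```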

Lastly, let’s understand why Prof Andrew said this:

So the basic job of a language model is to input the sentence, which I’m going to write as a sequence y^{<1>}, y^{<2>} up to y^{<T_y>}, and for language model, it’ll be useful to represent the sentences as outputs y rather than as inputs x.

You see, during training, the samples serve as the true outputs (the ground truth), and when we start training, as I said, we begin with some starting input, for instance a vector of zeros. Additionally, during training, the input at time-step t+1 is nothing but the true output at time-step t, so the inputs are formed entirely out of the outputs. That’s why Prof Andrew said that for a language model, it is helpful to represent the samples with y instead of x.
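Here is a tiny sketch of that shift (the word ids and vocabulary size are made up): the training inputs are just the ground-truth outputs delayed by one timestep, with a zero vector in the first position.

```python
import numpy as np

vocab_size = 10

# Ground-truth sentence as one-hot vectors y^<1>, ..., y^<T_y> (made-up word ids)
y = np.eye(vocab_size)[np.array([4, 2, 7, 1])]

# Training inputs: x^<1> is a zero vector, and x^<t+1> = y^<t>
x = np.vstack([np.zeros((1, vocab_size)), y[:-1]])

# At each timestep t, the model is fed x^<t> and trained to predict y^<t>
```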

Now, don’t worry if you don’t understand some of these things yet, since Prof Andrew discusses them in great detail in this and the following videos.

Let us know if this helps you out.

P.S. - @paulinpaloalto Sir, please do add your concluding thoughts to this.

Cheers,
Elemento
