I don't quite understand the purpose of the LSTM layer.
Without an LSTM, I feed in an encoded sentence, e.g. “The sky is blue”.
This is used as input as one whole sequence at once, e.g. [2, 5, 3, 1].
With an LSTM, is it the same input at a single time step?
How is a 2nd time step filled?
Or are the tokens divided into e.g. 4 time steps, with t1=[2], t2=[5], …?
The simple example is that at each time step, you’re trying to predict the next letter in the sequence.
So the input and the output are the same data, but the output is shifted one step into the future.
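For example, a minimal sketch in plain Python (hypothetical token IDs) of how input and target line up per time step:

```python
# Hypothetical token IDs for the encoded sentence "The sky is blue"
tokens = [2, 5, 3, 1]

# Input and target are the same data, shifted by one step into the future:
# at time step t the model sees tokens[t] and should predict tokens[t + 1].
x = tokens[:-1]   # [2, 5, 3]
y = tokens[1:]    # [5, 3, 1]

for t, (inp, target) in enumerate(zip(x, y)):
    print(f"time step {t}: input {inp} -> target {target}")
```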
I am not familiar with the material in NLP C3, so I don’t know how it is presented there or what the assignments look like. I have taken DLS C5, which covers Sequence Models, including everything from basic RNNs to Attention Models.
Tom has answered the fundamental question of how the data works on a timestep basis. But if what you are asking is how is it different if you try to solve the same sequence model problem with a simple RNN versus an LSTM, then the answer (at least as I understand it) is that there is no difference in how you handle the timestep data in the two cases. The difference between an LSTM and a plain vanilla RNN is that the “hidden state” of the node is a great deal more complicated and expressive. That enables an LSTM to more easily learn sophisticated relationships and dependencies between earlier timestep values and later ones. There is no way to prove that a simple RNN couldn’t learn the same thing in theory, but the point is that creating explicit mechanisms like the “update” and “forget” gates makes it easier for the LSTM model to learn effective behavior.
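As a small illustration of that point (hypothetical sizes, assuming TensorFlow/Keras), swapping the layer changes nothing about how the timestep data is fed in:

```python
import tensorflow as tf

# A hypothetical batch: 1 sequence, 4 time steps, 8 features per step
x = tf.random.normal((1, 4, 8))

# The timestep data is identical in both cases; only the recurrent layer differs.
simple_rnn = tf.keras.layers.SimpleRNN(units=16)
lstm = tf.keras.layers.LSTM(units=16)

print(simple_rnn(x).shape)  # (1, 16) - final hidden state per sequence
print(lstm(x).shape)        # (1, 16) - same shape, richer internal gating
```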
Thanks for your explanations.
My current understanding: when using the trained model for inference on a single input sentence, e.g. “The sky is blue”, these 4 words plus additional padding are fed directly to the model’s input nodes.
Model e.g.:

```python
tf.keras.Input(…),
tf.keras.layers.Embedding(…),
tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(…)),
tf.keras.layers.Dense(…)
```
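For concreteness, a sketch with hypothetical sizes filled in for the elided arguments (assuming a binary classification head):

```python
import tensorflow as tf

# Hypothetical sizes filled in for the elided arguments
vocab_size, embedding_dim, units, max_len = 1000, 16, 32, 6

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# "The sky is blue" as token IDs, padded to max_len; the whole padded
# sequence is passed to the model in a single call.
sentence = tf.constant([[2, 5, 3, 1, 0, 0]])  # shape (1, 6)
print(model(sentence).shape)                  # (1, 1)
```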
Without the LSTM, the whole Embedding-layer output is connected to the Dense-layer nodes and used by the Dense layer in one calculation step.
With the LSTM I would expect the same: all the data is passed to the LSTM in one single calculation step, so I see no connection between the individual words of the sentence.
Or (my understanding now): the LSTM calculates its 1st step and the result is used as additional input in the 2nd step? That creates a connection between the individual words, but all of this still happens within this single input sentence.
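If that reading is right, a minimal sketch of the idea would be one LSTM cell applied step by step (hypothetical sizes, assuming TensorFlow/Keras):

```python
import tensorflow as tf

embedding_dim, units = 8, 16                       # hypothetical sizes
cell = tf.keras.layers.LSTMCell(units)

# Embedded "The sky is blue": 1 sentence, 4 time steps, one vector per word
inputs = tf.random.normal((1, 4, embedding_dim))

# Start from a zero hidden/cell state and walk over the words:
# the state produced at step t is fed back in at step t + 1.
state = [tf.zeros((1, units)), tf.zeros((1, units))]
for t in range(inputs.shape[1]):
    output, state = cell(inputs[:, t, :], states=state)
    print(f"step {t}: output shape {output.shape}")
```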