Hi! I watched the DL specialization and the NLP courses, but I still can’t understand how an LSTM works under the hood. So we have gates that pass information from words into a “pipe” (the cell state), which can carry extra information that future computations can build on. This also prevents gradient problems. OK, but how will my algorithm figure out the principles for selecting words and for using them later? Is that set by gradient descent when finding the best gate weights W? I can’t picture what this process looks like either.
What both GRU and LSTM do over and above a basic RNN is add explicit “gates” that make the hidden state the RNN learns more powerful and expressive. In principle, I suppose a “plain vanilla” RNN without explicit “update” and “forget” gates could use its hidden state to implement similar behavior. But LSTM and GRU make the function of those subsets of the hidden state explicit, and (apparently) that makes it easier for the usual backpropagation-based training to learn those complex behaviors. As with everything in any kind of supervised neural network, the learning happens by training the network on the training data using backpropagation, driven by a cost function chosen so that the learning is meaningful. As with everything, it either works or it doesn’t, right? Meaning that people must have tried plain vanilla RNNs and LSTMs on the same tasks, and LSTMs must perform better on some significant classes of problems, otherwise they wouldn’t be a big deal.
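If it helps to see the moving parts, here is a minimal NumPy sketch of one LSTM time step. The weight names (W_f, W_i, W_o, W_c) follow the standard notation from the literature; the shapes and random initialization are purely illustrative, not a real trained model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step: x_t is the current input vector, h_prev and
    c_prev are the previous hidden and cell states, p holds the weights."""
    z = np.concatenate([h_prev, x_t])            # standard [h, x] concatenation
    f = sigmoid(p["W_f"] @ z + p["b_f"])         # forget gate: what to erase from c
    i = sigmoid(p["W_i"] @ z + p["b_i"])         # input gate: what to write into c
    o = sigmoid(p["W_o"] @ z + p["b_o"])         # output gate: what to expose as h
    c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])   # candidate new cell contents
    c_t = f * c_prev + i * c_tilde               # mostly-additive update: this is
                                                 # what eases the gradient problems
    h_t = o * np.tanh(c_t)
    return h_t, c_t

# Illustrative shapes: hidden size 4, input size 3. The weights are random
# here, but in a real network gradient descent sets them during training.
rng = np.random.default_rng(0)
n_h, n_x = 4, 3
p = {w: rng.standard_normal((n_h, n_h + n_x)) for w in ("W_f", "W_i", "W_o", "W_c")}
p.update({b: np.zeros(n_h) for b in ("b_f", "b_i", "b_o", "b_c")})
h, c = lstm_step(rng.standard_normal(n_x), np.zeros(n_h), np.zeros(n_h), p)
```

Notice that nothing in that code decides what the gates mean. All of W_f, W_i, W_o, W_c and the biases are ordinary trainable parameters, so yes: the “selection principles” you asked about end up encoded in those weights by gradient descent, not in any rule someone wrote down.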
So how does the model understand whether the LSTM should handle, say, verb pluralization or adding ’s to the end of a word? And should we manually program these gates? If not, how does the model figure out how the gates should work?
You should not do anything manually with the gates, other than selecting whether you want to use a GRU or an LSTM model. The point is that it learns what works through training. Training a language model requires a lot of labelled data, of course.
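To make that concrete: in a framework like PyTorch, for example, the choice really is one line, and everything inside the gates is learned for you. This is just a sketch with arbitrary hyperparameters, not a recommended architecture:

```python
import torch.nn as nn

class TinyLanguageModel(nn.Module):
    def __init__(self, vocab_size=10_000, embed_dim=128, hidden_dim=256,
                 cell="lstm"):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The only "gate decision" you ever make: which cell type to use.
        rnn_cls = nn.LSTM if cell == "lstm" else nn.GRU
        self.rnn = rnn_cls(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        h, _ = self.rnn(self.embed(token_ids))
        return self.out(h)   # next-token logits at every position

# Training this with the usual cross-entropy loss and backprop is what
# sets the gate weights; there is no step where you program them by hand.
model = TinyLanguageModel(cell="lstm")
```

The gradients of the loss flow back through the gate weights exactly as through any other layer, so “programming the gates” just means running the optimizer.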
So do you mean that when we train on a huge number of sentences, the model will figure out from a word’s embedding that some word earlier in the sentence is plural? And from that it should work out by itself that, if another noun subject appears in the sentence, it should cancel out the influence of the first noun subject?
How many training sentences would it take to make the model that smart?
There is no magic number, but it’s a lot.
Or to put it in more practical terms: enough that it works. You have to experimentally determine how much that is.
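One common way to run that experiment is a learning-curve sweep: train the same model on nested subsets of your data and watch where validation performance plateaus. A rough sketch follows; train_and_eval is a stand-in for whatever training loop you already have, stubbed here with a fake saturating curve just so the snippet runs:

```python
def train_and_eval(train_sentences, val_sentences):
    """Placeholder: train your model on train_sentences and return
    validation accuracy. The formula below is a fake saturating curve."""
    return 0.9 - 2.0 / (len(train_sentences) ** 0.25)

# Dummy corpora so the sketch is self-contained.
training_sentences = [f"sentence {i}" for i in range(1_000_000)]
validation_sentences = [f"held-out sentence {i}" for i in range(10_000)]

for n in [10_000, 50_000, 100_000, 500_000, 1_000_000]:
    acc = train_and_eval(training_sentences[:n], validation_sentences)
    print(f"{n:>9} sentences -> validation accuracy {acc:.3f}")
# When the curve flattens, extra data of the same kind is buying little;
# that plateau is the practical answer to "how much is enough".
```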
So, do all LSTMs work like this? Do they try to find correlations with characteristics of words seen in the past, or maybe combinations of characteristics, that should be passed forward to the future? And all of this to get the desired result whenever the detected patterns of word characteristics repeat in other sentences? Can you maybe add to my thoughts? Because it really looks like magic at first view. It must be really hard work for a model, without any explanations given to it… I sympathize.
Yes, that is the “magic” of LSTMs. There are lots of other good sources of information and intuition about LSTMs and RNNs in general. I found this article by Andrej Karpathy very helpful, and he also has a number of relevant lectures out on YouTube, e.g. this one.