In the video "Language Model and Sequence Generation", Andrew Ng explains that the output y(3) is trying to estimate p(y3 | y2, y1). This seems very similar to the more traditional Markov chain. I'm curious about the similarities and differences between those approaches. Here is my first take on it:
-
A second-order Markov model will also give you p(y3 | y2, y1). The difference is that the Markov model estimates this probability from sample counts in the data, whereas the RNN trains a neural network to predict it. Also, given the high dimensionality of x/y (a vocabulary of 10,000 was mentioned for the NLP scenario), I guess it would be quite tricky to estimate the probabilities of all the different context combinations with a higher-order Markov model.
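To make the first point concrete, here is a tiny sketch of what I mean by "sample probability" (toy corpus and names are purely illustrative): a second-order Markov model essentially just counts contexts.

```python
from collections import Counter

# Toy corpus; in a real NLP setting the vocabulary is ~10,000 words, so the
# number of (y1, y2) contexts you would need reliable counts for explodes.
corpus = "the cat sat on the mat the cat lay on the mat".split()

# Second-order Markov model: count trigrams and their two-word contexts.
trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
contexts = Counter()
for (y1, y2, _), n in trigrams.items():
    contexts[(y1, y2)] += n

def p(y3, y2, y1):
    """Sample estimate of p(y3 | y2, y1) = count(y1, y2, y3) / count(y1, y2)."""
    c = contexts[(y1, y2)]
    return trigrams[(y1, y2, y3)] / c if c else 0.0

print(p("sat", "cat", "the"))  # -> 0.5 ("the cat" is followed once by "sat", once by "lay")
```

No parameters are learned here; the estimate is only as good as the counts, which is exactly what breaks down for large vocabularies and higher orders.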
-
An Nth-order Markov model will only give you an estimate of p(y_i | y_(i-1), …, y_(i-N)). The RNN described in the lecture (ignoring the vanishing gradient issue for now) can give you p(y_i | y_(i-1), …, y_1), since its hidden state passes information from all previous elements forward.
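And to make the second point concrete, here is a minimal numpy sketch of one RNN step (untrained weights, made-up sizes, roughly following the a/Waa/Wax/Wya notation from the course): the only thing carried forward is the hidden state a_t, so the distribution over the next word is conditioned on the whole prefix rather than a fixed window of N words.

```python
import numpy as np

vocab_size, hidden_size = 10_000, 128
rng = np.random.default_rng(0)

# Untrained parameters, just to show the shape of the computation.
Wax = rng.normal(scale=0.01, size=(hidden_size, vocab_size))
Waa = rng.normal(scale=0.01, size=(hidden_size, hidden_size))
Wya = rng.normal(scale=0.01, size=(vocab_size, hidden_size))

def rnn_step(a_prev, x_t):
    """One step: a_t summarizes x_1..x_t, y_hat estimates p(y_t | y_(t-1), ..., y_1)."""
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t)
    logits = Wya @ a_t
    y_hat = np.exp(logits - logits.max())   # softmax over the vocabulary
    return a_t, y_hat / y_hat.sum()

# Feed a short sequence of one-hot word vectors; a_t is the only "memory",
# so nothing limits the conditioning context to the last N words.
a_t = np.zeros(hidden_size)
for word_id in [17, 4382, 901]:              # arbitrary word indices
    x_t = np.zeros(vocab_size)
    x_t[word_id] = 1.0
    a_t, y_hat = rnn_step(a_t, x_t)

print(y_hat.shape)  # (10000,) distribution over the next word
```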
-
Combining the two points above, you could argue that the RNN uses the same conditioning structure as a (very) high-order Markov chain, the difference being that it uses a neural network to estimate the probabilities rather than sample probabilities.
What do you think of this comparison? Did I miss something or do you agree? Curious to hear your thoughts!