In the video "Language Model and Sequence Generation", Andrew Ng describes that the output y3 is trying to estimate p(y3 | y2, y1). This seems very similar to the more traditional Markov chain. I'm curious about the similarities and differences between those approaches. Here is my first take on it:

A second-order Markov model will also give you p(y3 | y2, y1). The difference is that the Markov model estimates the sample probability directly from the data, whereas the RNN trains a neural network to predict this probability. Also, given the high dimensionality of x/y (10,000 was mentioned for the NLP scenario), I guess it would be relatively tricky to estimate the probabilities of all the different combinations using a higher-order Markov model.
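
To make concrete what I mean by "estimating the sample probability from the data", here is a minimal sketch (the toy corpus and function names are my own, not from the lecture) of a second-order Markov model built purely from counts:

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()

# Count how often each token follows each (y1, y2) context.
trigram_counts = defaultdict(Counter)
for y1, y2, y3 in zip(corpus, corpus[1:], corpus[2:]):
    trigram_counts[(y1, y2)][y3] += 1

def p_markov(y3, y1, y2):
    """Estimate p(y3 | y1, y2) as a ratio of sample counts."""
    context = trigram_counts[(y1, y2)]
    total = sum(context.values())
    return context[y3] / total if total else 0.0

print(p_markov("sat", "the", "cat"))  # count("the cat sat") / count("the cat *") = 0.5
```

With a 10,000-word vocabulary there are up to 10,000^2 such contexts, most of which will never appear in the data, which is the sparsity problem I was gesturing at above.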

An Nth-order Markov model will only give you an estimate of p(y_i | y_(i-1), …, y_(i-N)). The RNN described in the lecture (ignoring the vanishing-gradient issue for now) can give you p(y_i | y_(i-1), …, y_1), since it passes information through from all previous elements.
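
Here is a minimal numpy sketch (random untrained weights and toy dimensions of my own choosing, not the lecture's setup) of why the RNN can condition on the whole prefix: the hidden state is updated by every earlier token before the softmax produces the next-token distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 10, 16                 # toy sizes, not the 10,000 from the lecture
Wx = rng.normal(size=(hidden, vocab)) * 0.1
Wh = rng.normal(size=(hidden, hidden)) * 0.1
Wy = rng.normal(size=(vocab, hidden)) * 0.1

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def next_token_probs(tokens):
    """Return p(y_i | y_(i-1), ..., y_1): one forward pass over the full prefix."""
    h = np.zeros(hidden)
    for t in tokens:                   # every earlier token updates the hidden state
        x = np.eye(vocab)[t]           # one-hot encoding of the current token
        h = np.tanh(Wx @ x + Wh @ h)   # h now depends on all tokens seen so far
    return softmax(Wy @ h)

print(next_token_probs([3, 1, 4]))     # distribution over the whole vocabulary
```

Unlike the count table above, nothing here grows with the order of the dependency: the same fixed-size weights are reused at every step, and the history is compressed into h.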

Combining the two points above, you could argue the RNN is actually using the same structure as a higher-order Markov chain, the difference being that it uses a neural network to estimate the probabilities rather than sample probabilities.
What do you think of this comparison? Did I miss something or do you agree? Curious to hear your thoughts!