RNN sample generation vs. Markov chains

In the video "Language Model and Sequence Generation", Andrew Ng explains that the output ŷ(3) is trying to estimate p(y3 | y2, y1). This seems very similar to a more traditional Markov chain. I'm curious about the similarities and differences between those approaches. Here is my first take on it:

  • A second-order Markov model will also provide p(y3 | y2, y1). The difference is that the Markov model estimates this probability directly from sample frequencies in the data, whereas the RNN trains a neural network to predict it. Also, given the high dimensionality of x/y (10,000 was mentioned for the NLP scenario), I guess it would be relatively tricky to estimate the probabilities of all the different combinations using a higher-order Markov model.

  • An Nth-order Markov model will only give you an estimate of p(y_i | y_(i-1), …, y_(i-N)). The RNN described in the lecture (ignoring the vanishing gradient issue for now) can give you p(y_i | y_(i-1), …, y_1), since it passes information through from all previous elements.

  • Combining the two points above, you could argue the RNN is actually using the same structure as a higher-order Markov chain, the difference being that it uses a neural network to estimate the probabilities rather than sample probabilities (see the sketch after this list).
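
To make the counts-versus-parameters contrast concrete, here is a minimal Python sketch. The toy corpus and the function `markov_prob` are made up for illustration: the second-order Markov estimate is literally a normalized co-occurrence count, whereas an RNN would produce the same conditional as one entry of a softmax over the vocabulary, computed by a trained network.

```python
# Minimal sketch with a made-up toy corpus: a second-order Markov model
# estimates p(y3 | y2, y1) directly from sample counts.
from collections import Counter, defaultdict

corpus = ["the cat sat", "the cat ran", "the dog sat"]

# Count how often each word follows each (y1, y2) context.
context_counts = defaultdict(Counter)
for sentence in corpus:
    y1, y2, y3 = sentence.split()
    context_counts[(y1, y2)][y3] += 1

def markov_prob(y1, y2, y3):
    """Sample-based estimate of p(y3 | y2, y1)."""
    counts = context_counts[(y1, y2)]
    total = sum(counts.values())
    return counts[y3] / total if total else 0.0

print(markov_prob("the", "cat", "sat"))  # 0.5: estimated purely from counts

# An RNN would instead output this conditional as one entry of a softmax
# over the vocabulary, computed by a trained network whose hidden state
# summarizes *all* previous tokens, not just the last two.
```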

What do you think of this comparison? Did I miss something or do you agree? Curious to hear your thoughts!


Hi @RvdV, thanks for your question and welcome to the community. You touched upon an interesting question, and I think your observation is correct.

In principle, Markov chains and neural networks are completely different. A Markov chain is a mathematical system that undergoes transitions from one state to another, where the sequence of states is characterized by Markovian state transition probabilities, i.e. the probability of occupying the next state depends only on the current state (and potentially the previous m states). A neural network is just a (potentially nonlinear) parametric function that maps a d-dimensional input space to a k-dimensional output space.
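
As a rough illustration of that structural difference, here is a minimal sketch with a made-up 3-state transition matrix and a randomly initialised one-layer network (nothing from the lecture):

```python
# Minimal sketch: a Markov chain is fully specified by its state-transition
# probabilities, while a neural network is a parametric map from R^d to R^k.
import numpy as np

# Markov chain: row i gives p(next state | current state = i) (made-up values).
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])

state = 0
for _ in range(5):
    # The next state depends only on the current state (the Markov property).
    state = np.random.choice(3, p=P[state])

# Neural network: here a single layer mapping a 4-dim input to 3 outputs.
W, b = np.random.randn(3, 4), np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = np.random.randn(4)
y = softmax(W @ x + b)  # k-dimensional output, e.g. a distribution over states
```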

But there is a relationship. You can try to learn the state transition probability matrix of a Markov chain (e.g. of Hidden Markov Models) by estimating the weights of a multi-layer neural network. Research has been done in this area, as well as on integrating concepts of Markov chains into neural networks themselves. Note that in addition to the state transition probabilities, a neural network is capable of learning many other parameters.
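
For instance (a hypothetical sketch, not something from the course), you could fit the transition matrix of a small chain by treating each row as the softmax of learnable logits and training them with cross-entropy on observed (current state, next state) pairs:

```python
# Hypothetical sketch: recover a 3-state transition matrix by gradient
# descent on a softmax parameterization, using observed transitions.
import numpy as np

rng = np.random.default_rng(0)
num_states = 3
true_P = np.array([[0.7, 0.2, 0.1],     # "unknown" chain that generated the data
                   [0.1, 0.8, 0.1],
                   [0.3, 0.3, 0.4]])

current = rng.integers(num_states, size=500)
nxt = np.array([rng.choice(num_states, p=true_P[s]) for s in current])

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

logits = np.zeros((num_states, num_states))  # learnable "weights"
lr = 1.0
for _ in range(300):
    probs = softmax(logits)                  # current estimate of the matrix
    grad = np.zeros_like(logits)
    for s, t in zip(current, nxt):
        grad[s] += probs[s]
        grad[s, t] -= 1.0                    # cross-entropy gradient per sample
    logits -= lr * grad / len(current)

print(np.round(softmax(logits), 2))          # close to true_P, learned from data
```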

Of the neural network architectures that we have seen in the course, I agree that the RNN looks most similar in structure to the Markov chain.