GRU & LSTM: why not simply use skip connections?

Why do we need GRU & LSTM when we already have skip connections to handle the problem? That is, if the network is too deep, one can add a skip connection from each node to nodes further along in the network (say, every 10 levels).


I know it’s been some time since this was posed, but I have the same question! They seem to serve a similar function. My only guess is that a skip connection, one, does not seem to have a temporal component, and two, isn’t really toggled on and off. LSTM/GRU units seem a bit more dynamic from one step to the next. Functionally they appear pretty similar… but maybe that’s the origin of the distinct naming?
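
To make my guess a bit more concrete, here is a rough sketch of the difference as I understand it. It is not a full GRU: the reset gate is left out, biases are dropped, and the names (skip_step, gated_step, W, U, Wz, Uz) are just ones I made up for illustration. The point is only that the skip path always passes the old state through with a fixed weight of 1, while the gated update computes a mixing weight z from the current input and state at every step.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def skip_step(h, x, W, U):
    # Plain additive skip connection: the old state h is always carried
    # forward unchanged and a new term is added on top of it.
    return h + np.tanh(W @ x + U @ h)

def gated_step(h, x, W, U, Wz, Uz):
    # Simplified GRU-style update (assumption: no reset gate, no biases).
    # z is recomputed at every step from x and h, so how much of the old
    # state survives can change from one step to the next.
    z = sigmoid(Wz @ x + Uz @ h)
    candidate = np.tanh(W @ x + U @ h)
    h_new = (1.0 - z) * h + z * candidate
    return h_new, z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 3
    W, U, Wz, Uz = (rng.normal(size=(n, n)) for _ in range(4))
    h = np.zeros(n)
    for t in range(3):
        x = rng.normal(size=n)
        h, z = gated_step(h, x, W, U, Wz, Uz)
        print(f"step {t}: gate z = {np.round(z, 3)}")
```

Running it prints a different z at each step, which is the "toggling" I was trying to describe; in skip_step there is nothing comparable to inspect, because the weight on the old state never changes.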