LSTM architecture

I noticed that this interesting question has not been answered. You may no longer need such a late answer :slight_smile:, but I will summarize my answer here in case someone finds this thread through a search.

Here is the standard diagram used to describe the LSTM cell.

The question is about the update gate in the center of this picture. It is true that the roles of the two paths are not obvious, and it is natural to ask why sigmoid and tanh are applied separately when the two results are merged right afterwards. In fact, each has an important role.

I personally do not like this “standard” picture. :slight_smile: Both the cell state and the hidden state live inside the cell, but they appear to come from outside it. The key input and output are X and Y, but the picture does not emphasize them. And so on.

So here is my version, which focuses on the main flow from X to Y.

In the LSTM cell, we have a “Cell State” and a “Hidden State”, both of which are updated when x^{<t>} is received. The main path is easy to follow: it runs from bottom to top. I think this also illustrates the roles of the three gates clearly.

The original question was: “My question relates to the update gate in the LSTM cell, or more precisely: why do we need the update gate at all?”

The main path takes x^{<t>} and calculates \hat{y}^{<t>} using the cell state c^{<t-1>} and the hidden state a^{<t-1>}. In parallel, the cell state and the hidden state are updated to c^{<t>} and a^{<t>}.
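
For reference, this update is commonly written as the equations below. The symbols \Gamma_u, \Gamma_f, \Gamma_o (update, forget and output gates), \tilde{c}^{<t>} (candidate values) and the weight names W_*, b_* are my notation for this sketch and do not appear in the pictures; \odot is the Hadamard (element-wise) product.

$$
\begin{aligned}
\tilde{c}^{<t>} &= \tanh\left(W_c\,[a^{<t-1>}, x^{<t>}] + b_c\right) \\
\Gamma_u &= \sigma\left(W_u\,[a^{<t-1>}, x^{<t>}] + b_u\right) \\
\Gamma_f &= \sigma\left(W_f\,[a^{<t-1>}, x^{<t>}] + b_f\right) \\
\Gamma_o &= \sigma\left(W_o\,[a^{<t-1>}, x^{<t>}] + b_o\right) \\
c^{<t>} &= \Gamma_u \odot \tilde{c}^{<t>} + \Gamma_f \odot c^{<t-1>} \\
a^{<t>} &= \Gamma_o \odot \tanh\left(c^{<t>}\right)
\end{aligned}
$$

\hat{y}^{<t>} is then computed from a^{<t>} by whatever output layer the model uses.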

The input to this cell is the concatenation of a^{<t-1>} and x^{<t>}. The same vector is also used to create the three gate variables, each with its own weights.
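
As a rough illustration of that step, here is a minimal sketch in plain Python. The sizes, the random weights and the helper names (`sigmoid`, `affine`, `random_matrix`) are all assumptions made up for this example; in a real model the weights are learned. The gate names "forget" and "output" are the usual ones for the other two gates.

```python
import math
import random


def sigmoid(z):
    """Logistic function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, b, v):
    """Compute W @ v + b, where W is given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + b_i for row, b_i in zip(W, b)]


def random_matrix(rows, cols, rng):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_a, n_x = 3, 2                    # hidden size and input size (arbitrary)

a_prev = [0.1, -0.2, 0.3]          # a^{<t-1>}
x_t = [1.0, 0.5]                   # x^{<t>}
concat = a_prev + x_t              # [a^{<t-1>}, x^{<t>}], length n_a + n_x

# One weight matrix per gate, all applied to the same concatenated vector.
W_u, W_f, W_o = (random_matrix(n_a, n_a + n_x, rng) for _ in range(3))
b_u = b_f = b_o = [0.0] * n_a

gate_u = [sigmoid(z) for z in affine(W_u, b_u, concat)]  # update gate
gate_f = [sigmoid(z) for z in affine(W_f, b_f, concat)]  # forget gate
gate_o = [sigmoid(z) for z in affine(W_o, b_o, concat)]  # output gate

print(gate_u, gate_f, gate_o)
```

The three gates see exactly the same input; only the weights differ, so each one can learn its own mask.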

The role of “tanh” in the main path is to transform the input vector, through trainable weights, into candidate values that the following steps can use. Its output lies in the range (-1, 1). Then the “update gate” plays the key role. It is created from the same vector [a^{<t-1>}, x^{<t>}], but with different weights, so that it can focus on the mask, i.e., which elements should be passed and which should be dropped. The value of a gate variable therefore lies in the range (0, 1), and it is created by sigmoid.

Then, the output of tanh is filtered by the output of sigmoid using the Hadamard (element-wise) product.
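
A tiny numeric sketch of this filtering, with hand-picked pre-activation values (hypothetical numbers chosen only to make the effect visible; in a real cell they come from the weighted sums of the concatenated input):

```python
import math


def sigmoid(z):
    """Logistic function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical pre-activations for three elements.
z_candidate = [2.0, -2.0, 0.5]     # feeds the tanh path (content)
z_gate = [6.0, -6.0, 0.0]          # feeds the sigmoid path (mask)

candidate = [math.tanh(z) for z in z_candidate]   # each value in (-1, 1)
gate = [sigmoid(z) for z in z_gate]               # each value in (0, 1)

# Hadamard product: multiply element by element.
filtered = [g * c for g, c in zip(gate, candidate)]

print(candidate)
print(gate)
print(filtered)
```

The first element passes almost unchanged because its gate is close to 1, the second is almost completely dropped because its gate is close to 0, and the third is halved because sigmoid(0) is 0.5. The tanh path decides what the content is, including its sign; the sigmoid path only decides how much of it gets through.
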
Consider the “peephole” connection, which is one variation (enhancement) of the LSTM. In addition to [a^{<t-1>}, x^{<t>}], c^{<t-1>} is concatenated as an input to the three gates. With this, we can say that the gate variables reflect the “hidden state”, the “cell state” and the “input data” in order to select the best information from the “hidden state and new input”.
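
Here is a minimal sketch of the peephole idea as described above, for one gate only. It follows the simplified "concatenate c^{<t-1>}" description; the vectors, the weights and the helper `gate` are hypothetical values for illustration, and published peephole formulations differ in details.

```python
import math


def sigmoid(z):
    """Logistic function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def gate(W, b, v):
    """Gate variable: sigmoid(W @ v + b), with W given as a list of rows."""
    return [sigmoid(sum(w * x for w, x in zip(row, v)) + b_i)
            for row, b_i in zip(W, b)]


a_prev = [0.1, -0.2]               # a^{<t-1>}
x_t = [1.0]                        # x^{<t>}
c_prev = [0.7, -0.4]               # c^{<t-1>}

plain_input = a_prev + x_t                 # standard gate input
peephole_input = a_prev + x_t + c_prev     # peephole: cell state is appended

# Hypothetical weights. The peephole matrix reuses the plain columns and
# adds two extra columns that read c^{<t-1>}.
W_plain = [[0.5, -0.3, 0.8],
           [0.1, 0.4, -0.6]]
W_peep = [[0.5, -0.3, 0.8, 1.5, 0.0],
          [0.1, 0.4, -0.6, 0.0, 1.5]]
b = [0.0, 0.0]

gate_plain = gate(W_plain, b, plain_input)
gate_peep = gate(W_peep, b, peephole_input)

print(gate_plain)
print(gate_peep)
```

With these weights, a positive cell-state element opens the corresponding gate element further and a negative one closes it, so the mask now depends on the cell state as well as on the hidden state and the new input.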

In short, to answer your question: the tanh path and the sigmoid path use the same input but different weights. The tanh path creates the “cell activities” that flow along the main path, and the sigmoid path separately creates the “gate” that filters their elements.

Hope this clarifies things.