A general (feed-forward) neural network can only use independent inputs; it has no "memory" of what came before.
The first approach is to add a recurrent connection to the network, so the output at each step still carries information from earlier inputs. This is called a Recurrent Neural Network (RNN), but it often suffers from vanishing gradients.
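As a rough illustration, one step of such a recurrence might look like the sketch below (NumPy, with illustrative names `W_xh`, `W_hh`, `b_h` that are not from the original text):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One vanilla RNN step: the new hidden state mixes the current
    input with the previous hidden state, which acts as the "memory"."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)
```

Because the same `W_hh` is multiplied in at every step, gradients flowing back through many steps tend to shrink (or blow up), which is the vanishing-gradient problem mentioned below.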
But for an RNN, learning long-term dependencies with gradient descent is hard: http://ai.dinfo.unifi.it/paolo//ps/tnn-94-gradient.pdf
The LSTM is therefore designed as a special kind of RNN with three gates that control the cell state: the input gate, the forget gate, and the output gate.
Here $C_{t-1}$ and $C_t$ denote the cell state, which lets information pass through the chain almost unchanged.
Forget Gate
- $h_{t-1}$ is the output of the previous cell (the whole green box above).
- $x_t$ is the input of the current cell.
- $\sigma$ is the sigmoid function: its result is between 0 and 1, where 0 means forget everything and 1 means keep everything.
- $f_t$ is the result of the sigmoid function, $f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$ (sketched in code below).
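A minimal sketch of the forget gate under these definitions (NumPy; the weight names `W_f`, `b_f` follow the standard LSTM formulation rather than anything specific in this post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_gate(h_prev, x_t, W_f, b_f):
    """f_t has one value in (0, 1) per cell-state element:
    ~0 means forget that element, ~1 means keep it."""
    concat = np.concatenate([h_prev, x_t])  # [h_{t-1}, x_t]
    return sigmoid(W_f @ concat + b_f)
```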
Input Gate
A sigmoid layer decides which information needs to be updated. $i_t$ is the result of this sigmoid, so we can use $i_t$ as the scaling value that says how much we decided to update each state value. Similarly, $f_t$ is how much we want to keep from the previous state.
A tanh layer produces the candidate values $\tilde{C}_t$ that are used to update the old information.
- $C_t$ is the cell information, or cell state; $C_{t-1}$ has been updated to $C_t$.
- Then we use $f_t * C_{t-1} + i_t * \tilde{C}_t$ as our new $C_t$ (see the sketch after this list).
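Putting the input gate, the candidate values, and the cell-state update together (same sketch style and illustrative names as above, following the standard LSTM equations):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def update_cell_state(h_prev, x_t, C_prev, f_t, W_i, b_i, W_C, b_C):
    concat = np.concatenate([h_prev, x_t])
    i_t = sigmoid(W_i @ concat + b_i)      # how much to update each value
    C_tilde = np.tanh(W_C @ concat + b_C)  # candidate new values, in (-1, 1)
    C_t = f_t * C_prev + i_t * C_tilde     # keep part of the old state, add part of the new
    return C_t
```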
Output Gate
Based on our cell state, we compute the output $h_t$, which is also the $h_{t-1}$ for the next cell.
- $o_t$ is calculated by the sigmoid layer, $o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$.
- The output $h_t$ is $o_t$ times the cell state passed through a tanh layer (scaled to between -1 and 1): $h_t = o_t * \tanh(C_t)$. A sketch follows the list.
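The output gate in the same sketch style (illustrative names `W_o`, `b_o`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_gate(h_prev, x_t, C_t, W_o, b_o):
    """h_t is this cell's output and the h_{t-1} fed into the next cell."""
    concat = np.concatenate([h_prev, x_t])
    o_t = sigmoid(W_o @ concat + b_o)  # which parts of the cell state to expose
    h_t = o_t * np.tanh(C_t)           # tanh squashes the cell state to (-1, 1)
    return h_t
```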
LSTM Variants
The Gated Recurrent Unit (GRU) is a popular variant with only two gates: a reset gate and an update gate.
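A minimal GRU step, assuming the standard GRU equations (illustrative names; not from the original post):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x_t, W_z, b_z, W_r, b_r, W_h, b_h):
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ concat + b_z)           # update gate
    r_t = sigmoid(W_r @ concat + b_r)           # reset gate
    concat_r = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(W_h @ concat_r + b_h)     # candidate hidden state
    return (1 - z_t) * h_prev + z_t * h_tilde   # new hidden state (also the output)
```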