Long Short-Term Memory (LSTM)

This interactive demo shows how LSTM networks process sequential data using a sophisticated gating mechanism. Unlike simple RNNs, LSTMs use gates to control information flow, allowing them to learn long-term dependencies while mitigating the vanishing-gradient problem.

You'll explore how the three gates (forget, input, output) and the input node work together to maintain both the hidden state and the cell state, enabling the network to selectively remember and forget information.

Why Normalization Matters:
LSTM gates rely on sigmoid ($\sigma$) and tanh activations, which are bounded to [0,1] and [-1,1] respectively. When input values become very large (positive or negative), these activations saturate: sigmoid sticks at 0 or 1, and tanh sticks at -1 or 1. In these saturated regions gradients vanish, preventing the network from learning. Normalizing input sequences to a bounded range such as [-1,1] keeps pre-activations in the responsive region, where outputs can vary smoothly and gradients flow properly. This is critical for training LSTMs effectively.
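The saturation effect is easy to check numerically. The sketch below (illustrative values, not part of the demo's code) prints the sigmoid and its derivative at a few points, then rescales a raw sequence into [-1, 1] the way a normalization step might:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# Near z = 0 the gradient is large; as |z| grows it vanishes.
for z in [0.0, 2.0, 10.0, 50.0]:
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.6f}  gradient={sigmoid_grad(z):.6f}")

# Min-max scaling a raw sequence (here the Fibonacci start) into [-1, 1]
# keeps pre-activations in the responsive region:
raw = [1, 1, 2, 3, 5, 8, 13, 21]
lo, hi = min(raw), max(raw)
scaled = [2 * (v - lo) / (hi - lo) - 1 for v in raw]
print(scaled)
```

At z = 50 the gradient underflows to essentially zero, which is exactly the regime normalization is meant to avoid.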

Understanding Parameter Effects:
Each gate has separate weights for current input ($W_{fx}$, $W_{ix}$, etc.) and previous hidden state ($W_{fh}$, $W_{ih}$, etc.), plus biases ($b_{fx}$, $b_{fh}$, etc.). This separation gives fine-grained control: for example, a high $W_{fx}$ means the forget gate responds strongly when the current input $x$ is large (remembering more for large inputs), while very negative values for both $b_{ih}$ and $b_{ix}$ bias the input gate to stay nearly closed regardless of inputs, effectively ignoring new information. Positive forget gate biases make the network "remember by default". These interpretable parameters let you understand and control how the LSTM processes sequences.
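These effects can be verified directly from the gate equation. A minimal sketch (the parameter values below are illustrative assumptions, not demo defaults) computes the forget-gate activation for a few settings:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forget gate: f_t = sigmoid(W_fx * x + b_fx + W_fh * h + b_fh)
def forget_gate(x, h, W_fx, b_fx, W_fh, b_fh):
    return sigmoid(W_fx * x + b_fx + W_fh * h + b_fh)

# "Remember by default": positive biases hold the gate open even at x = h = 0.
print(forget_gate(0.0, 0.0, W_fx=0.0, b_fx=2.0, W_fh=0.0, b_fh=2.0))  # ~0.98

# Very negative input-gate biases keep that gate nearly closed: with
# b_ix = b_ih = -4, even x = 0.5 barely opens it.
print(sigmoid(1.0 * 0.5 + (-4.0) + (-4.0)))  # ~0.0006
```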

Educational Context:
This demo uses a simplified scalar LSTM (single hidden dimension, 16 total parameters) and optimizes to intentionally overfit on a single sequence. This pedagogical approach makes parameter effects clearly visible - in real applications, you would use train/validation/test splits and vector-valued hidden states with many more parameters. The "Optimise" button finds parameters that minimize prediction error on the chosen sequence.

LSTM Architecture Components:

Cell State: $c_t$ - Long-term memory that flows through the network
Hidden State: $h_t$ - Short-term memory used for output
Forget Gate: $f_t$ - Decides what to discard from cell state
Input Gate: $i_t$ - Decides what new information to store
Input Node: $\tilde{c}_t$ - Creates candidate values to add to cell state
Output Gate: $o_t$ - Decides what to output based on cell state

LSTM Formulas (with separate input/hidden weights and biases):
$f_t = \sigma(W_{fx} x_t + b_{fx} + W_{fh} h_{t-1} + b_{fh})$
$i_t = \sigma(W_{ix} x_t + b_{ix} + W_{ih} h_{t-1} + b_{ih})$
$\tilde{c}_t = \tanh(W_{cx} x_t + b_{cx} + W_{ch} h_{t-1} + b_{ch})$
$o_t = \sigma(W_{ox} x_t + b_{ox} + W_{oh} h_{t-1} + b_{oh})$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$

where $\sigma$ is sigmoid, $\odot$ is element-wise multiplication, and each gate has separate parameters for input ($x$) and hidden state ($h$) transformations.
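The formulas above map directly onto a scalar implementation. The sketch below assumes the demo's 16-parameter naming convention ($W_{fx}$, $b_{fx}$, etc.); the parameter values are illustrative, not the demo's defaults:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p is a dict keyed like the demo's parameters."""
    f = sigmoid(p["W_fx"] * x + p["b_fx"] + p["W_fh"] * h_prev + p["b_fh"])
    i = sigmoid(p["W_ix"] * x + p["b_ix"] + p["W_ih"] * h_prev + p["b_ih"])
    c_tilde = math.tanh(p["W_cx"] * x + p["b_cx"] + p["W_ch"] * h_prev + p["b_ch"])
    o = sigmoid(p["W_ox"] * x + p["b_ox"] + p["W_oh"] * h_prev + p["b_oh"])
    c = f * c_prev + i * c_tilde      # additive cell-state update
    h = o * math.tanh(c)              # filtered, bounded output
    return h, c

# All 16 parameters, set to an arbitrary illustrative value.
params = {k: 0.5 for k in
          ["W_fx", "b_fx", "W_fh", "b_fh", "W_ix", "b_ix", "W_ih", "b_ih",
           "W_cx", "b_cx", "W_ch", "b_ch", "W_ox", "b_ox", "W_oh", "b_oh"]}

h, c = 0.0, 0.0
for x in [-1.0, -0.9, -0.8]:          # a normalized input sequence
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

Because $o_t < 1$ and $\tanh$ is bounded, the hidden state always stays in (-1, 1), while the cell state can grow through its additive updates.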

How to Use:

Select a sequence from the dropdown to see different patterns
Adjust gate weights to control how the LSTM processes information
Observe how the input node creates candidate values
Use "Step Forward" to manually process each timestep
Watch gate values displayed inside the LSTM cell
Use "Optimise" to find weights that predict the next sequence value

Sequence Types:
Fibonacci: Each number is the sum of the previous two
Linear: Simple counting sequence
Alternating: Pattern that switches between values
Exponential: Powers of 2 sequence
Sine Pattern: Smooth oscillating values
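For reference, the five sequence types can be generated as below. This is a sketch of plausible generators; the demo's exact starting values, lengths, and amplitudes are assumptions:

```python
import math

def fibonacci(n):
    seq = [1, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

def linear(n):
    return list(range(1, n + 1))          # simple counting sequence

def alternating(n, a=1, b=-1):
    return [a if t % 2 == 0 else b for t in range(n)]

def exponential(n):
    return [2 ** t for t in range(n)]     # powers of 2

def sine_pattern(n, period=8):
    return [math.sin(2 * math.pi * t / period) for t in range(n)]

print(fibonacci(6))     # [1, 1, 2, 3, 5, 8]
print(exponential(5))   # [1, 2, 4, 8, 16]
```

Note that Fibonacci and exponential sequences grow without bound, which is why the normalization step discussed earlier matters most for them.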

Understanding LSTM Parameters:

Input Weights ($W_{fx}$, $W_{ix}$, $W_{cx}$, $W_{ox}$): Control how each gate responds to current input. Positive values amplify input effect, negative values suppress it.
Hidden Weights ($W_{fh}$, $W_{ih}$, $W_{ch}$, $W_{oh}$): Control how each gate responds to previous hidden state. Enable temporal dependencies.
Biases ($b_{fx}$, $b_{ix}$, etc.): Set "default behavior" independent of inputs. Positive biases open gates, negative biases close them.

Parameter Interpretation Examples:
High $W_{fx}$ with negative $b_{fh}$: Retain the cell state when input is large, forget it when input is small (the forget gate outputs near 1 to keep and near 0 to discard)
Negative $b_{ix}$ and $b_{ih}$: Input gate nearly always closed, ignoring new information
Positive $b_{fh}$: Forget gate biased open, remembers previous cell state by default
Large $W_{ch}$ with small $W_{cx}$: Input node responds more to temporal patterns than current input
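The second example above (a nearly closed input gate) can be checked numerically. A minimal sketch with assumed parameter values, holding the forget gate at 1 for clarity:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# With b_ix = b_ih = -5, the input gate stays nearly closed for any
# normalized input, so the cell state barely changes.
W_ix, W_ih = 1.0, 1.0
b_ix, b_ih = -5.0, -5.0

c = 0.7                                  # some stored cell state
for x, h in [(-1.0, 0.2), (0.0, 0.2), (1.0, 0.2)]:
    i = sigmoid(W_ix * x + b_ix + W_ih * h + b_ih)
    c_new = 1.0 * c + i * math.tanh(x)   # forget gate fixed at 1
    print(f"x={x:+.1f}  input gate={i:.5f}  c stays near {c_new:+.4f}")
```

Even at the most favorable input (x = 1), the gate opens to less than 0.001, so new information is effectively ignored.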

Gate Interactions:
Forget + Input Gates: Work together to update cell state. If forget=1 and input=0: perfect memory. If forget=0 and input=1: complete replacement.
Cell State vs Hidden State: Cell state ($c_t$) uses additive updates preventing gradient vanishing. Hidden state ($h_t$) is a filtered, bounded view for output.
Output Gate: Controls how much cell state information reaches the hidden state and final output.
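The two extremes of the forget/input interaction follow directly from the cell-state update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. A small sketch with arbitrary illustrative values:

```python
import math

def cell_update(f, i, c_prev, c_tilde):
    # c_t = f * c_prev + i * c_tilde (scalar case)
    return f * c_prev + i * c_tilde

c_prev, c_tilde = 0.9, -0.4

# forget = 1, input = 0: perfect memory, the old cell state passes through.
print(cell_update(1.0, 0.0, c_prev, c_tilde))   # 0.9

# forget = 0, input = 1: complete replacement by the candidate value.
print(cell_update(0.0, 1.0, c_prev, c_tilde))   # -0.4

# The output gate then decides how much of tanh(c_t) reaches h_t:
for o in (0.0, 0.5, 1.0):
    print(f"o={o:.1f}  h_t={o * math.tanh(c_prev):+.4f}")
```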

[Interactive panel: the input sequence ($x_t$, e.g. 1 1 2 3) flows through the LSTM cell to produce the output sequence ($h_t$). Controls include an Input Scaling setting plus separate Input and Hidden weight controls for the forget gate, input gate, input node, and output gate; gate values are displayed inside the cell. The Sequence Processing History table reports, per step: Input ($x_t$), Expected Next, Cell State ($c_t$), Hidden State ($h_t$), Error ($\Delta$), Expected $\Delta$ Output, and Actual $\Delta$ Output.]