Long Short-Term Memory (LSTM)

This interactive demo shows how LSTM networks process sequential data using a sophisticated gating mechanism. Unlike simple RNNs, LSTMs use gates to control information flow, allowing them to learn long-term dependencies while mitigating the vanishing-gradient problem.

You'll explore how the three gates (forget, input, output) and the input node work together to maintain both the hidden state and the cell state, enabling the network to selectively remember and forget information.

Why Normalization Matters:
LSTM gates rely on sigmoid ($\sigma$) and tanh activations, which are bounded to [0,1] and [-1,1] respectively. When input values become very large (positive or negative), these activations saturate: sigmoid sticks at 0 or 1, and tanh sticks at -1 or 1. In these saturated regions gradients vanish, preventing the network from learning. Normalizing input sequences to a bounded range such as [-1,1] keeps pre-activations in the responsive region, where outputs can vary smoothly and gradients flow properly. This is critical for training LSTMs effectively.
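The saturation effect is easy to check numerically. The sketch below (illustrative values, not part of the demo's code) prints the sigmoid and its derivative at a few points, then rescales a raw sequence into [-1, 1] the way a normalization step might:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

# Near z = 0 the gradient is large; as |z| grows it vanishes.
for z in [0.0, 2.0, 10.0, 50.0]:
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.6f}  gradient={sigmoid_grad(z):.6f}")

# Min-max scaling a raw sequence (here the Fibonacci start) into [-1, 1]
# keeps pre-activations in the responsive region:
raw = [1, 1, 2, 3, 5, 8, 13, 21]
lo, hi = min(raw), max(raw)
scaled = [2 * (v - lo) / (hi - lo) - 1 for v in raw]
print(scaled)
```

At z = 50 the gradient underflows to essentially zero, which is exactly the regime normalization is meant to avoid.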

Understanding Parameter Effects:
Each gate has separate weights for current input ($W_{fx}$, $W_{ix}$, etc.) and previous hidden state ($W_{fh}$, $W_{ih}$, etc.), plus biases ($b_{fx}$, $b_{fh}$, etc.). This separation gives fine-grained control: for example, a high $W_{fx}$ means the forget gate responds strongly when the current input $x$ is large (remembering more for large inputs), while very negative values for both $b_{ih}$ and $b_{ix}$ bias the input gate to stay nearly closed regardless of inputs, effectively ignoring new information. Positive forget gate biases make the network "remember by default". These interpretable parameters let you understand and control how the LSTM processes sequences.
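These effects can be verified directly from the gate equation. A minimal sketch (the parameter values below are illustrative assumptions, not demo defaults) computes the forget-gate activation for a few settings:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forget gate: f_t = sigmoid(W_fx * x + b_fx + W_fh * h + b_fh)
def forget_gate(x, h, W_fx, b_fx, W_fh, b_fh):
    return sigmoid(W_fx * x + b_fx + W_fh * h + b_fh)

# "Remember by default": positive biases hold the gate open even at x = h = 0.
print(forget_gate(0.0, 0.0, W_fx=0.0, b_fx=2.0, W_fh=0.0, b_fh=2.0))  # ~0.98

# Very negative input-gate biases keep that gate nearly closed: with
# b_ix = b_ih = -4, even x = 0.5 barely opens it.
print(sigmoid(1.0 * 0.5 + (-4.0) + (-4.0)))  # ~0.0006
```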

Educational Context:
This demo uses a simplified scalar LSTM (single hidden dimension, 16 total parameters) and optimizes to intentionally overfit on a single sequence. This pedagogical approach makes parameter effects clearly visible - in real applications, you would use train/validation/test splits and vector-valued hidden states with many more parameters. The "Optimise" button finds parameters that minimize prediction error on the chosen sequence.

LSTM Architecture Components:

Cell State: $c_t$ - Long-term memory that flows through the network
Hidden State: $h_t$ - Short-term memory used for output
Forget Gate: $f_t$ - Decides what to discard from cell state
Input Gate: $i_t$ - Decides what new information to store
Input Node: $\tilde{c}_t$ - Creates candidate values to add to cell state
Output Gate: $o_t$ - Decides what to output based on cell state

LSTM Formulas (with separate input/hidden weights and biases):
$f_t = \sigma(W_{fx} x_t + b_{fx} + W_{fh} h_{t-1} + b_{fh})$
$i_t = \sigma(W_{ix} x_t + b_{ix} + W_{ih} h_{t-1} + b_{ih})$
$\tilde{c}_t = \tanh(W_{cx} x_t + b_{cx} + W_{ch} h_{t-1} + b_{ch})$
$o_t = \sigma(W_{ox} x_t + b_{ox} + W_{oh} h_{t-1} + b_{oh})$
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$
$h_t = o_t \odot \tanh(c_t)$

where $\sigma$ is sigmoid, $\odot$ is element-wise multiplication, and each gate has separate parameters for input ($x$) and hidden state ($h$) transformations.
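The formulas above map directly onto a scalar implementation. The sketch below assumes the demo's 16-parameter naming convention ($W_{fx}$, $b_{fx}$, etc.); the parameter values are illustrative, not the demo's defaults:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One scalar LSTM step; p is a dict keyed like the demo's parameters."""
    f = sigmoid(p["W_fx"] * x + p["b_fx"] + p["W_fh"] * h_prev + p["b_fh"])
    i = sigmoid(p["W_ix"] * x + p["b_ix"] + p["W_ih"] * h_prev + p["b_ih"])
    c_tilde = math.tanh(p["W_cx"] * x + p["b_cx"] + p["W_ch"] * h_prev + p["b_ch"])
    o = sigmoid(p["W_ox"] * x + p["b_ox"] + p["W_oh"] * h_prev + p["b_oh"])
    c = f * c_prev + i * c_tilde      # additive cell-state update
    h = o * math.tanh(c)              # filtered, bounded output
    return h, c

# All 16 parameters, set to an arbitrary illustrative value.
params = {k: 0.5 for k in
          ["W_fx", "b_fx", "W_fh", "b_fh", "W_ix", "b_ix", "W_ih", "b_ih",
           "W_cx", "b_cx", "W_ch", "b_ch", "W_ox", "b_ox", "W_oh", "b_oh"]}

h, c = 0.0, 0.0
for x in [-1.0, -0.9, -0.8]:          # a normalized input sequence
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.1f}  h={h:+.4f}  c={c:+.4f}")
```

Because $o_t < 1$ and $\tanh$ is bounded, the hidden state always stays in (-1, 1), while the cell state can grow through its additive updates.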

How to Use:

Select a sequence from the dropdown to see different patterns
Adjust gate weights to control how the LSTM processes information
Observe how the input node creates candidate values
Use "Step Forward" to manually process each timestep
Watch gate values displayed inside the LSTM cell
Use "Optimise" to find weights that predict the next sequence value

Sequence Types:
Fibonacci: Each number is the sum of the previous two
Linear: Simple counting sequence
Alternating: Pattern that switches between values
Exponential: Powers of 2 sequence
Sine Pattern: Smooth oscillating values
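For reference, the five sequence types can be generated as below. This is a sketch of plausible generators; the demo's exact starting values, lengths, and amplitudes are assumptions:

```python
import math

def fibonacci(n):
    seq = [1, 1]
    while len(seq) < n:
        seq.append(seq[-1] + seq[-2])
    return seq[:n]

def linear(n):
    return list(range(1, n + 1))          # simple counting sequence

def alternating(n, a=1, b=-1):
    return [a if t % 2 == 0 else b for t in range(n)]

def exponential(n):
    return [2 ** t for t in range(n)]     # powers of 2

def sine_pattern(n, period=8):
    return [math.sin(2 * math.pi * t / period) for t in range(n)]

print(fibonacci(6))     # [1, 1, 2, 3, 5, 8]
print(exponential(5))   # [1, 2, 4, 8, 16]
```

Note that Fibonacci and exponential sequences grow without bound, which is why the normalization step discussed earlier matters most for them.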

Understanding LSTM Parameters:

Input Weights ($W_{fx}$, $W_{ix}$, $W_{cx}$, $W_{ox}$): Control how each gate responds to current input. Positive values amplify input effect, negative values suppress it.
Hidden Weights ($W_{fh}$, $W_{ih}$, $W_{ch}$, $W_{oh}$): Control how each gate responds to previous hidden state. Enable temporal dependencies.
Biases ($b_{fx}$, $b_{ix}$, etc.): Set "default behavior" independent of inputs. Positive biases open gates, negative biases close them.

Parameter Interpretation Examples:
High $W_{fx}$ with negative $b_{fh}$: Retain the cell state when input is large, forget it when input is small (the forget gate outputs near 1 to keep and near 0 to discard)
Negative $b_{ix}$ and $b_{ih}$: Input gate nearly always closed, ignoring new information
Positive $b_{fh}$: Forget gate biased open, remembers previous cell state by default
Large $W_{ch}$ with small $W_{cx}$: Input node responds more to temporal patterns than current input
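The second example above (a nearly closed input gate) can be checked numerically. A minimal sketch with assumed parameter values, holding the forget gate at 1 for clarity:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# With b_ix = b_ih = -5, the input gate stays nearly closed for any
# normalized input, so the cell state barely changes.
W_ix, W_ih = 1.0, 1.0
b_ix, b_ih = -5.0, -5.0

c = 0.7                                  # some stored cell state
for x, h in [(-1.0, 0.2), (0.0, 0.2), (1.0, 0.2)]:
    i = sigmoid(W_ix * x + b_ix + W_ih * h + b_ih)
    c_new = 1.0 * c + i * math.tanh(x)   # forget gate fixed at 1
    print(f"x={x:+.1f}  input gate={i:.5f}  c stays near {c_new:+.4f}")
```

Even at the most favorable input (x = 1), the gate opens to less than 0.001, so new information is effectively ignored.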

Gate Interactions:
Forget + Input Gates: Work together to update cell state. If forget=1 and input=0: perfect memory. If forget=0 and input=1: complete replacement.
Cell State vs Hidden State: Cell state ($c_t$) uses additive updates preventing gradient vanishing. Hidden state ($h_t$) is a filtered, bounded view for output.
Output Gate: Controls how much cell state information reaches the hidden state and final output.
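The two extremes of the forget/input interaction follow directly from the cell-state update $c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$. A small sketch with arbitrary illustrative values:

```python
import math

def cell_update(f, i, c_prev, c_tilde):
    # c_t = f * c_prev + i * c_tilde (scalar case)
    return f * c_prev + i * c_tilde

c_prev, c_tilde = 0.9, -0.4

# forget = 1, input = 0: perfect memory, the old cell state passes through.
print(cell_update(1.0, 0.0, c_prev, c_tilde))   # 0.9

# forget = 0, input = 1: complete replacement by the candidate value.
print(cell_update(0.0, 1.0, c_prev, c_tilde))   # -0.4

# The output gate then decides how much of tanh(c_t) reaches h_t:
for o in (0.0, 0.5, 1.0):
    print(f"o={o:.1f}  h_t={o * math.tanh(c_prev):+.4f}")
```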

[Interactive panel: the input sequence ($x_t$, e.g. 1 1 2 3) flows through the LSTM cell to produce the output sequence ($h_t$). Controls include an Input Scaling setting plus separate Input and Hidden weight controls for the forget gate, input gate, input node, and output gate; gate values are displayed inside the cell. The Sequence Processing History table reports, per step: Input ($x_t$), Expected Next, Cell State ($c_t$), Hidden State ($h_t$), Error ($\Delta$), Expected $\Delta$ Output, and Actual $\Delta$ Output.]