Recurrent Neural Networks (RNNs)

This interactive demo shows how Recurrent Neural Networks process sequential data step-by-step. RNNs have a "memory" that allows them to use information from previous timesteps when processing current inputs.

You'll explore how the hidden state evolves over time and see how different weights and activation functions affect the network's behavior on various sequence patterns.
RNN Architecture Components:

Hidden State: h_t - Memory that carries information from previous timesteps
Input Weight: W_input - How much the current input affects the hidden state
Hidden Weight: W_hidden - How much the previous hidden state influences the current one
Output Weight: W_output - Projects the hidden state into prediction space
Biases: Constant terms added to the computation
Activation Function: Non-linear transformation (tanh, ReLU, sigmoid)

RNN Formulas:
h_t = f(W_input · x_t + W_hidden · h_{t-1} + b_hidden)
y_t = W_output · h_t + b_output

Where f is the activation function, x_t is the input at time t, h_{t-1} is the previous hidden state, and y_t is the output prediction.

Note: The output layer separates internal memory (h_t) from task-specific predictions (y_t).
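To make the two formulas concrete, here is a minimal Python sketch of the forward pass with a single scalar hidden unit, as used in this demo. The function and parameter names (rnn_step, w_input, and so on) and the weight values are illustrative, not taken from the demo's source.

```python
import math

def rnn_step(x_t, h_prev, w_input, w_hidden, w_output, b_hidden, b_output, f=math.tanh):
    """One RNN timestep with scalar weights, mirroring the two formulas above."""
    h_t = f(w_input * x_t + w_hidden * h_prev + b_hidden)  # new hidden state (memory)
    y_t = w_output * h_t + b_output                        # prediction from the hidden state
    return h_t, y_t

def rnn_forward(xs, params, h0=0.0):
    """Process a whole sequence by carrying the hidden state forward."""
    h, outputs = h0, []
    for x in xs:
        h, y = rnn_step(x, h, **params)
        outputs.append(y)
    return outputs, h

# Example with illustrative weight values
params = {"w_input": 1.0, "w_hidden": 0.5, "w_output": 1.0, "b_hidden": 0.0, "b_output": 0.0}
ys, _ = rnn_forward([1, 1, 2, 3], params)
```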
How to Use:

Select a sequence to see different patterns (Fibonacci, linear, etc.)
Adjust weights to see how they affect hidden state and output evolution
Try different activation functions to understand their impact
Use "Step Forward" to manually process each timestep
Click "Train (BPTT)" to train using gradient-based backpropagation through time (the real algorithm)
Click "Random Search" for comparison baseline (tries random weights without gradients)
Toggle "Raw/Normalized Inputs" to scale sequences to [-1, 1] range

Sequence Types:
Fibonacci: Each number is the sum of the previous two (requires a richer state)
Linear: Simple counting sequence
Alternating: Pattern that switches between values
Exponential: Powers of 2 sequence
Sine Pattern: Smooth oscillating values (illustrative generators for all five patterns are sketched below)
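The generators below are illustrative reconstructions of these five patterns; the demo's exact lengths and starting values may differ.

```python
import math

def make_sequence(kind, n=8):
    """Generate one of the demo's sequence patterns (illustrative versions)."""
    if kind == "fibonacci":                     # each number is the sum of the previous two
        seq = [1, 1]
        while len(seq) < n:
            seq.append(seq[-1] + seq[-2])
        return seq[:n]
    if kind == "linear":                        # simple counting sequence: 1, 2, 3, ...
        return list(range(1, n + 1))
    if kind == "alternating":                   # switches between two values
        return [1 if i % 2 == 0 else -1 for i in range(n)]
    if kind == "exponential":                   # powers of 2: 1, 2, 4, 8, ...
        return [2 ** i for i in range(n)]
    if kind == "sine":                          # smooth oscillating values
        return [math.sin(i * math.pi / 4) for i in range(n)]
    raise ValueError(f"unknown sequence type: {kind}")
```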
Understanding RNN Behavior:

Input Weight (W_input): Higher values make the current input more influential
Hidden Weight (W_hidden): Controls memory retention. |W_hidden| > 1 can cause the hidden state to explode; |W_hidden| < 1 causes it to decay (see the sketch after this list)
Output Layer: Separates internal memory (h_t) from predictions (y_t). Essential for flexible modeling
Activation Functions:
- Tanh: Outputs between -1 and 1, standard for RNN hidden states
- ReLU: Only positive outputs, can cause exploding gradients in the recurrence
- Sigmoid: Outputs between 0 and 1, prone to vanishing gradients
- Linear: No saturation, but unstable without careful weight tuning
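The decay/explosion behavior of W_hidden and the saturating effect of bounded activations can be seen by iterating the recurrence with zero input, as in this sketch (names and values are illustrative):

```python
import math

activations = {
    "tanh":    math.tanh,
    "relu":    lambda z: max(0.0, z),
    "sigmoid": lambda z: 1.0 / (1.0 + math.exp(-z)),
    "linear":  lambda z: z,
}

def free_run(w_hidden, f, h0=1.0, steps=10):
    """With zero input, the recurrence reduces to h_t = f(w_hidden * h_{t-1})."""
    h, trace = h0, [h0]
    for _ in range(steps):
        h = f(w_hidden * h)
        trace.append(h)
    return trace

print(free_run(1.5, activations["linear"]))  # grows ~1.5x per step (explosion)
print(free_run(0.5, activations["linear"]))  # halves each step (decay toward 0)
print(free_run(1.5, activations["tanh"]))    # tanh saturates near a fixed point instead of exploding
```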

Training Insights:
BPTT vs Random Search: "Train (BPTT)" uses gradient descent (the real algorithm used in practice). "Random Search" tries random weight combinations as a baseline for comparison. BPTT is vastly more efficient and shows why gradient-based optimization revolutionized neural networks.
BPTT Algorithm: Backpropagation through time unfolds the network across all timesteps and computes gradients with the chain rule, then updates each weight via gradient descent: w ← w − lr · ∂L/∂w (see the sketch after this list).
Gradient Clipping: The demo clips gradients to [-5, 5] to prevent explosion, a critical technique for RNN training
Learning Rate: Controls step size during gradient descent. Too high causes instability, too low causes slow convergence
Normalization: Helps prevent activation saturation with bounded functions
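To make "Train (BPTT)" concrete, here is a hedged Python sketch of one gradient step for the scalar-hidden-unit RNN, assuming a mean-squared-error loss and tanh activation; the parameter names, clipping range, and learning rate are illustrative rather than taken from the demo's source.

```python
import math

def bptt_step(xs, targets, p, lr=0.01, clip=5.0):
    """One BPTT update for a scalar tanh RNN. p holds w_in, w_h, w_out, b_h, b_out."""
    T = len(xs)
    # Forward pass, storing hidden states for the backward pass
    hs, ys = [0.0], []                        # hs[0] is the initial hidden state
    for t in range(T):
        h = math.tanh(p["w_in"] * xs[t] + p["w_h"] * hs[-1] + p["b_h"])
        hs.append(h)
        ys.append(p["w_out"] * h + p["b_out"])
    loss = sum((y - tgt) ** 2 for y, tgt in zip(ys, targets)) / T

    # Backward pass: unroll the chain rule over all timesteps
    g = {k: 0.0 for k in p}
    dh_next = 0.0                             # gradient flowing back from step t+1 into h_t
    for t in reversed(range(T)):
        dy = 2.0 * (ys[t] - targets[t]) / T
        g["w_out"] += dy * hs[t + 1]
        g["b_out"] += dy
        dh = dy * p["w_out"] + dh_next
        da = dh * (1.0 - hs[t + 1] ** 2)      # tanh'(a) = 1 - tanh(a)^2
        g["w_in"] += da * xs[t]
        g["w_h"]  += da * hs[t]
        g["b_h"]  += da
        dh_next = da * p["w_h"]               # pass gradient back to the previous hidden state

    # Gradient clipping, then the update w = w - lr * gradient
    for k in p:
        g[k] = max(-clip, min(clip, g[k]))
        p[k] -= lr * g[k]
    grad_norm = math.sqrt(sum(v * v for v in g.values()))
    return loss, grad_norm

# Illustrative usage: learn to predict the next value of a short normalized sequence
p = {"w_in": 0.5, "w_h": 0.5, "w_out": 0.5, "b_h": 0.0, "b_out": 0.0}
for _ in range(200):
    loss, grad_norm = bptt_step([0.2, 0.4, 0.6, 0.8], [0.4, 0.6, 0.8, 1.0], p, lr=0.05)
```

Each call performs one forward pass, one backward pass, and one clipped gradient-descent update, which is why BPTT closes in on good weights far faster than trying random combinations.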

Limitations:
• This demo uses a single scalar hidden unit. Real RNNs use vector hidden states for richer memory
• Fibonacci requires memory of the two previous values - a single scalar h_t is insufficient (see the sketch below)
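As an illustration of why a richer state fixes this, a 2-dimensional hidden state with a linear recurrence can carry both previous Fibonacci values exactly (a standalone sketch, not part of the demo):

```python
# h_t = A · h_{t-1} with the "Fibonacci matrix" A keeps (F_t, F_{t-1}) in the hidden state.
A = [[1, 1],
     [1, 0]]

def step(h):
    return [A[0][0] * h[0] + A[0][1] * h[1],
            A[1][0] * h[0] + A[1][1] * h[1]]

h = [1, 1]                    # (F_2, F_1) = (1, 1)
seq = [h[1], h[0]]
for _ in range(6):
    h = step(h)
    seq.append(h[0])
print(seq)                    # [1, 1, 2, 3, 5, 8, 13, 21]
```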

[Interactive demo layout: an Input Sequence (x_t) panel, the RNN Cell (displaying Loss and Grad Norm), an Output Sequence (y_t) panel, controls for the activation function, input scaling, learning rate, weights, and biases, and a Sequence Processing History table showing, per step, the input x_t, the expected next value, the actual output y_t, the error Δ, the expected Δ output, and the actual Δ output.]