Comparing Model-Free RL Algorithms

This demonstration compares three fundamental reinforcement learning algorithms on the CartPole balancing task:

  • Monte Carlo (MC): Updates Q-values at episode end using the full episode return
  • SARSA: On-policy temporal difference learning using the actual next action taken
  • Q-learning: Off-policy temporal difference learning using the maximum Q-value over next actions

All three agents use tabular Q-tables with discretized states. You can switch between training checkpoints to see how each algorithm learns over time.

Note: Unlike the CartPole demo, episodes here terminate immediately when the pole angle or cart position exceeds the bounds, matching the standard Gymnasium environment behavior.

Key Concepts

  • On-policy vs Off-policy: SARSA updates toward the action it actually takes next, while Q-learning updates toward the highest-valued next action regardless of which action is taken
  • Temporal Difference: TD methods (SARSA, Q-learning) update after each step, while MC waits until episode end
  • Exploration vs Exploitation: All algorithms use epsilon-greedy policies with decaying exploration
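All three agents share the same action-selection rule. A minimal epsilon-greedy sketch in Python (the generator seed and Q-row values here are illustrative, not taken from the demo):

```python
import numpy as np

def epsilon_greedy(q_row, epsilon, rng):
    """Explore with probability epsilon, otherwise act greedily on this state's Q-row."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # uniform random action
    return int(np.argmax(q_row))               # greedy action

rng = np.random.default_rng(0)
q_row = np.array([0.1, 0.9])
print(epsilon_greedy(q_row, 0.0, rng))  # epsilon = 0 is always greedy
```

As epsilon decays toward zero, the agent shifts smoothly from exploration to exploitation.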

Model-Free Control Refresher

All agents maximize the same discounted return but bootstrap differently:

  • Monte Carlo: wait for the full episode return \(G_t\) and update \(Q \leftarrow Q + \alpha (G_t - Q)\). Low bias, high variance.
  • SARSA: on-policy target \(R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})\). Learns cautious policies that reflect the exploration behavior.
  • Q-learning: off-policy target \(R_{t+1} + \gamma \max_a Q(S_{t+1}, a)\). Fast convergence but can overestimate with noisy values.
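The three targets above can be written as tabular update sketches. The learning rate and discount below are illustrative placeholders, not the demo's actual hyperparameters:

```python
import numpy as np

ALPHA, GAMMA = 0.1, 0.99  # hypothetical learning rate and discount

def mc_update(Q, episode):
    """Monte Carlo: update each visited (s, a) toward the observed return G_t."""
    G = 0.0
    for s, a, r in reversed(episode):      # episode = [(state, action, reward), ...]
        G = r + GAMMA * G                  # discounted return from step t
        Q[s, a] += ALPHA * (G - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, done):
    """SARSA: bootstrap from the action the policy actually takes next."""
    target = r + (0.0 if done else GAMMA * Q[s_next, a_next])
    Q[s, a] += ALPHA * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, done):
    """Q-learning: bootstrap from the best next action, whatever is actually taken."""
    target = r + (0.0 if done else GAMMA * Q[s_next].max())
    Q[s, a] += ALPHA * (target - Q[s, a])
```

Note that the only difference between SARSA and Q-learning is whose next action supplies the bootstrap value: the behavior policy's, or the greedy max.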

Each agent uses epsilon-greedy exploration. SARSA and Q-learning now decay \(\epsilon\) per environment step so TD methods see the same exploration schedule regardless of episode length, while Monte Carlo continues to decay after each episode.
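A sketch of that shared schedule, assuming a multiplicative decay with a floor (the constants are hypothetical, not the demo's actual values):

```python
def decayed_epsilon(eps0, eps_min, decay, n):
    """Return eps0 * decay**n, floored at eps_min.

    TD agents pass n = total environment steps taken;
    Monte Carlo passes n = completed episodes.
    """
    return max(eps_min, eps0 * decay ** n)

# Under the per-step schedule, an agent that survives many steps per
# episode decays its exploration faster (per episode) than one that
# fails quickly; counting steps instead of episodes equalizes this.
print(decayed_epsilon(1.0, 0.05, 0.999, 500))
```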

State Discretization

CartPole’s continuous state \((x, \dot{x}, \theta, \dot{\theta})\) is bucketed into 5 bins per dimension (625 states) with 2 discrete actions. The resulting Q-table stores 1,250 values per checkpoint.

Variable                Range               Bins
Cart Position (x)       [-2.4, 2.4] m       5
Cart Velocity (ẋ)       [-3, 3] m/s         5
Pole Angle (θ)          [-0.21, 0.21] rad   5
Angular Velocity (θ̇)    [-2, 2] rad/s       5

Trade-off: coarse bins speed up learning but miss nuance; finer bins explode the state space.
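A sketch of the bucketing, using the ranges from the table above; out-of-range values fall into the edge bins here, which may differ from the demo's exact clipping behavior:

```python
import numpy as np

# Bin bounds per dimension, matching the ranges in the table above.
BOUNDS = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-2.0, 2.0)]
N_BINS = 5

def discretize(obs):
    """Map a continuous CartPole observation to a tuple of bin indices in [0, 4]."""
    idx = []
    for value, (lo, hi) in zip(obs, BOUNDS):
        edges = np.linspace(lo, hi, N_BINS + 1)[1:-1]   # 4 interior edges -> 5 bins
        idx.append(int(np.digitize(value, edges)))
    return tuple(idx)

# The all-zero (perfectly balanced) state lands in the middle bin of every dimension.
print(discretize([0.0, 0.0, 0.0, 0.0]))  # (2, 2, 2, 2)
```

The resulting tuple indexes directly into the 5×5×5×5×2 Q-table.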

Why MC Looks Strong Here

  • Episodic fit: CartPole is short-horizon and resets often, so using full-episode returns matches the problem well and avoids bootstrapping bias.
  • Coarse bins: With 5×5×5×5 discretization, Q-values are noisy; MC’s unbiased targets handle this noise better than TD’s max over rough estimates.
  • Exploration schedule: MC decays \(\epsilon\) per episode, so it turns greedy sooner than the TD agents (which decay per step), lifting its logged returns earlier.
  • On-policy stability: SARSA learns the behavior policy and can be cautious; Q-learning’s off-policy max can overestimate. MC sidesteps both issues.
  • Why the curve looks jagged: In RL you’re averaging over entire episodes, not i.i.d. samples. Returns swing with stochastic starts and exploration, so curves are much noisier than supervised loss plots. The practical signal is the trend and the greedy-eval marker, not the small bumps.

How to Use This Demo

  • Select a checkpoint (0, 10k, 80k, 150k steps) to load the saved Q-table and statistics for all three algorithms.
  • Play / Step / Reset controls run the current agent policy inside the browser CartPole environment. Reset also stops playback.
  • Learning curve panel plots return vs. training steps (rescaled from logged episodes). Use it to compare sample efficiency.
  • Q-table heatmap shows a slice through \(x=0, \dot{x}=0\) with the greedy action arrow. Hover between checkpoints to see how policies sharpen.
  • Metrics row shows training average return, greedy-policy average return (evaluated in-browser), and the exploration rate at that checkpoint.
  • Checkpoint notes under the buttons explain what changes between training stages—read them while flipping between algorithms.

Training Checkpoint

Currently viewing: Random initialization (0 steps). At this checkpoint all three agents act on untrained Q-tables, so watch them fail to balance.

Below the selector, the demo shows one panel per algorithm: Monte Carlo (episode-end updates), SARSA (on-policy TD), and Q-learning (off-policy TD). Each panel includes an attempt counter, a learning curve, and a Q-table heatmap (θ vs θ̇).