Policy Iteration

Policy Iteration is a fundamental reinforcement learning algorithm that finds the optimal policy by alternating between two steps: evaluating how good the current policy is, and improving the policy based on that evaluation.

Why Policy Iteration?

  • Guaranteed convergence - Always finds the optimal policy for the gridworld
  • Clear structure - Separates evaluation and improvement into distinct phases
  • Foundation for RL - Core concepts appear in many modern RL algorithms

The Two Phases:

  • Policy Evaluation: Calculate the value V(s) of each state under the current policy
  • Policy Improvement: Update the policy to choose better actions based on those values

Watch as the algorithm discovers the optimal path through the gridworld, iteratively refining both the value estimates and the policy!

How Policy Iteration Works

Policy Evaluation Phase:

Compute the value V(s) for each state by repeatedly applying the Bellman equation until values stabilize:

V(s) ← R(s, π(s)) + γ Σ_{s′} P(s′ | s, π(s)) V(s′)

Where:

  • V(s) = value of state s
  • R(s,a) = immediate reward for taking action a in state s
  • γ = discount factor (0.9)
  • π(s) = current policy (action to take in state s)
  • P(s′|s,a) = probability of transitioning to next state s′ after taking action a in s
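
To make the backup concrete, here is a minimal Python sketch of the evaluation phase (not the demo's actual code). It assumes a small 4×4 deterministic grid with the goal in the top-right corner, a -1 step penalty, and the goal treated as a terminal state whose value is pinned at +10; because transitions are deterministic, the sum over s′ collapses to the single successor state.

    GAMMA = 0.9                # discount factor
    THETA = 0.01               # stop sweeping once the largest update < 0.01
    SIZE = 4                   # illustrative grid size (the demo's may differ)
    GOAL = (0, SIZE - 1)       # goal in the top-right corner (assumption)
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]

    def step(state, action):
        """Deterministic transition: move one cell, or stay put at a wall."""
        r, c = state
        dr, dc = ACTIONS[action]
        nr, nc = r + dr, c + dc
        return (nr, nc) if 0 <= nr < SIZE and 0 <= nc < SIZE else state

    def policy_evaluation(policy, V):
        """Sweep the Bellman backup for the fixed policy until values stabilize."""
        while True:
            delta = 0.0
            for s in STATES:
                if s == GOAL:                      # terminal state keeps its value
                    continue
                s_next = step(s, policy[s])        # deterministic, so the sum over s'
                new_v = -1.0 + GAMMA * V[s_next]   # collapses to a single successor
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < THETA:
                return V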

Policy Improvement Phase:

Update the policy to be greedy with respect to the current values:

π(s) ← argmax_a [ R(s,a) + γ Σ_{s′} P(s′ | s, a) V(s′) ]

Choose the action that leads to the highest expected value.
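
Continuing the sketch above (reusing STATES, ACTIONS, GOAL, GAMMA, and step), the improvement phase simply makes the policy greedy with respect to the current value estimates and reports whether any action changed:

    def policy_improvement(policy, V):
        """Make the policy greedy w.r.t. V; report whether it changed."""
        stable = True
        for s in STATES:
            if s == GOAL:
                continue
            # One-step lookahead for every action; keep the best one.
            best_a = max(ACTIONS, key=lambda a: -1.0 + GAMMA * V[step(s, a)])
            if best_a != policy[s]:
                stable = False
            policy[s] = best_a
        return policy, stable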

Convergence:

  • Policy Evaluation converges when the largest value update in a sweep falls below a threshold (δ < 0.01)
  • Policy Iteration converges when the policy stops changing
  • The final policy is guaranteed to be optimal
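
Putting the two phases together (again reusing the definitions from the sketches above), the outer loop alternates evaluation and improvement until the improvement step leaves every action unchanged:

    import random

    def policy_iteration():
        """Alternate evaluation and improvement until the policy stops changing."""
        V = {s: 0.0 for s in STATES}
        V[GOAL] = 10.0                                  # "goal = +10" initial value
        policy = {s: random.choice(list(ACTIONS)) for s in STATES if s != GOAL}
        iterations = 0
        while True:
            V = policy_evaluation(policy, V)
            policy, stable = policy_improvement(policy, V)
            iterations += 1
            if stable:                                  # unchanged policy => optimal
                return policy, V, iterations

Calling policy_iteration() returns the greedy policy, the converged values, and the number of evaluation/improvement cycles it took.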

Grid World Specifics:

  • Goal reward: +10
  • Step penalty: -1
  • Discount factor: γ = 0.9
  • Deterministic transitions (actions always succeed)

How to Use This Demo

Value Grid (Left):

  • Shows the estimated value V(s) of each state
  • Higher values indicate states closer to the goal
  • Click "Policy Evaluation Step" to update values for one iteration
  • Values stabilize when the current policy is fully evaluated

Policy Grid (Right):

  • Shows the current policy as arrows (↑↓←→)
  • Arrows point to the best action from each state
  • Click "Policy Improvement Step" to update the policy based on current values
  • Policy stabilizes when it becomes optimal

Control Buttons:

  • "Policy Evaluation Step": Run one iteration of value updates
  • "Policy Improvement Step": Update policy to be greedy with current values
  • "Run Until Convergence": Automatically iterate until optimal policy found
  • "Reset": Start over with zero values and random policy

Agent Visualization (Bottom):

  • Click "Animate Agent" to see the agent follow the current policy
  • Agent starts at bottom-left, follows policy arrows to reach goal
  • Watch how the path improves as the policy gets better!
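
The animation is essentially a rollout of the current policy. A sketch of such a rollout, reusing the gridworld helpers above and assuming the bottom-left start cell and the "+10 goal, -1 step" rewards listed earlier:

    def rollout(policy, start=(SIZE - 1, 0), max_steps=50):
        """Follow the current policy from the start cell, tracking steps and reward."""
        s, steps, total_reward = start, 0, 0.0
        while s != GOAL and steps < max_steps:          # cap steps: a bad policy may loop
            s = step(s, policy[s])
            steps += 1
            total_reward += 10.0 if s == GOAL else -1.0
        return steps, total_reward, s == GOAL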

What to Observe

Starting Out:

  • Initially, all values are zero (except goal = +10)
  • The policy starts random (may not lead to goal)
  • Try animating the agent at start - it won't reach the goal efficiently!

During Policy Evaluation:

  • Values propagate backward from the goal
  • States near the goal get higher values first
  • Multiple evaluation steps needed for values to stabilize
  • Watch the step penalty (-1) affect values of distant states

During Policy Improvement:

  • Policy arrows update to point toward higher-value states
  • Usually only takes one improvement step after full evaluation
  • Policy changes dramatically in early iterations
  • Later iterations show smaller refinements

Convergence:

  • Values stabilize to reflect true expected return from each state
  • Policy arrows show the optimal path from every position
  • Agent animation follows the shortest path to goal
  • Typical convergence: 5-10 full policy iteration cycles

Experiment:

  • Try doing several evaluation steps before one improvement step
  • Click "Reset" and watch the learning process from scratch
  • Compare the final policy to your manual solution from lec08b!
