Policy Iteration

Policy Iteration is a fundamental reinforcement learning algorithm that finds the optimal policy by alternating between two steps: evaluating how good the current policy is, and improving the policy based on that evaluation.

Why Policy Iteration?

  • Guaranteed convergence - Always finds the optimal policy for the gridworld
  • Clear structure - Separates evaluation and improvement into distinct phases
  • Foundation for RL - Core concepts appear in many modern RL algorithms

The Two Phases:

  • Policy Evaluation: Calculate the value V(s) of each state under the current policy
  • Policy Improvement: Update the policy to choose better actions based on those values

Watch as the algorithm discovers the optimal path through the gridworld, iteratively refining both the value estimates and the policy!

How Policy Iteration Works

Policy Evaluation Phase:

Compute the value V(s) for each state by repeatedly applying the Bellman equation until values stabilize:

V(s) ← R(s, π(s)) + γ Σ_{s′} P(s′ | s, π(s)) V(s′)

Where:

  • V(s) = value of state s
  • R(s,a) = immediate reward for taking action a in state s
  • γ = discount factor (0.9)
  • π(s) = current policy (action to take in state s)
  • P(s′|s,a) = probability of transitioning to next state s′ after taking action a in s
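
To make the backup concrete, here is a minimal Python sketch of the evaluation phase (not the demo's actual code). It assumes a small 4×4 deterministic grid with the goal in the top-right corner, a -1 step penalty, and the goal treated as a terminal state whose value is pinned at +10; because transitions are deterministic, the sum over s′ collapses to the single successor state.

    GAMMA = 0.9                # discount factor
    THETA = 0.01               # stop sweeping once the largest update < 0.01
    SIZE = 4                   # illustrative grid size (the demo's may differ)
    GOAL = (0, SIZE - 1)       # goal in the top-right corner (assumption)
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    STATES = [(r, c) for r in range(SIZE) for c in range(SIZE)]

    def step(state, action):
        """Deterministic transition: move one cell, or stay put at a wall."""
        r, c = state
        dr, dc = ACTIONS[action]
        nr, nc = r + dr, c + dc
        return (nr, nc) if 0 <= nr < SIZE and 0 <= nc < SIZE else state

    def policy_evaluation(policy, V):
        """Sweep the Bellman backup for the fixed policy until values stabilize."""
        while True:
            delta = 0.0
            for s in STATES:
                if s == GOAL:                      # terminal state keeps its value
                    continue
                s_next = step(s, policy[s])        # deterministic, so the sum over s'
                new_v = -1.0 + GAMMA * V[s_next]   # collapses to a single successor
                delta = max(delta, abs(new_v - V[s]))
                V[s] = new_v
            if delta < THETA:
                return V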

Policy Improvement Phase:

Update the policy to be greedy with respect to the current values:

π(s) ← argmax_a [ R(s,a) + γ Σ_{s′} P(s′ | s, a) V(s′) ]

Choose the action that leads to the highest expected value.
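
Continuing the sketch above (reusing STATES, ACTIONS, GOAL, GAMMA, and step), the improvement phase simply makes the policy greedy with respect to the current value estimates and reports whether any action changed:

    def policy_improvement(policy, V):
        """Make the policy greedy w.r.t. V; report whether it changed."""
        stable = True
        for s in STATES:
            if s == GOAL:
                continue
            # One-step lookahead for every action; keep the best one.
            best_a = max(ACTIONS, key=lambda a: -1.0 + GAMMA * V[step(s, a)])
            if best_a != policy[s]:
                stable = False
            policy[s] = best_a
        return policy, stable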

Convergence:

  • Policy Evaluation converges when the largest value update in a sweep falls below a threshold (δ < 0.01)
  • Policy Iteration converges when the policy stops changing
  • The final policy is guaranteed to be optimal
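
Putting the two phases together (again reusing the definitions from the sketches above), the outer loop alternates evaluation and improvement until the improvement step leaves every action unchanged:

    import random

    def policy_iteration():
        """Alternate evaluation and improvement until the policy stops changing."""
        V = {s: 0.0 for s in STATES}
        V[GOAL] = 10.0                                  # "goal = +10" initial value
        policy = {s: random.choice(list(ACTIONS)) for s in STATES if s != GOAL}
        iterations = 0
        while True:
            V = policy_evaluation(policy, V)
            policy, stable = policy_improvement(policy, V)
            iterations += 1
            if stable:                                  # unchanged policy => optimal
                return policy, V, iterations

Calling policy_iteration() returns the greedy policy, the converged values, and the number of evaluation/improvement cycles it took.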

Grid World Specifics:

  • Goal reward: +10
  • Step penalty: -1
  • Discount factor: γ = 0.9
  • Deterministic transitions (actions always succeed)

How to Use This Demo

Value Grid (Left):

  • Shows the estimated value V(s) of each state
  • Higher values indicate states closer to the goal
  • Click "Policy Evaluation Step" to update values for one iteration
  • Values stabilize when the current policy is fully evaluated

Policy Grid (Right):

  • Shows the current policy as arrows (↑↓←→)
  • Arrows point to the best action from each state
  • Click "Policy Improvement Step" to update the policy based on current values
  • Policy stabilizes when it becomes optimal

Control Buttons:

  • "Policy Evaluation Step": Run one iteration of value updates
  • "Policy Improvement Step": Update policy to be greedy with current values
  • "Run Until Convergence": Automatically iterate until optimal policy found
  • "Reset": Start over with zero values and random policy

Agent Visualization (Bottom):

  • Click "Animate Agent" to see the agent follow the current policy
  • Agent starts at bottom-left, follows policy arrows to reach goal
  • Watch how the path improves as the policy gets better!
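
The animation is essentially a rollout of the current policy. A sketch of such a rollout, reusing the gridworld helpers above and assuming the bottom-left start cell and the "+10 goal, -1 step" rewards listed earlier:

    def rollout(policy, start=(SIZE - 1, 0), max_steps=50):
        """Follow the current policy from the start cell, tracking steps and reward."""
        s, steps, total_reward = start, 0, 0.0
        while s != GOAL and steps < max_steps:          # cap steps: a bad policy may loop
            s = step(s, policy[s])
            steps += 1
            total_reward += 10.0 if s == GOAL else -1.0
        return steps, total_reward, s == GOAL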

What to Observe

Starting Out:

  • Initially, all values are zero (except goal = +10)
  • The policy starts random (may not lead to goal)
  • Try animating the agent at start - it won't reach the goal efficiently!

During Policy Evaluation:

  • Values propagate backward from the goal
  • States near the goal get higher values first
  • Multiple evaluation steps needed for values to stabilize
  • Watch the step penalty (-1) affect values of distant states

During Policy Improvement:

  • Policy arrows update to point toward higher-value states
  • Usually only takes one improvement step after full evaluation
  • Policy changes dramatically in early iterations
  • Later iterations show smaller refinements

Convergence:

  • Values stabilize to reflect true expected return from each state
  • Policy arrows show the optimal path from every position
  • Agent animation follows the shortest path to goal
  • Typical convergence: 5-10 full policy iteration cycles

Experiment:

  • Try doing several evaluation steps before one improvement step
  • Click "Reset" and watch the learning process from scratch
  • Compare the final policy to your manual solution from lec08b!
