Multi-Armed Bandit Problem

The multi-armed bandit is a classic problem in reinforcement learning that models the exploration vs exploitation dilemma. Imagine you're in a casino with multiple slot machines ("one-armed bandits"), each with unknown payout rates.

This problem appears everywhere:

  • A/B Testing - Which website design performs better?
  • Clinical Trials - Which treatment should we give to patients?
  • Online Advertising - Which ad gets more clicks?
  • Resource Allocation - Where should we invest limited resources?

The key challenge: Should you exploit what you know works, or explore to potentially find something better?

Mathematical Framework

At each time step $t$, you choose an action $a_t$ and receive a reward $r_t$. The goal is to maximize cumulative reward over time.

Key Concepts:

  • Action Value: $q(a) = \mathbb{E}[r_t \mid a_t = a]$ - the true expected reward of arm $a$
  • Estimated Value: $Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{t} r_i \, \mathbb{1}[a_i = a]$ - the sample average over the $N_t(a)$ pulls of arm $a$ so far
  • Regret: $L_t = \sum_{i=1}^{t} \big( q(a^*) - q(a_i) \big)$ - the cumulative opportunity cost relative to the optimal arm $a^*$
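
To make these definitions concrete, here is a minimal Python sketch that maintains the sample-average estimate $Q_t(a)$ incrementally and tracks regret. The three true means and the uniform-random policy are assumptions made purely for illustration; in the demo the true distributions stay hidden.

```python
import random

# Illustrative only: these true means are made up; the demo hides its real distributions.
true_means = [0.3, 0.5, 0.7]      # q(a) for each arm (Bernoulli rewards assumed)
best_mean = max(true_means)       # q(a*), needed only to measure regret

counts = [0, 0, 0]                # N_t(a): number of pulls of each arm
estimates = [0.0, 0.0, 0.0]       # Q_t(a): sample-average estimates
regret = 0.0                      # L_t: cumulative opportunity cost

for t in range(50):
    arm = random.randrange(3)     # any policy works here; uniform random for simplicity
    reward = 1.0 if random.random() < true_means[arm] else 0.0
    counts[arm] += 1
    # Incremental sample average: Q_new = Q_old + (r - Q_old) / N
    estimates[arm] += (reward - estimates[arm]) / counts[arm]
    regret += best_mean - true_means[arm]

print("Estimates:", [round(q, 2) for q in estimates], "Regret:", round(regret, 2))
```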

Epsilon-Greedy Algorithm:

$$
a_t = \begin{cases}
\arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon \\
\text{random action} & \text{with probability } \epsilon
\end{cases}
$$

Where $\epsilon \in [0, 1]$ controls the exploration rate:

  • $\epsilon = 0$: Pure exploitation (greedy)
  • $\epsilon = 1$: Pure exploration (random)
  • $\epsilon = 0.1$: 90% exploitation, 10% exploration

Why it works: As $t \to \infty$, we explore all actions infinitely often, so $Q_t(a) \to q(a)$ and the algorithm finds the optimal action.
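
A compact epsilon-greedy loop might look like the sketch below. The Bernoulli arms and their probabilities are assumptions for illustration; $\epsilon = 0.1$ matches the example above.

```python
import random

def epsilon_greedy(true_means, epsilon=0.1, steps=1000, seed=0):
    """Run epsilon-greedy on a simulated Bernoulli bandit.

    true_means is assumed only to simulate rewards; the algorithm never reads it."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                           # explore: random arm
        else:
            arm = max(range(k), key=lambda a: estimates[a])  # exploit: greedy arm
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, total_reward

print(epsilon_greedy([0.3, 0.5, 0.7], epsilon=0.1))
```

With enough steps, the estimates approach the true means and most pulls go to the best arm.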

How to Use This Demo

Interactive Learning:

  • Pull arms - Click the "Pull Arm" buttons to try different arms (50 pulls total)
  • Watch statistics - Observe total reward, number of pulls, and mean reward for each arm
  • Develop strategy - Try to maximize your total reward
  • Reset game - Start over with the same arm distributions

Algorithm Demonstrations:

  • "Show Exploration Phase" - Watch how epsilon-greedy learns the arm values
  • "Show Trained Performance" - See how the algorithm performs after learning
  • "Reveal True Distributions" - Discover the actual reward probabilities

Visual Guide:

  • Total Reward: Sum of all rewards received from this arm
  • Number of Pulls: How many times you've tried this arm
  • Mean Reward: Average reward per pull (estimated value)

Strategy and Insights

Human vs Algorithm:

  • Try playing first, then compare with the epsilon-greedy algorithm
  • Humans often explore too little ("hot hand fallacy") or too much (overthinking)
  • The algorithm balances exploration and exploitation mathematically

Key Insights:

  • Early exploration matters - Don't commit too quickly to one arm
  • Sample size affects confidence - More pulls give better estimates
  • Opportunity cost is real - Every suboptimal choice costs potential reward
  • Perfect information is impossible - You must act with uncertainty

Real-world Applications:

  • Start with some exploration to gather data
  • Gradually shift toward exploitation as confidence grows (for example by decaying $\epsilon$ over time - see the sketch after this list)
  • Consider the cost of exploration vs potential gains
  • Monitor for changes in the environment (non-stationary bandits)
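
One simple way to implement the "gradually shift toward exploitation" idea is to decay $\epsilon$ over time. The exponential schedule below is just one possible choice, not the demo's actual behavior, and its parameters are illustrative.

```python
def decayed_epsilon(t: int, initial: float = 1.0, half_life: int = 20) -> float:
    """One possible schedule: start fully exploratory and halve the
    exploration rate every `half_life` pulls (parameters are illustrative)."""
    return initial * 0.5 ** (t / half_life)

# Exploration rate over the first 100 pulls
for t in (0, 20, 40, 100):
    print(t, round(decayed_epsilon(t), 3))
```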

Advanced Concepts:

  • Upper Confidence Bound (UCB) - More sophisticated than epsilon-greedy; adds an optimism bonus to under-explored arms (see the sketch after this list)
  • Thompson Sampling - Bayesian approach to exploration
  • Contextual Bandits - Actions depend on observed context
  • Non-stationary Bandits - Reward distributions change over time
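
As a rough illustration of the first item, here is a UCB1-style sketch: instead of exploring at random, it adds a confidence bonus $\sqrt{2 \ln t / N_t(a)}$ to each estimate, so rarely tried arms look optimistic and get revisited. The Bernoulli simulation is again an assumption for illustration only.

```python
import math
import random

def ucb1(true_means, steps=1000, seed=0):
    """UCB1-style selection: pick the arm maximizing Q(a) + sqrt(2 ln t / N(a)).

    true_means drives the simulated Bernoulli rewards and is illustrative only."""
    rng = random.Random(seed)
    k = len(true_means)
    counts = [0] * k
    estimates = [0.0] * k
    total_reward = 0.0

    for t in range(1, steps + 1):
        if t <= k:
            arm = t - 1                     # play each arm once to initialize
        else:
            arm = max(
                range(k),
                key=lambda a: estimates[a] + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total_reward += reward

    return estimates, total_reward

print(ucb1([0.3, 0.5, 0.7]))
```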

Live Competition Mode

Challenge: Maximize your total reward over 100 pulls and compete against your classmates!

Keep this tab active - switching to another tab may temporarily remove you from the leaderboard

Competition Rules:

  • You have exactly 100 pulls to maximize your reward
  • The arm distributions are different from practice mode
  • Your score is submitted automatically after all 100 pulls
  • One-time only: Your first completion counts - no retries
  • Real-time leaderboard shows top performers

Strategy Hints:

  • Balance exploration (trying different arms) with exploitation (using best-known arm)
  • Early exploration helps you find the best arm faster
  • Don't spend too many pulls exploring - you need to exploit to win!

Scoring:

  • Each arm has a hidden probability distribution
  • Your total reward is the sum of all rewards received
  • Higher total reward = better rank on the leaderboard

Interactive Arms - Try Your Strategy!

[Interactive widget: three arms (Arm 1-3), each showing its Total reward, number of Pulls, and Mean reward, plus running totals and a pull counter out of 50.]

Algorithm Demonstrations

See how the epsilon-greedy algorithm tackles this problem: