Multi-Armed Bandit Problem

The multi-armed bandit is a classic problem in reinforcement learning that models the exploration vs. exploitation dilemma. Imagine you're in a casino with multiple slot machines ("one-armed bandits"), each with an unknown payout rate.

This problem appears everywhere:

  • A/B Testing - Which website design performs better?
  • Clinical Trials - Which treatment should we give to patients?
  • Online Advertising - Which ad gets more clicks?
  • Resource Allocation - Where should we invest limited resources?

The key challenge: Should you exploit what you know works, or explore to potentially find something better?

Mathematical Framework

At each time step $t$, you choose an action $a_t$ and receive reward $r_t$. The goal is to maximize cumulative reward over time.

Key Concepts:

  • Action Value: $q^*(a) = \mathbb{E}[r_t \mid a_t = a]$ - the true expected reward of arm $a$
  • Estimated Value: $Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{t} r_i \mathbf{1}_{a_i = a}$ - the sample average, where $N_t(a)$ counts how often arm $a$ has been pulled so far
  • Regret: $L_t = \sum_{i=1}^{t} \big(q^*(a^*) - q^*(a_i)\big)$ - cumulative opportunity cost relative to the best arm $a^* = \arg\max_a q^*(a)$ (a small numerical example follows this list)
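To make these definitions concrete, here is a minimal sketch that computes the sample-average estimates $Q_t(a)$ and the cumulative regret $L_t$ from a short pull history. The arm values and the log are illustrative assumptions, not data from the demo:

```python
import numpy as np

# Illustrative setup: assumed true arm values and a made-up pull history.
q_star = np.array([0.3, 0.5, 0.7])                    # q*(a) for three arms
actions = np.array([0, 1, 2, 2, 1, 2, 0, 2])          # a_i: arm chosen at step i
rewards = np.array([0., 1., 1., 0., 1., 1., 0., 1.])  # r_i: observed rewards

# Q_t(a): sample-average estimate for each arm (0 if never pulled)
Q = np.array([rewards[actions == a].mean() if (actions == a).any() else 0.0
              for a in range(len(q_star))])

# L_t: cumulative regret relative to always pulling the best arm
regret = np.sum(q_star.max() - q_star[actions])

print("Q_t:", Q.round(2), " regret:", round(float(regret), 2))
```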

Epsilon-Greedy Algorithm:

$$a_t = \begin{cases} \arg\max_a Q_t(a) & \text{with probability } 1-\epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$

Where $\epsilon \in [0,1]$ controls the exploration rate:

  • $\epsilon = 0$: Pure exploitation (greedy)
  • $\epsilon = 1$: Pure exploration (random)
  • $\epsilon = 0.1$: roughly 90% exploitation, 10% exploration (a random pull can still happen to land on the greedy arm)

Why it works: As $t \to \infty$, every action is explored infinitely often, so $Q_t(a) \to q^*(a)$ for all arms and the greedy choice converges to the optimal action. Note that a fixed $\epsilon$ keeps paying a small exploration cost forever (regret grows linearly); decaying $\epsilon$ over time removes this cost.
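Below is a minimal simulation sketch of epsilon-greedy on a Bernoulli bandit; the arm probabilities, step count, and function name are illustrative assumptions, not the demo's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_epsilon_greedy(true_means, steps=1000, epsilon=0.1):
    """Simulate epsilon-greedy on a Bernoulli bandit (illustrative)."""
    k = len(true_means)
    Q = np.zeros(k)   # Q_t(a): estimated values
    N = np.zeros(k)   # N_t(a): pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))             # explore: uniform random arm
        else:
            a = int(np.argmax(Q))                # exploit: current greedy arm
        r = float(rng.random() < true_means[a])  # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                # incremental sample average
        total += r
    return Q, N, total

Q, N, total = run_epsilon_greedy([0.3, 0.5, 0.7])
print("estimates:", Q.round(2), "pulls:", N, "total reward:", total)
```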

How to Use This Demo

Interactive Learning:

  • Pull arms - Click the "Pull Arm" buttons to try different arms (50 pulls total)
  • Watch statistics - Observe total reward, number of pulls, and mean reward for each arm
  • Develop strategy - Try to maximize your total reward
  • Reset game - Start over with the same arm distributions

Algorithm Demonstrations:

  • "Show Exploration Phase" - Watch how epsilon-greedy learns the arm values
  • "Show Trained Performance" - See how the algorithm performs after learning
  • "Reveal True Distributions" - Discover the actual reward probabilities

Visual Guide:

  • Total Reward: Sum of all rewards received from this arm
  • Number of Pulls: How many times you've tried this arm
  • Mean Reward: Average reward per pull (estimated value)

Strategy and Insights

Human vs Algorithm:

  • Try playing first, then compare with the epsilon-greedy algorithm
  • Humans often explore too little (the "hot hand" fallacy: sticking with an arm that merely got lucky) or too much (second-guessing solid estimates)
  • The algorithm balances exploration and exploitation mathematically

Key Insights:

  • Early exploration matters - Don't commit too quickly to one arm
  • Sample size affects confidence - More pulls give better estimates; the uncertainty in the mean shrinks like $1/\sqrt{n}$ (see the sketch after this list)
  • Opportunity cost is real - Every suboptimal choice costs potential reward
  • Perfect information is impossible - You must act with uncertainty
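A quick simulation of that shrinking uncertainty, assuming a single Bernoulli arm with an illustrative payout rate of 0.6:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.6  # assumed true payout rate of one arm

# The spread of the sample mean shrinks roughly like 1/sqrt(n):
for n in (10, 100, 1000):
    estimates = rng.binomial(n, p, size=10_000) / n
    print(f"n={n:5d}  mean={estimates.mean():.3f}  spread={estimates.std():.3f}")
```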

Real-world Applications:

  • Start with some exploration to gather data
  • Gradually shift toward exploitation as confidence grows
  • Consider the cost of exploration vs potential gains
  • Monitor for changes in the environment (non-stationary bandits) - as in the tracking sketch below
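When reward rates drift, a constant step size (an exponential recency-weighted average) keeps tracking the change, whereas a plain sample average would not. A minimal sketch, with the drift point and rates chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, Q = 0.1, 0.0                     # constant step size (illustrative value)
for t in range(2000):
    p = 0.3 if t < 1000 else 0.8        # reward rate shifts halfway through
    r = float(rng.random() < p)
    Q += alpha * (r - Q)                # recent rewards dominate the estimate
print(f"final estimate: {Q:.2f} (true rate is now 0.8)")
```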

Advanced Concepts:

  • Upper Confidence Bound (UCB) - adds an optimism bonus for under-sampled arms, often outperforming epsilon-greedy
  • Thompson Sampling - Bayesian approach: sample from each arm's posterior and act greedily (both rules are sketched below)
  • Contextual Bandits - Actions depend on observed context
  • Non-stationary Bandits - Reward distributions change over time
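Hedged sketches of the first two selection rules for Bernoulli arms; the function names and the exploration constant $c$ are illustrative choices, not fixed conventions:

```python
import numpy as np

rng = np.random.default_rng(3)

def ucb_action(Q, N, t, c=2.0):
    """UCB1-style rule: add an optimism bonus for rarely pulled arms."""
    untried = np.flatnonzero(N == 0)
    if untried.size:                    # try every arm once first
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

def thompson_action(successes, failures):
    """Beta-Bernoulli Thompson sampling: sample a plausible mean for each
    arm from its posterior, then pick the arm with the best sample."""
    return int(np.argmax(rng.beta(successes + 1, failures + 1)))
```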

Live Competition Mode

Challenge: Maximize your total reward over 100 pulls and compete against your classmates!

Keep this tab active - switching to another tab may temporarily remove you from the leaderboard

Competition Rules:

  • You have exactly 100 pulls to maximize your reward
  • The arm distributions are different from those used in practice mode
  • Your score is submitted automatically after all 100 pulls
  • One-time only: Your first completion counts - no retries
  • Real-time leaderboard shows top performers

Strategy Hints:

  • Balance exploration (trying different arms) with exploitation (using best-known arm)
  • Early exploration helps you find the best arm faster
  • Don't spend too many pulls exploring - you need to exploit to win!

Scoring:

  • Each arm has a hidden probability distribution
  • Your total reward is the sum of all rewards received
  • Higher total reward = better rank on the leaderboard

Interactive Arms - Try Your Strategy!

[Interactive widget: Arms 1-3, each with a "Pull Arm" button and live Total, Pulls, and Mean counters, plus running Total Rewards and Total Pulls (out of 50)]

Algorithm Demonstrations

See how the epsilon-greedy algorithm tackles this problem: