Multi-Armed Bandit Problem

The multi-armed bandit is a classic problem in reinforcement learning that models the exploration vs. exploitation dilemma. Imagine you're in a casino with multiple slot machines ("one-armed bandits"), each with an unknown payout rate.

This problem appears everywhere:

  • A/B Testing - Which website design performs better?
  • Clinical Trials - Which treatment should we give to patients?
  • Online Advertising - Which ad gets more clicks?
  • Resource Allocation - Where should we invest limited resources?

The key challenge: Should you exploit what you know works, or explore to potentially find something better?

Mathematical Framework

At each time step $t$, you choose an action $a_t$ and receive reward $r_t$. The goal is to maximize cumulative reward over time.

Key Concepts:

  • Action Value: $q^*(a) = \mathbb{E}[r_t \mid a_t = a]$ - the true expected reward of arm $a$
  • Estimated Value: $Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{t} r_i \mathbf{1}_{a_i = a}$ - the sample average, where $N_t(a)$ counts how often arm $a$ has been pulled so far
  • Regret: $L_t = \sum_{i=1}^{t} \big(q^*(a^*) - q^*(a_i)\big)$ - cumulative opportunity cost relative to the best arm $a^* = \arg\max_a q^*(a)$ (a small numerical example follows this list)
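To make these definitions concrete, here is a minimal sketch that computes the sample-average estimates $Q_t(a)$ and the cumulative regret $L_t$ from a short pull history. The arm values and the log are illustrative assumptions, not data from the demo:

```python
import numpy as np

# Illustrative setup: assumed true arm values and a made-up pull history.
q_star = np.array([0.3, 0.5, 0.7])                    # q*(a) for three arms
actions = np.array([0, 1, 2, 2, 1, 2, 0, 2])          # a_i: arm chosen at step i
rewards = np.array([0., 1., 1., 0., 1., 1., 0., 1.])  # r_i: observed rewards

# Q_t(a): sample-average estimate for each arm (0 if never pulled)
Q = np.array([rewards[actions == a].mean() if (actions == a).any() else 0.0
              for a in range(len(q_star))])

# L_t: cumulative regret relative to always pulling the best arm
regret = np.sum(q_star.max() - q_star[actions])

print("Q_t:", Q.round(2), " regret:", round(float(regret), 2))
```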

Epsilon-Greedy Algorithm:

$$a_t = \begin{cases} \arg\max_a Q_t(a) & \text{with probability } 1-\epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$

Where $\epsilon \in [0,1]$ controls the exploration rate:

  • $\epsilon = 0$: Pure exploitation (greedy)
  • $\epsilon = 1$: Pure exploration (random)
  • $\epsilon = 0.1$: roughly 90% exploitation, 10% exploration (a random pull can still happen to land on the greedy arm)

Why it works: As $t \to \infty$, every action is explored infinitely often, so $Q_t(a) \to q^*(a)$ for all arms and the greedy choice converges to the optimal action. Note that a fixed $\epsilon$ keeps paying a small exploration cost forever (regret grows linearly); decaying $\epsilon$ over time removes this cost.
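Below is a minimal simulation sketch of epsilon-greedy on a Bernoulli bandit; the arm probabilities, step count, and function name are illustrative assumptions, not the demo's actual settings:

```python
import numpy as np

rng = np.random.default_rng(0)

def run_epsilon_greedy(true_means, steps=1000, epsilon=0.1):
    """Simulate epsilon-greedy on a Bernoulli bandit (illustrative)."""
    k = len(true_means)
    Q = np.zeros(k)   # Q_t(a): estimated values
    N = np.zeros(k)   # N_t(a): pull counts
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))             # explore: uniform random arm
        else:
            a = int(np.argmax(Q))                # exploit: current greedy arm
        r = float(rng.random() < true_means[a])  # Bernoulli reward
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                # incremental sample average
        total += r
    return Q, N, total

Q, N, total = run_epsilon_greedy([0.3, 0.5, 0.7])
print("estimates:", Q.round(2), "pulls:", N, "total reward:", total)
```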

How to Use This Demo

Interactive Learning:

  • Pull arms - Click the "Pull Arm" buttons to try different arms (50 pulls total)
  • Watch statistics - Observe total reward, number of pulls, and mean reward for each arm
  • Develop strategy - Try to maximize your total reward
  • Reset game - Start over with the same arm distributions

Algorithm Demonstrations:

  • "Show Exploration Phase" - Watch how epsilon-greedy learns the arm values
  • "Show Trained Performance" - See how the algorithm performs after learning
  • "Reveal True Distributions" - Discover the actual reward probabilities

Visual Guide:

  • Total Reward: Sum of all rewards received from this arm
  • Number of Pulls: How many times you've tried this arm
  • Mean Reward: Average reward per pull (estimated value)

Strategy and Insights

Human vs Algorithm:

  • Try playing first, then compare with the epsilon-greedy algorithm
  • Humans often explore too little (the "hot hand" fallacy: sticking with an arm that merely got lucky) or too much (second-guessing solid estimates)
  • The algorithm balances exploration and exploitation mathematically

Key Insights:

  • Early exploration matters - Don't commit too quickly to one arm
  • Sample size affects confidence - More pulls give better estimates; the uncertainty in the mean shrinks like $1/\sqrt{n}$ (see the sketch after this list)
  • Opportunity cost is real - Every suboptimal choice costs potential reward
  • Perfect information is impossible - You must act with uncertainty
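A quick simulation of that shrinking uncertainty, assuming a single Bernoulli arm with an illustrative payout rate of 0.6:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.6  # assumed true payout rate of one arm

# The spread of the sample mean shrinks roughly like 1/sqrt(n):
for n in (10, 100, 1000):
    estimates = rng.binomial(n, p, size=10_000) / n
    print(f"n={n:5d}  mean={estimates.mean():.3f}  spread={estimates.std():.3f}")
```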

Real-world Applications:

  • Start with some exploration to gather data
  • Gradually shift toward exploitation as confidence grows
  • Consider the cost of exploration vs potential gains
  • Monitor for changes in the environment (non-stationary bandits) - as in the tracking sketch below
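When reward rates drift, a constant step size (an exponential recency-weighted average) keeps tracking the change, whereas a plain sample average would not. A minimal sketch, with the drift point and rates chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, Q = 0.1, 0.0                     # constant step size (illustrative value)
for t in range(2000):
    p = 0.3 if t < 1000 else 0.8        # reward rate shifts halfway through
    r = float(rng.random() < p)
    Q += alpha * (r - Q)                # recent rewards dominate the estimate
print(f"final estimate: {Q:.2f} (true rate is now 0.8)")
```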

Advanced Concepts:

  • Upper Confidence Bound (UCB) - adds an optimism bonus for under-sampled arms, often outperforming epsilon-greedy
  • Thompson Sampling - Bayesian approach: sample from each arm's posterior and act greedily (both rules are sketched below)
  • Contextual Bandits - Actions depend on observed context
  • Non-stationary Bandits - Reward distributions change over time
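Hedged sketches of the first two selection rules for Bernoulli arms; the function names and the exploration constant $c$ are illustrative choices, not fixed conventions:

```python
import numpy as np

rng = np.random.default_rng(3)

def ucb_action(Q, N, t, c=2.0):
    """UCB1-style rule: add an optimism bonus for rarely pulled arms."""
    untried = np.flatnonzero(N == 0)
    if untried.size:                    # try every arm once first
        return int(untried[0])
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

def thompson_action(successes, failures):
    """Beta-Bernoulli Thompson sampling: sample a plausible mean for each
    arm from its posterior, then pick the arm with the best sample."""
    return int(np.argmax(rng.beta(successes + 1, failures + 1)))
```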

Live Competition Mode

Challenge: Maximize your total reward over 100 pulls and compete against your classmates!

Keep this tab active - switching to another tab may temporarily remove you from the leaderboard

Competition Rules:

  • You have exactly 100 pulls to maximize your reward
  • The arm distributions are different from those used in practice mode
  • Your score is submitted automatically after all 100 pulls
  • One-time only: Your first completion counts - no retries
  • Real-time leaderboard shows top performers

Strategy Hints:

  • Balance exploration (trying different arms) with exploitation (using best-known arm)
  • Early exploration helps you find the best arm faster
  • Don't spend too many pulls exploring - you need to exploit to win!

Scoring:

  • Each arm has a hidden probability distribution
  • Your total reward is the sum of all rewards received
  • Higher total reward = better rank on the leaderboard

Interactive Arms - Try Your Strategy!

[Interactive widget: Arms 1-3, each with a "Pull Arm" button and live Total, Pulls, and Mean counters, plus running Total Rewards and Total Pulls (out of 50)]

Algorithm Demonstrations

See how the epsilon-greedy algorithm tackles this problem: