Multi-Armed Bandit Problem
The multi-armed bandit is a classic problem in reinforcement learning that models the exploration vs exploitation dilemma. Imagine you're in a casino with multiple slot machines ("one-armed bandits"), each with an unknown payout rate.
This problem appears everywhere:
- A/B Testing - Which website design performs better?
- Clinical Trials - Which treatment should we give to patients?
- Online Advertising - Which ad gets more clicks?
- Resource Allocation - Where should we invest limited resources?
The key challenge: Should you exploit what you know works, or explore to potentially find something better?
Mathematical Framework
At each time step $t$, you choose an action $A_t$ and receive a reward $R_t$. The goal is to maximize cumulative reward over time.
Key Concepts:
- Action Value: $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$ - true expected reward of action $a$
- Estimated Value: $Q_t(a)$ - sample average of the rewards received from action $a$ so far
- Regret: $L_T = \sum_{t=1}^{T} \big(q_*(a^*) - q_*(A_t)\big)$ - cumulative opportunity cost of not always pulling the best arm $a^*$
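These definitions translate directly into code. The sketch below is only an illustration, not the demo's implementation: it assumes a hypothetical 3-armed Bernoulli bandit with made-up payout probabilities and shows how the sample-average estimates $Q_t(a)$ and cumulative regret are computed while pulling arms uniformly at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true action values q_*(a) for a 3-armed Bernoulli bandit.
true_values = np.array([0.3, 0.5, 0.7])
best_value = true_values.max()

pulls = np.zeros(3)    # N_t(a): times each arm has been pulled
totals = np.zeros(3)   # sum of rewards received from each arm
regret = 0.0           # cumulative opportunity cost

for t in range(1000):
    a = int(rng.integers(3))                        # here: choose arms uniformly at random
    reward = float(rng.random() < true_values[a])   # Bernoulli reward
    pulls[a] += 1
    totals[a] += reward
    regret += best_value - true_values[a]           # regret is defined via the true values

estimates = totals / np.maximum(pulls, 1)           # Q_t(a): sample-average estimates
print("estimates:", estimates.round(2), "regret:", round(regret, 1))
```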
Epsilon-Greedy Algorithm:
$$A_t = \begin{cases} \arg\max_a Q_t(a) & \text{with probability } 1 - \epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$
Where $\epsilon$ controls the exploration rate:
- $\epsilon = 0$: Pure exploitation (greedy)
- $\epsilon = 1$: Pure exploration (random)
- $\epsilon = 0.1$: 90% exploitation, 10% exploration
Why it works: As $t \to \infty$, every action is explored infinitely often, so $Q_t(a) \to q_*(a)$ for each arm and the greedy choice converges to the optimal action.
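The pieces fit together into a short epsilon-greedy loop. This is a minimal sketch under the same hypothetical Bernoulli-arm assumption, not the demo's actual code; note the incremental update $Q_{n+1} = Q_n + \frac{1}{n}(R_n - Q_n)$, which keeps a running sample average without storing past rewards.

```python
import numpy as np

def epsilon_greedy(true_values, epsilon=0.1, steps=1000, seed=0):
    """Run epsilon-greedy on a Bernoulli bandit with the given true payout rates."""
    rng = np.random.default_rng(seed)
    k = len(true_values)
    pulls = np.zeros(k)
    estimates = np.zeros(k)   # Q(a), maintained as an incremental sample average
    total_reward = 0.0

    for _ in range(steps):
        if rng.random() < epsilon:
            a = int(rng.integers(k))          # explore: random arm
        else:
            a = int(np.argmax(estimates))     # exploit: best arm so far
        reward = float(rng.random() < true_values[a])
        pulls[a] += 1
        estimates[a] += (reward - estimates[a]) / pulls[a]   # incremental mean
        total_reward += reward

    return estimates, total_reward

# Hypothetical arm probabilities, for illustration only.
est, total = epsilon_greedy([0.3, 0.5, 0.7], epsilon=0.1)
print("estimated values:", np.round(est, 2), "total reward:", total)
```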
How to Use This Demo
Interactive Learning:
- Pull arms - Click the "Pull Arm" buttons to try different arms (50 pulls total)
- Watch statistics - Observe total reward, number of pulls, and mean reward for each arm
- Develop strategy - Try to maximize your total reward
- Reset game - Start over with the same arm distributions
Algorithm Demonstrations:
- "Show Exploration Phase" - Watch how epsilon-greedy learns the arm values
- "Show Trained Performance" - See how the algorithm performs after learning
- "Reveal True Distributions" - Discover the actual reward probabilities
Visual Guide:
- Total Reward: Sum of all rewards received from this arm
- Number of Pulls: How many times you've tried this arm
- Mean Reward: Average reward per pull (estimated value)
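The mean reward shown for each arm is simply total reward divided by number of pulls; it can also be maintained incrementally without storing individual rewards. A small sketch of how such per-arm statistics might be tracked (hypothetical, not the demo's actual code):

```python
class ArmStats:
    """Per-arm statistics like those shown in the demo panel."""

    def __init__(self):
        self.total_reward = 0.0
        self.num_pulls = 0

    def record(self, reward):
        self.num_pulls += 1
        self.total_reward += reward

    @property
    def mean_reward(self):
        # Estimated value Q(a); reported as 0 before the arm has been pulled.
        return self.total_reward / self.num_pulls if self.num_pulls else 0.0
```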
Strategy and Insights
Human vs Algorithm:
- Try playing first, then compare with the epsilon-greedy algorithm
- Humans often explore too little ("hot hand fallacy") or too much (overthinking)
- The algorithm balances exploration and exploitation mathematically
Key Insights:
- Early exploration matters - Don't commit too quickly to one arm
- Sample size affects confidence - More pulls give better estimates
- Opportunity cost is real - Every suboptimal choice costs potential reward
- Perfect information is impossible - You must act with uncertainty
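On "sample size affects confidence": the spread of a sample-average estimate shrinks roughly like $1/\sqrt{n}$, so an arm pulled 5 times is far less trustworthy than one pulled 500 times. A quick numerical check, assuming a hypothetical arm with a 0.5 payout probability:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5  # hypothetical true payout probability of a single arm

for n in [5, 50, 500]:
    # Re-estimate the arm's value 10,000 times with n pulls each and measure the spread.
    estimates = rng.binomial(n, p, size=10_000) / n
    theory = np.sqrt(p * (1 - p) / n)
    print(f"n={n:4d}  empirical std = {estimates.std():.3f}  theoretical ~ {theory:.3f}")
```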
Real-world Applications:
- Start with some exploration to gather data
- Gradually shift toward exploitation as confidence grows
- Consider the cost of exploration vs potential gains
- Monitor for changes in the environment (non-stationary bandits)
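One common way to "gradually shift toward exploitation" is to decay epsilon over time. The schedule below is just one reasonable choice (exponential decay with a floor), not something the demo prescribes:

```python
def decayed_epsilon(t, eps_start=1.0, eps_min=0.01, decay=0.99):
    """Exploration rate that starts high and decays toward a small floor."""
    return max(eps_min, eps_start * decay ** t)

# Exploration probability at a few time steps.
for t in [0, 50, 200, 1000]:
    print(t, round(decayed_epsilon(t), 3))
```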
Advanced Concepts:
- Upper Confidence Bound (UCB) - More sophisticated than epsilon-greedy
- Thompson Sampling - Bayesian approach to exploration
- Contextual Bandits - Actions depend on observed context
- Non-stationary Bandits - Reward distributions change over time
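For reference, sketches of the UCB1 and Thompson sampling action-selection rules for Bernoulli arms are shown below. These are standard textbook formulations; the exploration constant and the Beta(1, 1) priors are assumptions and vary across implementations.

```python
import numpy as np

def ucb_action(estimates, pulls, t, c=2.0):
    """UCB1: prefer arms whose estimate or remaining uncertainty is high."""
    pulls = np.asarray(pulls, dtype=float)
    if (pulls == 0).any():
        return int(np.argmin(pulls))            # pull every arm at least once first
    bonus = c * np.sqrt(np.log(t) / pulls)      # confidence bonus shrinks with pulls
    return int(np.argmax(np.asarray(estimates) + bonus))

def thompson_action(successes, failures, rng):
    """Thompson sampling for Bernoulli arms with Beta(1, 1) priors."""
    samples = rng.beta(np.asarray(successes) + 1, np.asarray(failures) + 1)
    return int(np.argmax(samples))
```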
Live Competition Mode
Challenge: Maximize your total reward over 100 pulls and compete against your classmates!
Keep this tab active - switching to another tab may temporarily remove you from the leaderboard
Competition Rules:
- You have exactly 100 pulls to maximize your reward
- The arm distributions are different from practice mode
- Your score is submitted automatically after all 100 pulls
- One-time only: Your first completion counts - no retries
- Real-time leaderboard shows top performers
Strategy Hints:
- Balance exploration (trying different arms) with exploitation (using best-known arm)
- Early exploration helps you find the best arm faster
- Don't spend too many pulls exploring - you need to exploit to win!
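A simple way to budget exploration under a fixed 100-pull limit is explore-then-commit: spend a few pulls on every arm, then commit to the arm with the best estimate. The sketch below is only a baseline strategy, and `pull(a)` is a hypothetical placeholder for whatever returns an arm's reward in the demo:

```python
import numpy as np

def explore_then_commit(pull, num_arms, budget=100, explore_per_arm=5):
    """Explore each arm a fixed number of times, then exploit the best estimate."""
    totals = np.zeros(num_arms)
    reward_sum = 0.0
    for a in range(num_arms):
        for _ in range(explore_per_arm):
            r = pull(a)
            totals[a] += r
            reward_sum += r
    best = int(np.argmax(totals / explore_per_arm))   # arm with the best sample average
    for _ in range(budget - num_arms * explore_per_arm):
        reward_sum += pull(best)
    return reward_sum
```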
Scoring:
- Each arm has a hidden probability distribution
- Your total reward is the sum of all rewards received
- Higher total reward = better rank on the leaderboard