Reinforcement Learning

Reinforcement Learning (RL) is learning through trial and error. An agent learns to make decisions by interacting with an environment, receiving rewards for good actions and penalties for bad ones.

Key Idea: Learn optimal behavior through experience, not from labeled examples. Like teaching a dog tricks with treats!

Core Concepts

Agent: the learner/decision maker (e.g., game player, robot)
Environment: the world the agent interacts with
State: the current situation/configuration of the environment
Action: what the agent can do
Reward: feedback signal (positive or negative)

The RL Loop

1. Observe State: the agent sees the current state of the environment
2. Choose Action: the agent selects an action based on its policy
3. Receive Reward: the environment returns a reward signal (and the next state)
4. Update Policy: the agent learns to improve future decisions
5. Repeat: continue until the goal is achieved
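
To make the loop concrete, here is a minimal sketch of the agent-environment interaction using the gymnasium library's CartPole-v1 task (an assumed example, not part of the original text); the random action stands in for whatever learning algorithm fills the update step:

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    state, info = env.reset(seed=0)           # step 1: observe the initial state

    for _ in range(500):
        action = env.action_space.sample()    # step 2: choose an action (random placeholder policy)
        next_state, reward, terminated, truncated, info = env.step(action)  # step 3: receive reward
        # step 4: a learning agent would update its policy here using (state, action, reward, next_state)
        state = next_state
        if terminated or truncated:           # step 5: repeat; reset when the episode ends
            state, info = env.reset()
    env.close()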

Key Algorithms

Q-Learning

Learn the value of state-action pairs (Q-values).

Q(s,a) = expected total (discounted) reward from taking action a in state s and acting optimally thereafter
Update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]

Model-free, off-policy learning
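
A minimal tabular Q-learning sketch in Python, assuming a small gymnasium-style environment with integer states (env, n_states, and n_actions are placeholders, not from the text):

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=1000,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))          # table of Q(s, a) estimates
        for _ in range(episodes):
            state, _ = env.reset()
            done = False
            while not done:
                # epsilon-greedy action choice (see the exploration section below)
                if np.random.rand() < epsilon:
                    action = np.random.randint(n_actions)
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
                target = reward + gamma * np.max(Q[next_state]) * (not terminated)
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q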

Deep Q-Network (DQN)

Use a neural network to approximate the Q-function.

Combines Q-learning with deep learning
Experience replay + target network for stability

Breakthrough: human-level play on many Atari games
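
A rough sketch of how the two stability tricks, experience replay and a target network, fit together in PyTorch (the network sizes, buffer size, and train_step helper are all illustrative assumptions, not from the text):

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        """Small fully connected Q-network; the 4-in / 2-out sizes are illustrative."""
        def __init__(self, n_inputs=4, n_actions=2):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))
        def forward(self, x):
            return self.net(x)

    q_net, target_net = QNet(), QNet()
    target_net.load_state_dict(q_net.state_dict())   # target net starts as a copy of q_net
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)                    # experience replay buffer of (s, a, r, s', done)

    def train_step(gamma=0.99, batch_size=32):
        if len(replay) < batch_size:
            return
        batch = random.sample(replay, batch_size)    # sampling breaks correlation between consecutive steps
        s, a, r, s2, done = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
        q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                        # frozen target network gives stable regression targets
            target = r + gamma * target_net(s2).max(1).values * (1 - done)
        loss = nn.functional.mse_loss(q, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # every N steps, copy q_net's weights into target_net to refresh the targets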

Policy Gradient

Directly learn the policy (a mapping from states to action probabilities).

Optimize policy parameters to maximize expected reward
Works for continuous action spaces

Examples: REINFORCE, PPO, A3C
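
For intuition, here is a sketch of the REINFORCE objective, the simplest policy-gradient method (log_probs and returns are assumed to be collected over one episode; neither is defined in the text):

    import torch

    def reinforce_loss(log_probs, returns):
        """log_probs: list of log pi(a_t|s_t) tensors from the policy network;
        returns: list of discounted reward sums G_t for the same episode."""
        returns = torch.tensor(returns, dtype=torch.float32)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # common variance-reduction trick
        # maximizing expected return == minimizing -sum_t log pi(a_t|s_t) * G_t
        return -(torch.stack(log_probs) * returns).sum()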

Actor-Critic

Combine value-based and policy-based methods.

Actor: learns the policy (what to do)
Critic: learns the value function (how good the current state is)

Best of both worlds: the critic's value estimates reduce the variance of the actor's updates, making training more stable and sample-efficient
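
A minimal sketch of how the two parts interact in a one-step actor-critic update (the function and its inputs are illustrative assumptions, not from the text):

    import torch

    def actor_critic_losses(log_prob, value, reward, next_value, gamma=0.99):
        """One-step actor-critic losses; log_prob comes from the actor,
        value and next_value from the critic (all assumed to be scalar tensors)."""
        td_target = reward + gamma * next_value.detach()  # critic's bootstrapped target
        advantage = td_target - value                     # how much better than expected the action was
        actor_loss = -log_prob * advantage.detach()       # push the policy toward advantageous actions
        critic_loss = advantage.pow(2)                     # regress the value estimate toward the target
        return actor_loss, critic_loss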

Exploration vs Exploitation

Exploration

Try new actions to discover better strategies.

"Should I try a new restaurant?"

Exploitation

Use current knowledge to maximize reward.

"Should I go to my favorite restaurant?"

ε-Greedy Strategy

Balance exploration and exploitation.

With probability ε: explore (random action)
With probability 1-ε: exploit (best known action)
Typically start with a high ε and decrease it over time
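
A sketch of ε-greedy selection with a simple decay schedule (the Q table, episode loop, and decay constants are illustrative assumptions):

    import numpy as np

    def epsilon_greedy(Q, state, epsilon):
        """Q is an assumed (n_states, n_actions) table of Q-values."""
        if np.random.rand() < epsilon:
            return np.random.randint(Q.shape[1])    # explore: random action
        return int(np.argmax(Q[state]))             # exploit: best known action

    epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
    for episode in range(1000):
        # ... run one episode, picking actions with epsilon_greedy(Q, state, epsilon) ...
        epsilon = max(eps_min, epsilon * eps_decay)  # anneal exploration over time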

Applications

🎮 Game Playing: Chess, Go, Atari, Dota 2
🤖 Robotics: walking, grasping, navigation
🚗 Autonomous Vehicles: self-driving cars
💰 Finance: trading, portfolio optimization
🏭 Resource Management: data center cooling, energy
🎯 Recommendation: personalized content delivery

Challenges

⚠️ Sample inefficiency: needs many environment interactions
⚠️ Credit assignment: which action led to the reward?
⚠️ Sparse rewards: feedback is delayed or rare
⚠️ Exploration: finding good strategies is hard
⚠️ Stability: training can be unstable

Key Takeaway: RL learns through interaction and rewards. It's powerful for sequential decision-making but requires careful design of rewards and exploration strategies.