Reinforcement Learning

Reinforcement Learning (RL) is learning through trial and error. An agent learns to make decisions by interacting with an environment, receiving rewards for good actions and penalties for bad ones.

Key Idea: Learn optimal behavior through experience, not from labeled examples. Like teaching a dog tricks with treats!

Core Concepts

Agent: the learner/decision maker (e.g., game player, robot)
Environment: the world the agent interacts with
State: the current situation/configuration of the environment
Action: what the agent can do
Reward: feedback signal (positive or negative)

The RL Loop

1. Observe State: the agent sees the current state of the environment
2. Choose Action: the agent selects an action based on its policy
3. Receive Reward: the environment returns a reward signal (and the next state)
4. Update Policy: the agent learns to improve future decisions
5. Repeat: continue until the goal is achieved
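
To make the loop concrete, here is a minimal sketch of the agent-environment interaction using the gymnasium library's CartPole-v1 task (an assumed example, not part of the original text); the random action stands in for whatever learning algorithm fills the update step:

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    state, info = env.reset(seed=0)           # step 1: observe the initial state

    for _ in range(500):
        action = env.action_space.sample()    # step 2: choose an action (random placeholder policy)
        next_state, reward, terminated, truncated, info = env.step(action)  # step 3: receive reward
        # step 4: a learning agent would update its policy here using (state, action, reward, next_state)
        state = next_state
        if terminated or truncated:           # step 5: repeat; reset when the episode ends
            state, info = env.reset()
    env.close()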

Key Algorithms

Q-Learning

Learn the value of state-action pairs (Q-values).

Q(s,a) = expected total (discounted) reward from taking action a in state s and acting optimally thereafter
Update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') − Q(s,a)]

Model-free, off-policy learning
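
A minimal tabular Q-learning sketch in Python, assuming a small gymnasium-style environment with integer states (env, n_states, and n_actions are placeholders, not from the text):

    import numpy as np

    def q_learning(env, n_states, n_actions, episodes=1000,
                   alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = np.zeros((n_states, n_actions))          # table of Q(s, a) estimates
        for _ in range(episodes):
            state, _ = env.reset()
            done = False
            while not done:
                # epsilon-greedy action choice (see the exploration section below)
                if np.random.rand() < epsilon:
                    action = np.random.randint(n_actions)
                else:
                    action = int(np.argmax(Q[state]))
                next_state, reward, terminated, truncated, _ = env.step(action)
                done = terminated or truncated
                # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
                target = reward + gamma * np.max(Q[next_state]) * (not terminated)
                Q[state, action] += alpha * (target - Q[state, action])
                state = next_state
        return Q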

Deep Q-Network (DQN)

Use a neural network to approximate the Q-function.

Combines Q-learning with deep learning
Experience replay + target network for stability

Breakthrough: human-level play on many Atari games
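
A rough sketch of how the two stability tricks, experience replay and a target network, fit together in PyTorch (the network sizes, buffer size, and train_step helper are all illustrative assumptions, not from the text):

    import random
    from collections import deque
    import torch
    import torch.nn as nn

    class QNet(nn.Module):
        """Small fully connected Q-network; the 4-in / 2-out sizes are illustrative."""
        def __init__(self, n_inputs=4, n_actions=2):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(n_inputs, 64), nn.ReLU(),
                                     nn.Linear(64, n_actions))
        def forward(self, x):
            return self.net(x)

    q_net, target_net = QNet(), QNet()
    target_net.load_state_dict(q_net.state_dict())   # target net starts as a copy of q_net
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
    replay = deque(maxlen=10_000)                    # experience replay buffer of (s, a, r, s', done)

    def train_step(gamma=0.99, batch_size=32):
        if len(replay) < batch_size:
            return
        batch = random.sample(replay, batch_size)    # sampling breaks correlation between consecutive steps
        s, a, r, s2, done = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
        q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
        with torch.no_grad():                        # frozen target network gives stable regression targets
            target = r + gamma * target_net(s2).max(1).values * (1 - done)
        loss = nn.functional.mse_loss(q, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # every N steps, copy q_net's weights into target_net to refresh the targets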

Policy Gradient

Directly learn the policy (a mapping from states to action probabilities).

Optimize policy parameters to maximize expected reward
Works for continuous action spaces

Examples: REINFORCE, PPO, A3C
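
For intuition, here is a sketch of the REINFORCE objective, the simplest policy-gradient method (log_probs and returns are assumed to be collected over one episode; neither is defined in the text):

    import torch

    def reinforce_loss(log_probs, returns):
        """log_probs: list of log pi(a_t|s_t) tensors from the policy network;
        returns: list of discounted reward sums G_t for the same episode."""
        returns = torch.tensor(returns, dtype=torch.float32)
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # common variance-reduction trick
        # maximizing expected return == minimizing -sum_t log pi(a_t|s_t) * G_t
        return -(torch.stack(log_probs) * returns).sum()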

Actor-Critic

Combine value-based and policy-based methods.

Actor: learns the policy (what to do)
Critic: learns the value function (how good the current state is)

Best of both worlds: the critic's value estimates reduce the variance of the actor's updates, making training more stable and sample-efficient
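
A minimal sketch of how the two parts interact in a one-step actor-critic update (the function and its inputs are illustrative assumptions, not from the text):

    import torch

    def actor_critic_losses(log_prob, value, reward, next_value, gamma=0.99):
        """One-step actor-critic losses; log_prob comes from the actor,
        value and next_value from the critic (all assumed to be scalar tensors)."""
        td_target = reward + gamma * next_value.detach()  # critic's bootstrapped target
        advantage = td_target - value                     # how much better than expected the action was
        actor_loss = -log_prob * advantage.detach()       # push the policy toward advantageous actions
        critic_loss = advantage.pow(2)                     # regress the value estimate toward the target
        return actor_loss, critic_loss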

Exploration vs Exploitation

Exploration

Try new actions to discover better strategies.

"Should I try a new restaurant?"

Exploitation

Use current knowledge to maximize reward.

"Should I go to my favorite restaurant?"

ε-Greedy Strategy

Balance exploration and exploitation.

With probability ε: explore (random action)
With probability 1-ε: exploit (best known action)
Typically start with a high ε and decrease it over time
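
A sketch of ε-greedy selection with a simple decay schedule (the Q table, episode loop, and decay constants are illustrative assumptions):

    import numpy as np

    def epsilon_greedy(Q, state, epsilon):
        """Q is an assumed (n_states, n_actions) table of Q-values."""
        if np.random.rand() < epsilon:
            return np.random.randint(Q.shape[1])    # explore: random action
        return int(np.argmax(Q[state]))             # exploit: best known action

    epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
    for episode in range(1000):
        # ... run one episode, picking actions with epsilon_greedy(Q, state, epsilon) ...
        epsilon = max(eps_min, epsilon * eps_decay)  # anneal exploration over time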

Applications

🎮 Game Playing: Chess, Go, Atari, Dota 2
🤖 Robotics: walking, grasping, navigation
🚗 Autonomous Vehicles: self-driving cars
💰 Finance: trading, portfolio optimization
🏭 Resource Management: data center cooling, energy
🎯 Recommendation: personalized content delivery

Challenges

⚠️ Sample inefficiency: needs many environment interactions
⚠️ Credit assignment: which action led to the reward?
⚠️ Sparse rewards: feedback is delayed or rare
⚠️ Exploration: finding good strategies is hard
⚠️ Stability: training can be unstable

Key Takeaway: RL learns through interaction and rewards. It's powerful for sequential decision-making but requires careful design of rewards and exploration strategies.