Reinforcement Learning
Reinforcement Learning (RL) is learning through trial and error. An agent learns to make decisions by interacting with an environment, receiving rewards for good actions and penalties for bad ones.
Key Idea: Learn optimal behavior through experience, not from labeled examples. Like teaching a dog tricks with treats!
Core Concepts
The RL Loop
At each time step, the agent observes the environment's state, chooses an action, and receives a reward along with the next state. Repeating this loop many times, the agent learns a policy that maximizes cumulative reward.
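A minimal sketch of this loop, assuming a Gymnasium-style environment ("CartPole-v1") and a placeholder random agent; a real agent would replace the act and learn methods:

```python
import gymnasium as gym

class RandomAgent:
    """Placeholder agent: acts randomly and ignores experience."""
    def __init__(self, action_space):
        self.action_space = action_space
    def act(self, state):
        return self.action_space.sample()
    def learn(self, state, action, reward, next_state, done):
        pass  # a real agent would update its policy or value estimates here

env = gym.make("CartPole-v1")
agent = RandomAgent(env.action_space)

for episode in range(10):
    state, _ = env.reset()
    done = False
    while not done:
        action = agent.act(state)                              # agent chooses an action
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.learn(state, action, reward, next_state, done)   # learn from the outcome
        state = next_state
```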
Key Algorithms
Q-Learning
Learn the value of each state-action pair (its Q-value).
Update: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Model-free, off-policy learning
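A minimal tabular sketch of this update; the state/action sizes and hyperparameters are illustrative assumptions, not values from the text:

```python
import numpy as np

# Tabular Q-learning sketch with assumed sizes and hyperparameters.
n_states, n_actions = 16, 4
alpha, gamma = 0.1, 0.99                 # learning rate and discount factor
Q = np.zeros((n_states, n_actions))

def q_update(s, a, r, s_next, done):
    """Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]"""
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])
```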
Deep Q-Network (DQN)
Use neural network to approximate Q-function.
Experience replay + target network for stability
Breakthroughs: human-level Atari play (DQN); deep RL later powered AlphaGo's Go victories
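A compact sketch of one DQN training step under assumed network sizes and hyperparameters, showing where the replay buffer and target network come in:

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative DQN update step: experience replay + target network.
# Network sizes, buffer capacity, and hyperparameters are assumptions.
obs_dim, n_actions, gamma = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())       # target starts as a copy of q_net

optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
replay = deque(maxlen=10_000)                         # buffer of (s, a, r, s', done) tuples

def train_step(batch_size=32):
    if len(replay) < batch_size:
        return
    batch = random.sample(list(replay), batch_size)   # sample uncorrelated transitions
    s, a, r, s2, done = map(torch.tensor, zip(*batch))
    s, s2 = s.float(), s2.float()

    q_sa = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():                             # frozen target network stabilizes the target
        target = r.float() + gamma * target_net(s2).max(1).values * (1 - done.float())

    loss = nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every few thousand steps: target_net.load_state_dict(q_net.state_dict())
```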
Policy Gradient
Directly learn the policy (action probabilities) rather than a value function.
Works naturally with continuous action spaces
Examples: REINFORCE, PPO, A3C
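A minimal REINFORCE sketch for discrete actions; the policy network and learning rate are illustrative assumptions:

```python
import torch
import torch.nn as nn

# REINFORCE sketch: scale log-probabilities of taken actions by the return
# that followed them. Network shape and learning rate are assumptions.
obs_dim, n_actions, gamma = 4, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, rewards):
    """One policy-gradient step from a single completed episode."""
    # Discounted return-to-go G_t for every time step.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)

    logits = policy(torch.as_tensor(states, dtype=torch.float32))
    log_probs = torch.distributions.Categorical(logits=logits).log_prob(torch.tensor(actions))
    loss = -(log_probs * returns).mean()              # ascend the expected return

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```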
Actor-Critic
Combine value-based and policy-based methods.
Actor: learns the policy (which action to take)
Critic: learns the value function (how good is the state)
Best of both worlds: stable and efficient
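A one-step actor-critic sketch under the same illustrative assumptions: the critic's value estimate provides a baseline, and the actor takes a policy-gradient step on the resulting advantage:

```python
import torch
import torch.nn as nn

# One-step actor-critic sketch. Shapes and hyperparameters are assumptions.
obs_dim, n_actions, gamma = 4, 2, 0.99
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def ac_update(s, a, r, s_next, done):
    s = torch.as_tensor(s, dtype=torch.float32)
    s_next = torch.as_tensor(s_next, dtype=torch.float32)

    v = critic(s)                                     # critic: how good is this state?
    with torch.no_grad():
        target = r + gamma * critic(s_next) * (1.0 - float(done))
    advantage = (target - v).detach()                 # how much better than expected?

    log_prob = torch.distributions.Categorical(logits=actor(s)).log_prob(torch.tensor(a))
    actor_loss = -log_prob * advantage.squeeze()      # actor: policy-gradient step
    critic_loss = nn.functional.mse_loss(v, target)   # critic: regress toward the TD target

    loss = actor_loss + critic_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```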
Exploration vs Exploitation
Exploration
Try new actions to discover better strategies.
Exploitation
Use current knowledge to maximize reward.
ε-Greedy Strategy
Balance exploration and exploitation.
With probability ε: explore (try a random action)
With probability 1-ε: exploit (best known action)
Typically start with high ε, decrease over time
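A small sketch of ε-greedy selection with decay, assuming a tabular Q array like the one in the Q-learning sketch above:

```python
import numpy as np

# Epsilon-greedy action selection with decay over a tabular Q array.
epsilon, eps_min, eps_decay = 1.0, 0.05, 0.995
rng = np.random.default_rng()

def select_action(Q, state):
    global epsilon
    if rng.random() < epsilon:
        action = int(rng.integers(Q.shape[1]))       # explore: random action
    else:
        action = int(Q[state].argmax())              # exploit: best known action
    epsilon = max(eps_min, epsilon * eps_decay)      # decay epsilon over time
    return action
```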
Applications
Game playing (Atari, Go), robotics and control
Recommendation systems, resource allocation and scheduling
Challenges
Sample inefficiency: learning can require millions of interactions
Reward design: poorly specified rewards lead to unintended behavior
Balancing exploration and exploitation; training instability
Key Takeaway: RL learns through interaction and rewards. It's powerful for sequential decision-making but requires careful design of rewards and exploration strategies.