Overview
Build the foundational understanding of policy gradient methods from first principles. Derive the policy gradient theorem, implement REINFORCE, and understand the actor-critic paradigm.
Goal: Deeply understand why we can optimize policies directly by maximizing expected reward.
Key Concepts
📋 Concepts
The Policy Gradient Theorem
Core idea: Instead of learning Q-values (value-based RL), directly parameterize and optimize the policy \(\pi_\theta(a \mid s)\).
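Policy Gradient Theorem:
\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, R(\tau) \right] \]
Where:
- \(\theta\): Policy parameters
- \(\tau\): Trajectory (sequence of states, actions)
- \(R(\tau)\): Return (total reward)
- \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\): Expected return (the objective being maximized)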
Intuition: Increase probability of actions that led to high reward.
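The theorem follows from the log-derivative trick. A sketch of the derivation, assuming the objective is \(J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\) and writing \(p_\theta(\tau)\) for the probability of a trajectory, \(\rho(s_0)\) for the start-state distribution, and \(P(s_{t+1} \mid s_t, a_t)\) for the environment dynamics (these symbols are introduced here for illustration):

```latex
\begin{aligned}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau
     && \text{since } \nabla_\theta p_\theta = p_\theta\, \nabla_\theta \log p_\theta \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \nabla_\theta \log p_\theta(\tau)\, R(\tau) \Big] \\
  &= \mathbb{E}_{\tau \sim \pi_\theta}\Big[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \Big]
\end{aligned}
```

The last step uses \(\log p_\theta(\tau) = \log \rho(s_0) + \sum_t \big[\log \pi_\theta(a_t \mid s_t) + \log P(s_{t+1} \mid s_t, a_t)\big]\): the start-state and dynamics terms do not depend on \(\theta\), so their gradients vanish. This is why the estimator needs no model of the environment.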
REINFORCE Algorithm
The simplest policy gradient algorithm:
- Collect trajectories by running policy \(\pi_\theta\)
- Compute returns \(R(\tau)\) for each trajectory
- Update policy: \(\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau)\)
Problem: High variance! Single trajectory return is noisy.
Solution: Baselines and advantage estimation.
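The three steps can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a reference implementation: the corridor environment, the tabular softmax policy, and all names (`step`, `run_episode`, `reinforce`, the hyperparameter values) are assumptions made for this example. A neural-network policy on CartPole would follow the same loop.

```python
import math
import random

N_STATES = 4      # hypothetical corridor: states 0..3, state 3 is the goal
MAX_STEPS = 8     # episode length cap
ACTIONS = (0, 1)  # 0 = left, 1 = right


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Toy dynamics: move left or right; reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def run_episode(theta, rng):
    """Sample one trajectory from the current policy; return [(s, a)] and R(tau)."""
    state, trajectory, total_reward = 0, [], 0.0
    for _ in range(MAX_STEPS):
        probs = softmax(theta[state])
        action = 0 if rng.random() < probs[0] else 1
        next_state, reward, done = step(state, action)
        trajectory.append((state, action))
        total_reward += reward
        state = next_state
        if done:
            break
    return trajectory, total_reward


def reinforce(episodes=3000, alpha=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES - 1)]  # logits per non-goal state
    for _ in range(episodes):
        trajectory, ret = run_episode(theta, rng)
        # Accumulate sum_t grad log pi(a_t | s_t) before touching theta.
        grad = [[0.0, 0.0] for _ in theta]
        for state, action in trajectory:
            probs = softmax(theta[state])
            for a in ACTIONS:
                # For a softmax policy: d log pi(action|s) / d theta[s][a]
                grad[state][a] += (1.0 if a == action else 0.0) - probs[a]
        # Gradient ascent step, weighted by the whole-trajectory return R(tau).
        for s in range(len(theta)):
            for a in ACTIONS:
                theta[s][a] += alpha * grad[s][a] * ret
    return theta
```

After `theta = reinforce()`, `softmax(theta[0])[1]` is the learned probability of moving right from the start state. Because the return here is either 0 or 1, only successful episodes change the policy, which makes the noisiness of single-trajectory updates easy to observe.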
Variance Reduction: Baselines
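Key insight: Subtract a baseline \(b(s)\) from returns without biasing the gradient. The baseline term has zero expectation because \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a \mid s)\, b(s)] = b(s)\, \nabla_\theta \sum_a \pi_\theta(a \mid s) = b(s)\, \nabla_\theta 1 = 0\).
Common baseline: the value function \(V(s)\), which leads to actor-critic.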
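The effect of a baseline can be checked exactly on a small case. The sketch below assumes a hypothetical three-armed bandit with a softmax policy and deterministic rewards (all values invented for illustration), and computes the exact mean and variance of the single-sample gradient estimate for one logit, with and without a baseline equal to the expected reward.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(logits, rewards, baseline):
    """Exact mean and variance of g(a) = d log pi(a) / d logit_0 * (R(a) - b), a ~ pi."""
    probs = softmax(logits)
    # Score function for logit 0 under a softmax policy: 1[a == 0] - pi(0)
    samples = [
        ((1.0 if a == 0 else 0.0) - probs[0]) * (rewards[a] - baseline)
        for a in range(len(probs))
    ]
    mean = sum(p * g for p, g in zip(probs, samples))
    var = sum(p * (g - mean) ** 2 for p, g in zip(probs, samples))
    return mean, var


logits = [0.0, 0.0, 0.0]        # uniform policy over three arms
rewards = [10.0, 11.0, 12.0]    # hypothetical deterministic rewards
expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_plain, var_plain = estimator_stats(logits, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(logits, rewards, baseline=expected_reward)

print(mean_plain, mean_base)  # the two means agree up to rounding
print(var_plain, var_base)    # the baseline version has much lower variance
```

The rewards are deliberately offset from zero: a large constant offset inflates the variance of the plain estimator without changing which arm is best, and the baseline removes exactly that offset.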
Actor-Critic Architecture
Two networks:
- Actor: Policy (what to do)
- Critic: Value function (how good is this state)
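Advantage: \(A(s, a) = Q(s, a) - V(s)\), in practice often estimated by the TD error \(r + \gamma V(s') - V(s)\)
Update rule: \(\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a \mid s) \, A(s, a)\)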
Benefit: Lower variance than pure REINFORCE, faster learning.
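A minimal one-step actor-critic sketch, assuming a hypothetical corridor environment, a tabular softmax actor, and a tabular critic. The names `actor_critic`, `actor_lr`, `critic_lr`, and the constants are illustrative choices. The TD error stands in for the advantage, and both actor and critic are updated after every step rather than at the end of an episode.

```python
import math
import random

N_STATES = 4     # hypothetical corridor; state 3 is the terminal goal
MAX_STEPS = 20   # episode length cap
GAMMA = 0.9      # discount factor


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Toy dynamics: 0 = left, 1 = right; reward 1 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def actor_critic(episodes=2000, actor_lr=0.1, critic_lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [[0.0, 0.0] for _ in range(N_STATES - 1)]  # actor: logits per state
    values = [0.0] * N_STATES                          # critic: V(s); V(goal) stays 0
    for _ in range(episodes):
        state = 0
        for _ in range(MAX_STEPS):
            probs = softmax(theta[state])
            action = 0 if rng.random() < probs[0] else 1
            next_state, reward, done = step(state, action)

            # One-step TD error, used here as the advantage estimate.
            target = reward + (0.0 if done else GAMMA * values[next_state])
            td_error = target - values[state]

            # Critic update: move V(s) toward the bootstrapped target.
            values[state] += critic_lr * td_error

            # Actor update: policy gradient step weighted by the TD error.
            for a in (0, 1):
                grad_log_pi = (1.0 if a == action else 0.0) - probs[a]
                theta[state][a] += actor_lr * td_error * grad_log_pi

            if done:
                break
            state = next_state
    return theta, values
```

Calling `theta, values = actor_critic()` returns the learned logits and state values. Bootstrapping from `values[next_state]` is what lowers the variance relative to REINFORCE, at the cost of bias while the critic is still inaccurate.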
Generalized Advantage Estimation (GAE)
Problem: Bias-variance trade-off in advantage estimation.
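GAE solution: Exponentially weighted average of n-step advantages.
\[ \hat{A}_t^{\mathrm{GAE}(\gamma, \lambda)} = \sum_{l=0}^{\infty} (\gamma \lambda)^l \, \delta_{t+l} \]
Where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) (TD error).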
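Tuning \(\lambda\):
- \(\lambda = 0\): Low variance, high bias (TD learning)
- \(\lambda = 1\): High variance, low bias (Monte Carlo)
- \(\lambda\) slightly below 1 (0.95 is a common choice in practice): Good default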
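The exponentially weighted sum can be computed with a single backward pass over the TD errors. The sketch below assumes one uninterrupted trajectory segment, with `values` holding one more entry than `rewards` (the final entry is the bootstrap value of the state after the last step, 0 if that state is terminal). The function name, argument layout, default values, and sample numbers are illustrative.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Return GAE advantages; len(values) must equal len(rewards) + 1."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Recursion: A_t = delta_t + gamma * lambda * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.0]  # last entry: value after the final step

td_only = compute_gae(rewards, values, gamma=0.9, lam=0.0)      # one-step TD errors
monte_carlo = compute_gae(rewards, values, gamma=0.9, lam=1.0)  # discounted return minus V(s_t)
```

With `lam=0.0` each advantage is just the TD error at that step, and with `lam=1.0` it equals the discounted return minus the value estimate, matching the two extremes listed above. Handling several episodes in one buffer would additionally require resetting `running` at episode boundaries, which this sketch omits.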
Key Resources
📚 Essential Reading
REINFORCE Algorithm Tutorial (Substack)
https://substack.com/inbox/post/170790602
Clear walkthrough of the REINFORCE algorithm with derivations.
The RLHF Book
https://rlhfbook.com/
Comprehensive resource for RL in the context of LLM post-training. Covers policy gradients, PPO, and RLHF applications. Essential for applied RL.
📖 Books
Foundations of Deep Reinforcement Learning by Graesser & Keng
Practical, code-first approach to deep RL. Includes PyTorch implementations of REINFORCE, actor-critic, and PPO.
Learning Path
Phase 1: Theory (4 hours)
- Derive policy gradient theorem from scratch
- Read REINFORCE tutorial
- Work through actor-critic derivation
Phase 2: Implementation (4 hours)
- Implement REINFORCE on CartPole
- Add baseline (value function)
- Implement actor-critic
- Compare variance: REINFORCE vs actor-critic
Phase 3: Deep Dive (2 hours)
- Read RLHF Book chapter on policy gradients
- Implement GAE
- Understand connection to PPO (next node)
Common Pitfalls
❌ Forgetting the log: It’s \(\nabla_\theta \log \pi_\theta\), not \(\nabla_\theta \pi_\theta\).
❌ Biased baselines: A baseline is unbiased only if it does not depend on the action taken. It may depend on the state (or be a constant).
❌ Ignoring variance: High variance = slow/unstable learning. Always use baselines.
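❌ Wrong advantage signs: Positive advantage → increase action probability. When minimizing a loss with an optimizer, the loss is \(-\log \pi_\theta(a \mid s) \, A\), with the minus sign.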
Next Steps
- PPO from Scratch: The de facto standard policy gradient algorithm
- Preference Data & Eval Design: How to get reward signals from human preferences
Assessment Criteria
✅ You understand this node when you can:
- Derive the policy gradient theorem
- Implement REINFORCE from scratch
- Explain why baselines reduce variance without bias
- Code up an actor-critic agent
- Implement GAE and tune \(\lambda\)