Overview
Implement Proximal Policy Optimization (PPO) from scratch and build strong evaluation discipline. PPO is the workhorse of modern RLHF and the foundation for understanding post-training.
Goal: Build a production-grade PPO implementation you can trust and extend.
Key Concepts
📋 Concepts
Why PPO?
Problem with vanilla policy gradients: Large policy updates can catastrophically degrade performance.
PPO solution: Constrain policy updates to a “trust region” using a clipped objective.
The PPO Objective
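Clipped surrogate objective:
$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$
Where:
- $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$: Probability ratio (importance sampling)
- $\hat{A}_t$: Advantage estimate (from GAE)
- $\epsilon$: Clip range (typically 0.2)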
Intuition: Once the ratio moves above 1+ε or below 1-ε, it is clipped, so the objective stops rewarding further movement in that direction. Don’t let the policy change too much.
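To make the clipping concrete, here is a minimal framework-free sketch of the clipped objective. The function name `clipped_surrogate` and the list-of-floats interface are assumptions made for illustration; a real implementation would compute the same expression on tensors (e.g., in PyTorch) and minimize its negative.

```python
import math

def clipped_surrogate(new_logp, old_logp, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (a quantity to maximize).

    new_logp / old_logp: log-probabilities of the taken actions under the
    current policy and the rollout-time policy (hypothetical interface).
    """
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # r_t = pi_new / pi_old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Taking the min keeps the more pessimistic of the two estimates.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio of about 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = clipped_surrogate([math.log(0.6)], [math.log(0.4)], [2.0])

# Same ratio with a negative advantage: the unclipped (worse) term is kept.
uncapped = clipped_surrogate([math.log(0.6)], [math.log(0.4)], [-2.0])
```

Note the asymmetry: clipping only removes the incentive to move further in the direction the advantage favors, it never hides a penalty.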
Implementation Checklist
Core Components
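✅ Policy network ($\pi_\theta$): Output action probabilities
✅ Value network ($V_\phi$): Estimate state values
✅ Advantage computation (GAE with $\lambda$)
✅ Probability ratio: $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t)$
✅ Clipped objective: $\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)$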
✅ Value loss: MSE between predicted and actual returns
✅ Entropy bonus: Encourage exploration
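The advantage computation in the checklist is usually the least obvious component, so here is a rough sketch of GAE over a single rollout. The function name `compute_gae`, the argument layout, and the default `gamma=0.99` are assumptions for illustration (the discount factor is not specified in this guide); `dones[t]` is taken to mean the episode ended after step `t`.

```python
def compute_gae(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout of length T.

    values[t] is the value estimate for state t; last_value is the
    estimate for the state that follows the final step (bootstrap).
    """
    T = len(rewards)
    advantages = [0.0] * T
    gae = 0.0
    for t in reversed(range(T)):
        next_value = last_value if t == T - 1 else values[t + 1]
        not_done = 0.0 if dones[t] else 1.0  # do not bootstrap across episode ends
        delta = rewards[t] + gamma * next_value * not_done - values[t]  # TD error
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # Value-function regression targets: advantage plus baseline.
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

The returned `returns` list can serve as the regression target for the value loss, and `advantages` feeds the clipped objective.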
Training Loop
- Collect rollouts using current policy
- Compute advantages using GAE
- Multiple epochs of minibatch updates (e.g., 4 epochs)
- Update policy with clipped objective
- Update value function with MSE loss
- Log metrics: KL divergence, clip fraction, explained variance
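The "multiple epochs of minibatch updates" step can be sketched as an index generator like the one below. The name `minibatch_indices` and its parameters are hypothetical; the gradient step itself is left out because it depends on the framework you use.

```python
import random

def minibatch_indices(batch_size, minibatch_size, n_epochs, seed=0):
    """Yield (epoch, index_list) pairs that cover the rollout once per epoch."""
    rng = random.Random(seed)
    indices = list(range(batch_size))
    for epoch in range(n_epochs):
        rng.shuffle(indices)  # fresh shuffle so minibatches differ across epochs
        for start in range(0, batch_size, minibatch_size):
            yield epoch, indices[start:start + minibatch_size]

# Usage sketch: one rollout of 2048 steps, 4 epochs, minibatches of 64.
n_updates = 0
for epoch, idx in minibatch_indices(2048, 64, 4):
    # Here you would gather the stored samples at `idx`, compute the clipped
    # policy loss and the value loss, and take one optimizer step.
    n_updates += 1
```

Every sample is reused once per epoch, which is why the stored rollout-time log-probabilities must stay fixed while the policy changes underneath them.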
KL Penalty vs Clipping
Two ways to constrain policy updates:
- Clipping (PPO-Clip): Hard constraint via clipping
- KL Penalty (PPO-KL): Soft constraint via penalty term
In practice: PPO-Clip is simpler and works just as well. Use that.
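For comparison, the KL-penalty variant can be sketched as follows. The doubling/halving rule with the factor 1.5 follows the adaptive-penalty heuristic described in the original PPO paper; the function names and the default `kl_target=0.01` are assumptions for illustration.

```python
def kl_penalized_objective(surrogate, kl, beta):
    """Unclipped surrogate minus a KL penalty (a quantity to maximize)."""
    return surrogate - beta * kl

def adapt_beta(beta, kl, kl_target=0.01):
    """Adjust the penalty weight after each policy update."""
    if kl < kl_target / 1.5:
        return beta / 2.0  # policy barely moved: loosen the penalty
    if kl > kl_target * 1.5:
        return beta * 2.0  # policy moved too far: tighten the penalty
    return beta
```

The extra moving part (`beta` and its target) is one reason the clipped version is often the simpler choice.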
Hyperparameter Tuning
Critical hyperparameters:
- clip_eps (ε): 0.1–0.3 (default 0.2)
- GAE lambda (λ): 0.9–0.99 (default 0.95)
- Learning rate: 3e-4 (tune with lr schedule)
- Minibatch size: 64–256
- Number of epochs: 3–10 (watch for overfitting)
- Entropy coefficient: 0.01 (decay over training)
Tuning strategy: Start with defaults, watch KL divergence and clip fraction.
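One way to keep these defaults in a single place is a small config object, sketched below. The class name `PPOConfig`, the field names, and the linear shape of the decay are assumptions for illustration; the guide only says to decay the entropy coefficient and to use some learning-rate schedule.

```python
from dataclasses import dataclass

@dataclass
class PPOConfig:
    clip_eps: float = 0.2          # listed range 0.1-0.3
    gae_lambda: float = 0.95       # listed range 0.9-0.99
    learning_rate: float = 3e-4
    minibatch_size: int = 64       # listed range 64-256
    n_epochs: int = 4              # listed range 3-10
    entropy_coef: float = 0.01

    def out_of_range(self):
        """Return the names of fields outside the ranges listed above."""
        ranges = {
            "clip_eps": (0.1, 0.3),
            "gae_lambda": (0.9, 0.99),
            "minibatch_size": (64, 256),
            "n_epochs": (3, 10),
        }
        return [name for name, (lo, hi) in ranges.items()
                if not lo <= getattr(self, name) <= hi]

def linear_decay(initial, progress):
    """Anneal a coefficient linearly to 0 as progress goes from 0 to 1."""
    return initial * max(0.0, 1.0 - progress)

config = PPOConfig()
entropy_coef_midway = linear_decay(config.entropy_coef, progress=0.5)
```

Values outside the listed ranges are not necessarily wrong, so `out_of_range` is meant as a prompt to double-check, not as validation.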
Evaluation Discipline
Metrics to Track
✅ Episode return (mean, std, min, max)
✅ KL divergence (should be small, less than 0.05)
✅ Clip fraction (what % of updates were clipped)
✅ Explained variance (how well V predicts returns)
✅ Entropy (should decay slowly)
✅ Policy loss, value loss
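Three of these metrics are easy to get subtly wrong, so here is a hedged sketch of how they might be computed from logged quantities. The function names are hypothetical, and the KL value is a simple sample estimate (mean of old minus new log-probability over actions sampled from the old policy), not an exact divergence.

```python
import math
import statistics

def approx_kl(old_logp, new_logp):
    """Sample estimate of KL(old || new) from actions drawn under the old policy."""
    return sum(o - n for o, n in zip(old_logp, new_logp)) / len(old_logp)

def clip_fraction(old_logp, new_logp, clip_eps=0.2):
    """Fraction of samples whose ratio falls outside [1 - eps, 1 + eps]."""
    clipped = sum(1 for o, n in zip(old_logp, new_logp)
                  if abs(math.exp(n - o) - 1.0) > clip_eps)
    return clipped / len(old_logp)

def explained_variance(values, returns):
    """1 - Var(returns - values) / Var(returns); 1.0 is a perfect fit."""
    var_returns = statistics.pvariance(returns)
    if var_returns == 0:
        return float("nan")  # undefined when returns are constant
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - statistics.pvariance(residuals) / var_returns
```

A negative explained variance means the value predictions are doing worse than predicting the mean return.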
Debugging Checklist
❌ KL divergence exploding? Reduce learning rate or clip_eps
❌ Clip fraction = 1? Policy changing too fast, reduce LR
❌ Explained variance negative? Value function broken, check value loss
❌ Entropy going to 0 too fast? Increase entropy coefficient
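The checklist above can be turned into an automated check that runs after every update, roughly as sketched here. The function name `diagnose`, the metric keys, and the thresholds `clip_limit=0.99` and `entropy_floor=0.01` are illustrative assumptions; only the 0.05 KL limit comes from the metrics list in this guide.

```python
def diagnose(metrics, kl_limit=0.05, clip_limit=0.99, entropy_floor=0.01):
    """Map logged metrics to the suggestions from the debugging checklist."""
    hints = []
    if metrics["kl"] > kl_limit:
        hints.append("KL too high: reduce learning rate or clip_eps")
    if metrics["clip_fraction"] >= clip_limit:
        hints.append("Nearly every sample clipped: reduce learning rate")
    if metrics["explained_variance"] < 0:
        hints.append("Value function worse than a constant: check value loss")
    if metrics["entropy"] < entropy_floor:
        hints.append("Entropy collapsed: increase entropy coefficient")
    return hints

healthy = {"kl": 0.01, "clip_fraction": 0.1, "explained_variance": 0.8, "entropy": 0.5}
unstable = dict(healthy, kl=0.2, explained_variance=-0.3)
warnings = diagnose(unstable)
```

A fixed entropy floor is a crude stand-in for "going to 0 too fast"; tracking the rate of change over updates would match the checklist more closely.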
GRPO: Group Relative Policy Optimization
Recent variant: Instead of comparing to a value function baseline, use group-wise relative rewards.
Idea: Sample a group of responses for the same prompt, then normalize each reward by the group's mean and standard deviation. The normalized reward serves as the advantage.
Benefit: Simpler (no value function), more stable for some tasks.
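A minimal sketch of the group-relative advantage, assuming one scalar reward per sampled response in a group. The function name and the small `eps` added to the denominator are assumptions for illustration.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group to zero mean and unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Four responses to the same prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all rewards in a group are identical, every advantage is zero and that group contributes no learning signal.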
Key Resources
📚 Essential Reading
PPO and GRPO Comparison (Yugeten)
https://yugeten.github.io/posts/2025/01/ppogrpo/
Deep dive into PPO implementation details and comparison with GRPO. Must-read for implementation.
The RLHF Book
https://rlhfbook.com/
Chapters on PPO for LLM post-training with code examples.
📖 Books
Foundations of Deep Reinforcement Learning by Graesser & Keng
Chapter on PPO with PyTorch implementation.
Learning Path
Phase 1: Understand the Theory (4 hours)
- Review importance sampling & probability ratios
- Derive clipped objective from first principles
- Read Yugeten PPO/GRPO post
Phase 2: Implement (8 hours)
- Implement PPO on CartPole or MuJoCo
- Track all key metrics (KL, clip fraction, explained variance)
- Debug until you get smooth learning curves
- Compare to baseline implementation (e.g., Stable-Baselines3)
Phase 3: Deep Dive (3 hours)
- Implement GRPO variant
- Compare PPO vs GRPO on same task
- Read RLHF Book chapters on LLM post-training with PPO
Common Pitfalls
❌ Not normalizing advantages: Always normalize per-batch.
❌ Too many epochs: Repeated updates on the same rollout batch overfit to it and drive KL divergence up.
❌ Forgetting old policy probabilities: Store log probs from rollouts!
❌ Wrong advantage signs: Double-check your advantage computation.
❌ Ignoring clip fraction: If it’s 0 or 1, something’s wrong.
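The "forgetting old policy probabilities" pitfall is easiest to avoid by recording the log-probability at collection time, alongside everything else. The sketch below assumes a hypothetical `RolloutBuffer` class; the field names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class RolloutBuffer:
    observations: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    rewards: list = field(default_factory=list)
    dones: list = field(default_factory=list)
    values: list = field(default_factory=list)
    # log pi_old(a_t | s_t), frozen when the action is sampled.
    old_logp: list = field(default_factory=list)

    def add(self, obs, action, reward, done, value, logp):
        """Store one transition together with its rollout-time log-prob."""
        self.observations.append(obs)
        self.actions.append(action)
        self.rewards.append(reward)
        self.dones.append(done)
        self.values.append(value)
        self.old_logp.append(logp)

    def __len__(self):
        return len(self.rewards)

buffer = RolloutBuffer()
buffer.add(obs=[0.0, 0.1], action=1, reward=1.0, done=False, value=0.5, logp=-0.69)
```

Recomputing `old_logp` from the current network during the update epochs would make every ratio equal to 1 and silently disable clipping.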
Next Steps
- DPO-Family Competence: Modern alternatives to PPO (offline preference optimization)
- Online Loops & Stability: Iterative PPO training with reward model updates
Assessment Criteria
✅ You understand this node when you can:
- Implement PPO from scratch with all bells and whistles
- Explain why clipping constrains policy updates
- Tune hyperparameters based on logged metrics
- Debug training instabilities (KL explosion, clip fraction issues)
- Compare PPO to GRPO and articulate trade-offs