PPO from Scratch

Overview

Implement Proximal Policy Optimization (PPO) from scratch and build strong evaluation discipline. PPO is the workhorse of modern RLHF and the foundation for understanding post-training.

Goal: Build a production-grade PPO implementation you can trust and extend.

PPO Clipped Surrogate Objective
Importance Sampling & Probability Ratios
KL Penalty vs Clipping
Full PPO Implementation
Hyperparameter Tuning (clip_eps, GAE lambda)
Evaluation Discipline & Debugging
GRPO: Group Relative Policy Optimization

Why PPO?

Problem with vanilla policy gradients: Large policy updates can catastrophically degrade performance.

PPO solution: Constrain policy updates to a “trust region” using a clipped objective.

The PPO Objective

Clipped surrogate objective:

$$L^{CLIP}(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) \hat{A}_t, \ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t \right) \right]$$

Where:

$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ : Probability ratio (importance sampling)
$\hat{A}_t$ : Advantage estimate (from GAE)
$\epsilon$ : Clip range (typically 0.2)

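Intuition: If the ratio rises above 1+ε or falls below 1-ε, it is clipped, and taking the minimum with the unclipped term removes any incentive to push the policy further in that direction. Don’t let the policy change too much in one update.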
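The clipped objective can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the log-probabilities and advantages for a batch are already available as lists of floats; the function name `clipped_surrogate` is chosen here for illustration, and a real implementation would use a tensor library with autograd.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean PPO-Clip objective over a batch (a quantity to maximize)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # r_t = pi_new / pi_old, computed from log-probabilities for stability
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)

# Ratio 1.5 with a positive advantage: the gain is capped at (1 + eps) * A.
capped = clipped_surrogate([math.log(1.5)], [0.0], [1.0])
# Ratio 1.5 with a negative advantage: the unclipped term is lower, so it is kept.
uncapped = clipped_surrogate([math.log(1.5)], [0.0], [-1.0])
```

Note the asymmetry: clipping only removes the reward for moving further away from the old policy, never the penalty for a move that made things worse.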

Implementation Checklist

Core Components

✅ Policy network ( $\pi_\theta$ ): Output action probabilities
✅ Value network ( $V_\phi$ ): Estimate state values
✅ Advantage computation (GAE with $\lambda = 0.95$ )
✅ Probability ratio: $r_t = \pi_{\text{new}} / \pi_{\text{old}}$
✅ Clipped objective: $\min(r \cdot A, \text{clip}(r, 1-\epsilon, 1+\epsilon) \cdot A)$
✅ Value loss: MSE between predicted and actual returns
✅ Entropy bonus: Encourage exploration
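Of the components above, the advantage computation is the easiest to get subtly wrong. The following is a minimal sketch of GAE over one rollout, assuming plain Python lists; the discount `gamma=0.99` and the names `compute_gae`, `last_value`, and `dones` are assumptions made for this example and do not come from the checklist.

```python
def compute_gae(rewards, values, last_value, dones, gamma=0.99, lam=0.95):
    """Return (advantages, returns) for one rollout.

    values[t] is V(s_t); last_value is V of the state after the final step
    and is used to bootstrap. dones[t] is True if the episode ended at step t.
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    next_value = last_value
    for t in reversed(range(n)):
        # Do not bootstrap or accumulate across an episode boundary
        not_done = 0.0 if dones[t] else 1.0
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
        next_value = values[t]
    # Value targets: advantage plus the baseline it was measured against
    returns = [a + v for a, v in zip(advantages, values)]
    return advantages, returns
```

The `returns` list is what the value loss regresses against. With `lam=0` the advantages reduce to one-step TD errors, and with `lam=1` they become Monte Carlo returns minus the value baseline.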

Training Loop

Collect rollouts using current policy
Compute advantages using GAE
Multiple epochs of minibatch updates (e.g., 4 epochs)
Update policy with clipped objective
Update value function with MSE loss
Log metrics: KL divergence, clip fraction, explained variance
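The "multiple epochs of minibatch updates" step amounts to reshuffling the same rollout several times. A minimal sketch of that index bookkeeping, assuming the rollout holds `n_samples` transitions; the helper name `minibatch_indices` and its `seed` parameter are hypothetical.

```python
import random

def minibatch_indices(n_samples, batch_size, n_epochs, seed=0):
    """Yield (epoch, indices) pairs; each epoch visits every sample once."""
    rng = random.Random(seed)
    for epoch in range(n_epochs):
        order = list(range(n_samples))
        rng.shuffle(order)  # fresh shuffle per epoch
        for start in range(0, n_samples, batch_size):
            yield epoch, order[start:start + batch_size]

# Each yielded index list selects the transitions for one policy/value update.
for epoch, idx in minibatch_indices(n_samples=10, batch_size=4, n_epochs=2):
    pass  # compute losses on the transitions at idx, then take a gradient step
```

The old-policy log-probabilities and advantages are computed once per rollout and stay fixed across all epochs; only the current policy's log-probabilities are recomputed in each minibatch.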

KL Penalty vs Clipping

Two ways to constrain policy updates:

Clipping (PPO-Clip): Hard constraint via clipping
KL Penalty (PPO-KL): Soft constraint via penalty term

$$L^{KL}(\theta) = \mathbb{E}_t \left[ r_t(\theta) \hat{A}_t - \beta \cdot \text{KL}(\pi_{\theta_{\text{old}}} || \pi_\theta) \right]$$

In practice: PPO-Clip is simpler (there is no penalty coefficient β to tune) and typically works at least as well. Use it as the default.
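For comparison, the KL-penalized objective can be sketched for a discrete action space. This is an illustration only, assuming both policies assign non-zero probability to every action; the function names and the default `beta=1.0` are assumptions made for the example.

```python
import math

def categorical_kl(p_old, p_new):
    """KL(p_old || p_new) for two discrete distributions over the same actions.

    Assumes every entry of p_new is strictly positive.
    """
    return sum(p * math.log(p / q) for p, q in zip(p_old, p_new) if p > 0.0)

def kl_penalized_objective(ratios, advantages, kls, beta=1.0):
    """Mean of r_t * A_t - beta * KL_t over a batch (a quantity to maximize)."""
    terms = [r * a - beta * kl for r, a, kl in zip(ratios, advantages, kls)]
    return sum(terms) / len(terms)

# One state: the new policy shifts probability toward the second action.
kl = categorical_kl([0.5, 0.5], [0.25, 0.75])
objective = kl_penalized_objective(ratios=[1.5], advantages=[1.0], kls=[kl])
```

Unlike clipping, the penalty never switches the gradient off; it only makes large moves more expensive, and how expensive depends on `beta`.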

Hyperparameter Tuning

Critical hyperparameters:

clip_eps (ε): 0.1–0.3 (default 0.2)
GAE lambda (λ): 0.9–0.99 (default 0.95)
Learning rate: 3e-4 (tune with lr schedule)
Minibatch size: 64–256
Number of epochs: 3–10 (watch for overfitting)
Entropy coefficient: 0.01 (decay over training)

Tuning strategy: Start with defaults, watch KL divergence and clip fraction.
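One way to keep these defaults in a single place is a small config object. The sketch below assumes Python dataclasses; the class name `PPOConfig`, the field names, and the `out_of_range` helper are hypothetical, while the values and ranges are the ones listed above (with 4 epochs and a minibatch size of 64 picked from within the stated ranges).

```python
from dataclasses import dataclass

# Suggested ranges from the list above, used only to flag unusual settings
SUGGESTED_RANGES = {
    "clip_eps": (0.1, 0.3),
    "gae_lambda": (0.9, 0.99),
    "minibatch_size": (64, 256),
    "n_epochs": (3, 10),
}

@dataclass
class PPOConfig:
    clip_eps: float = 0.2
    gae_lambda: float = 0.95
    learning_rate: float = 3e-4
    minibatch_size: int = 64
    n_epochs: int = 4
    entropy_coef: float = 0.01

    def out_of_range(self):
        """Names of fields that fall outside the suggested ranges."""
        return [name for name, (lo, hi) in SUGGESTED_RANGES.items()
                if not lo <= getattr(self, name) <= hi]

config = PPOConfig()
unusual = PPOConfig(clip_eps=0.5).out_of_range()  # flags clip_eps
```

Values outside the ranges are not necessarily wrong, so the helper only reports them rather than raising an error.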

Evaluation Discipline

Metrics to Track

✅ Episode return (mean, std, min, max)
✅ KL divergence (should be small, less than 0.05)
✅ Clip fraction (what % of updates were clipped)
✅ Explained variance (how well V predicts returns)
✅ Entropy (should decay slowly)
✅ Policy loss, value loss
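Three of these metrics are cheap to compute from quantities the update already has. A minimal sketch, assuming per-sample log-probabilities under the old and new policy, plus value predictions and returns, as plain lists; the function names are chosen for this example. The KL here is the simple sample estimate (mean log-ratio over actions drawn from the old policy), which can come out slightly negative on a small batch.

```python
import math

def approx_kl(logp_old, logp_new):
    """Sample estimate of KL(pi_old || pi_new): mean of (logp_old - logp_new)."""
    return sum(o - n for o, n in zip(logp_old, logp_new)) / len(logp_old)

def clip_fraction(logp_old, logp_new, clip_eps=0.2):
    """Share of samples whose ratio falls outside [1 - eps, 1 + eps]."""
    outside = sum(abs(math.exp(n - o) - 1.0) > clip_eps
                  for o, n in zip(logp_old, logp_new))
    return outside / len(logp_old)

def _variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def explained_variance(values, returns):
    """1 - Var(returns - values) / Var(returns); 1.0 means perfect prediction."""
    var_returns = _variance(returns)
    if var_returns == 0.0:
        return float("nan")  # undefined when the returns do not vary
    residuals = [r - v for r, v in zip(returns, values)]
    return 1.0 - _variance(residuals) / var_returns
```

An explained variance of 0 means the value function does no better than predicting a constant, and a negative value means it does worse.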

Debugging Checklist

❌ KL divergence exploding? Reduce learning rate or clip_eps
❌ Clip fraction = 1? Policy changing too fast, reduce LR
❌ Explained variance negative? Value function broken, check value loss
❌ Entropy going to 0 too fast? Increase entropy coefficient
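The checklist can be turned into an automatic check that runs after every update. This is a rough sketch: the function name `diagnose` and the thresholds `clip_frac_limit` and `entropy_floor` are hypothetical values picked for illustration, and only the 0.05 KL limit comes from the metrics list. The entropy check is a crude stand-in that only fires once entropy has already collapsed, so tracking its trend over time remains necessary.

```python
def diagnose(kl, clip_frac, explained_var, entropy,
             kl_limit=0.05, clip_frac_limit=0.99, entropy_floor=0.01):
    """Return a list of hints matching the debugging checklist."""
    hints = []
    if kl > kl_limit:
        hints.append("KL too high: reduce learning rate or clip_eps")
    if clip_frac >= clip_frac_limit:
        hints.append("Nearly every sample clipped: reduce learning rate")
    if explained_var < 0.0:
        hints.append("Negative explained variance: check the value loss")
    if entropy < entropy_floor:
        hints.append("Entropy collapsed: increase the entropy coefficient")
    return hints

# A healthy update produces no hints.
healthy = diagnose(kl=0.01, clip_frac=0.1, explained_var=0.8, entropy=1.0)
```

Logging the returned hints alongside the raw metrics makes it easier to see which update a problem first appeared in.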