Overview

Build the foundational understanding of policy gradient methods from first principles. Derive the policy gradient theorem, implement REINFORCE, and understand the actor-critic paradigm.

Goal: Deeply understand why we can optimize policies directly by maximizing expected reward.

Key Concepts

The Policy Gradient Theorem

Core idea: Instead of learning Q-values (value-based RL), directly parameterize and optimize the policy \(\pi_\theta(a|s)\).

Policy Gradient Theorem:

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R(\tau) \right] \]

Where:

  • \(\theta\): Policy parameters
  • \(\tau\): Trajectory (sequence of states and actions)
  • \(R(\tau)\): Return (total reward)
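For reference, a sketch of where the log comes from. It assumes the trajectory distribution \(p_\theta(\tau)\) factors into an initial-state distribution \(\rho_0\), the policy, and environment dynamics \(P\) that do not depend on \(\theta\). The symbols \(p_\theta\), \(\rho_0\), and \(P\) are notation introduced here for the sketch, not taken from the text above.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \int p_\theta(\tau)\, R(\tau)\, d\tau \\
\nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\, R(\tau)\, d\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
\log p_\theta(\tau) &= \log \rho_0(s_0) + \sum_{t=0}^T \left[ \log \pi_\theta(a_t | s_t) + \log P(s_{t+1} | s_t, a_t) \right] \\
\nabla_\theta \log p_\theta(\tau) &= \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t)
\end{aligned}
```

Substituting the last line into the second gives the theorem. The initial-state and dynamics terms drop out because they carry no dependence on \(\theta\), which is why the gradient can be estimated from sampled trajectories without a model of the environment.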

Intuition: Increase probability of actions that led to high reward.
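The theorem can be checked numerically in the simplest setting. The sketch below assumes a one-step problem (a bandit) with a softmax policy over a handful of actions, so the expectation can be summed exactly instead of sampled. The function names and the reward values are hypothetical, chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_reward(theta, rewards):
    """J(theta) = sum_a pi_theta(a) * r(a) for a one-step problem."""
    return sum(p * r for p, r in zip(softmax(theta), rewards))


def score_function_gradient(theta, rewards):
    """E_a[ grad log pi(a) * r(a) ], computed exactly by summing over actions."""
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a, r in enumerate(rewards):
        for i in range(len(theta)):
            # d/d theta_i of log softmax(theta)[a] is 1[i == a] - pi(i)
            dlogp = (1.0 if i == a else 0.0) - probs[i]
            grad[i] += probs[a] * dlogp * r
    return grad


def finite_difference_gradient(theta, rewards, eps=1e-6):
    """Central finite differences on J(theta), used as an independent reference."""
    grad = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        diff = expected_reward(up, rewards) - expected_reward(down, rewards)
        grad.append(diff / (2 * eps))
    return grad


theta = [0.2, -0.1, 0.5]   # hypothetical policy logits
rewards = [1.0, 0.0, 2.0]  # hypothetical per-action rewards
print(score_function_gradient(theta, rewards))
print(finite_difference_gradient(theta, rewards))
```

The two printed vectors should agree to within finite-difference error. In a real task the sum over actions is replaced by an average over sampled trajectories, which is where the variance discussed later comes from.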

REINFORCE Algorithm

The simplest policy gradient algorithm:

  1. Collect trajectories by running the policy \(\pi_\theta\)
  2. Compute the return \(R(\tau)\) for each trajectory
  3. Update the policy: \(\theta \leftarrow \theta + \alpha \sum_t \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot R(\tau)\)
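The three steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: it assumes a hypothetical single-state toy task (two actions, reward 1 for action 1, fixed horizon), so the policy is just a pair of logits and a trajectory is a list of (action, reward) pairs. All function names are made up for this example.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_softmax(logits, action):
    """Gradient of log softmax(logits)[action] w.r.t. the logits."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def run_episode(theta, rng, horizon=5):
    """Step 1: collect one trajectory from the toy task (assumed environment)."""
    trajectory = []
    for _ in range(horizon):
        probs = softmax(theta)
        action = 0 if rng.random() < probs[0] else 1
        reward = 1.0 if action == 1 else 0.0
        trajectory.append((action, reward))
    return trajectory


def reinforce_update(theta, trajectory, alpha):
    """Steps 2 and 3: theta += alpha * R(tau) * sum_t grad log pi(a_t)."""
    total_return = sum(r for _, r in trajectory)  # undiscounted R(tau)
    grad = [0.0] * len(theta)
    for action, _ in trajectory:
        for i, g in enumerate(grad_log_softmax(theta, action)):
            grad[i] += g
    return [th + alpha * total_return * g for th, g in zip(theta, grad)]


def train(num_episodes=500, alpha=0.05, seed=0):
    """Repeat collect-and-update for a number of episodes."""
    rng = random.Random(seed)
    theta = [0.0, 0.0]
    for _ in range(num_episodes):
        theta = reinforce_update(theta, run_episode(theta, rng), alpha)
    return theta
```

Calling `softmax(train())` gives the learned action probabilities. Note that every step in an episode is scaled by the same whole-trajectory return, so a lucky or unlucky episode moves all of its actions together.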

Problem: High variance. The return of a single trajectory is a noisy estimate of the expected return.

Solution: Baselines and advantage estimation.

Variance Reduction: Baselines

Key insight: Subtracting a state-dependent baseline \(b(s_t)\) from the return reduces variance without biasing the gradient.

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t | s_t) \cdot (R(\tau) - b(s_t)) \right] \]

Common baseline: The value function \(V(s_t)\), which leads to actor-critic methods.
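The baseline adds no bias because \(\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = \nabla_\theta \sum_a \pi_\theta(a|s) = \nabla_\theta 1 = 0\), so any term that does not depend on the action averages out. The sketch below illustrates this on a hypothetical two-action, one-step problem where mean and variance can be computed exactly. In a one-step problem the value-function baseline reduces to the expected reward; the rewards and function names are assumptions made for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def estimator_stats(logits, rewards, baseline):
    """Exact mean and variance of the single-sample gradient estimate
    grad log pi(a) * (r(a) - baseline), for the component w.r.t. logit 0."""
    probs = softmax(logits)
    # Value the estimator takes for each possible sampled action
    samples = [
        ((1.0 if a == 0 else 0.0) - probs[0]) * (r - baseline)
        for a, r in enumerate(rewards)
    ]
    mean = sum(p * x for p, x in zip(probs, samples))
    var = sum(p * (x - mean) ** 2 for p, x in zip(probs, samples))
    return mean, var


logits = [0.0, 0.0]     # uniform policy over two actions
rewards = [10.0, 12.0]  # hypothetical rewards with a large shared offset
expected_reward = sum(p * r for p, r in zip(softmax(logits), rewards))

mean_plain, var_plain = estimator_stats(logits, rewards, baseline=0.0)
mean_base, var_base = estimator_stats(logits, rewards, baseline=expected_reward)
print(mean_plain, var_plain)
print(mean_base, var_base)
```

For this toy case the mean is -0.5 with or without the baseline, while the variance drops from 30.25 to 0. Real problems rarely see the variance vanish entirely, but the shared offset in the returns is removed in the same way.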

Actor-Critic Architecture

Two networks:

  • Actor: Policy \(\pi_\theta(a|s)\) (what to do)
  • Critic: Value function \(V_\phi(s)\) (how good is this state)

Advantage: \(A(s, a) = Q(s, a) - V(s)\)

Update rule:

\[ \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a|s) \cdot A(s, a) \]

Benefit: Lower variance than pure REINFORCE, which usually means faster learning, at the cost of some bias from the learned critic.
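A minimal tabular sketch of one actor-critic update from a single transition. It uses the one-step TD error \(r + \gamma V(s') - V(s)\) as the advantage estimate, which is one common choice rather than the only one. The table layout, the step sizes `alpha` and `beta`, and the function name are assumptions made for this example; a deep RL version would replace the tables with two networks.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_step(theta, values, s, a, r, s_next, done,
                      alpha=0.1, beta=0.5, gamma=0.99):
    """Update actor and critic tables in place from one transition.

    theta[s] holds the action logits for state s (the actor).
    values[s] holds the critic's estimate of V(s).
    Returns the TD error that was used as the advantage estimate.
    """
    # Critic target: bootstrap from V(s_next) unless the episode ended
    target = r if done else r + gamma * values[s_next]
    td_error = target - values[s]

    # Actor: theta <- theta + alpha * grad log pi(a|s) * advantage
    probs = softmax(theta[s])
    for i in range(len(theta[s])):
        grad_log_pi = (1.0 if i == a else 0.0) - probs[i]
        theta[s][i] += alpha * td_error * grad_log_pi

    # Critic: move V(s) toward the target
    values[s] += beta * td_error
    return td_error


# Usage on a hypothetical two-state, two-action problem
theta = [[0.0, 0.0], [0.0, 0.0]]
values = [0.0, 0.0]
delta = actor_critic_step(theta, values, s=0, a=1, r=1.0, s_next=1, done=False)
print(delta, theta[0], values[0])
```

A positive TD error raises the probability of the action that was taken, and a negative one lowers it, which matches the sign convention of the update rule above.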

Generalized Advantage Estimation (GAE)

Problem: Bias-variance trade-off in advantage estimation.

GAE solution: Exponentially weighted average of n-step advantages.

\[ \hat{A}_t = \sum_{l=0}^\infty (\gamma \lambda)^l \delta_{t+l} \]

Where \(\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)\) is the TD error.

Tuning \(\lambda\):

  • \(\lambda = 0\): Low variance, high bias (TD learning)
  • \(\lambda = 1\): High variance, low bias (Monte Carlo)
  • \(\lambda = 0.95\): Good default
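In practice the sum is usually computed backwards over a finite rollout using the recursion \(\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}\). The sketch below assumes a single episode segment with no resets in the middle, and a hypothetical `last_value` argument holding \(V\) of the state after the final step (0 if the episode terminated there).

```python
def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one rollout segment.

    rewards[t] and values[t] are r_t and V(s_t) for t = 0..T-1.
    last_value is V(s_T), the bootstrap value after the final step.
    Returns a list of advantage estimates, one per time step.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0  # holds A_hat_{t+1} while processing step t
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # TD error
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0]

# lam = 1 recovers Monte Carlo returns minus the value estimate
print(compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0))
# lam = 0 recovers the one-step TD errors
print(compute_gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0))
```

With these inputs the first call returns `[3.0, 2.0, 1.0]` and the second returns `[1.0, 1.0, 1.0]`, which are the two extremes listed above.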

Key Resources

📚 Essential Reading

REINFORCE Algorithm Tutorial (Substack)
https://substack.com/inbox/post/170790602

Clear walkthrough of the REINFORCE algorithm with derivations.

The RLHF Book
https://rlhfbook.com/

Comprehensive resource for RL in the context of LLM post-training. Covers policy gradients, PPO, and RLHF applications. Essential for applied RL.

📖 Books

Foundations of Deep Reinforcement Learning by Graesser & Keng

Practical, code-first approach to deep RL. Includes PyTorch implementations of REINFORCE, actor-critic, and PPO.

Learning Path

Phase 1: Theory (4 hours)

  1. Derive policy gradient theorem from scratch
  2. Read REINFORCE tutorial
  3. Work through actor-critic derivation

Phase 2: Implementation (4 hours)

  1. Implement REINFORCE on CartPole
  2. Add baseline (value function)
  3. Implement actor-critic
  4. Compare variance: REINFORCE vs actor-critic

Phase 3: Deep Dive (2 hours)

  1. Read RLHF Book chapter on policy gradients
  2. Implement GAE
  3. Understand connection to PPO (next node)

Common Pitfalls

❌ Forgetting the log: It’s \(\nabla \log \pi\), not \(\nabla \pi\).

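❌ Biased baselines: A baseline must not depend on the action taken. Baselines that depend only on the state (or are constant) leave the gradient unbiased; an action-dependent baseline generally does not.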

❌ Ignoring variance: High variance = slow/unstable learning. Always use baselines.

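❌ Wrong advantage signs: A positive advantage should increase the action's probability and a negative advantage should decrease it. When the update is written as a loss to minimize, the loss is \(-\log \pi_\theta(a|s) \cdot A(s, a)\).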

Next Steps

  • PPO from Scratch: The de facto standard policy gradient algorithm
  • Preference Data & Eval Design: How to get reward signals from human preferences

Assessment Criteria

✅ You understand this node when you can:

  • Derive the policy gradient theorem
  • Implement REINFORCE from scratch
  • Explain why baselines reduce variance without bias
  • Code up an actor-critic agent
  • Implement GAE and tune \(\lambda\)