Overview
Master the art of iterative online RL training. Understand distribution shift, reward hacking, and the stability knobs needed to keep training on track when both policy and reward model evolve.
Goal: Build robust online RL systems that don’t collapse or hack rewards.
Key Concepts
📋 Concepts
Online RL Loop
Classic RL: Fixed environment, train policy.
RLHF/Online RL: Environment (reward model) and policy both evolve.
Loop:
- Collect data with current policy
- Train reward model on new preferences
- Run PPO to optimize the policy against the updated reward model
- Repeat
Challenge: How do we keep this stable when both the policy and the reward model are moving targets?
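The loop above can be sketched as a small skeleton. This is only an illustration of the control flow: every callable passed in (`collect_rollouts`, `label_preferences`, `train_reward_model`, `run_ppo`) is a hypothetical placeholder supplied by the caller, not a function from any library.

```python
from typing import Any, Callable, List, Tuple


def online_rl_loop(
    policy: Any,
    reward_model: Any,
    collect_rollouts: Callable[[Any], List[Any]],
    label_preferences: Callable[[List[Any]], List[Any]],
    train_reward_model: Callable[[Any, List[Any]], Any],
    run_ppo: Callable[[Any, Any], Any],
    num_iterations: int,
) -> Tuple[Any, Any]:
    """Collect -> train RM -> PPO -> repeat, with caller-supplied steps."""
    preference_data: List[Any] = []
    for _ in range(num_iterations):
        # 1. Collect data with the current policy.
        samples = collect_rollouts(policy)
        # 2. Gather preferences on those samples and update the reward model.
        preference_data.extend(label_preferences(samples))
        reward_model = train_reward_model(reward_model, preference_data)
        # 3. Optimize the policy against the updated reward model.
        policy = run_ppo(policy, reward_model)
    return policy, reward_model
```

Because both `policy` and `reward_model` are reassigned on every pass, neither is a fixed target, which is the stability problem the rest of this node addresses.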
Distribution Shift
Problem: The reward model was trained on data from an earlier policy, but we’re using it to evaluate the current policy.
Consequence: Reward model becomes miscalibrated on new policy outputs.
Symptoms:
- Reward scores inflate without quality improvement
- Policy generates adversarial examples that fool RM
- Human raters disagree with RM scores
Solutions:
- KL regularization: Keep policy close to reference
- Iterative RM updates: Collect new preferences on the current policy’s outputs
- Ensemble RMs: Use multiple RMs to detect disagreement
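One way to use an ensemble as a shift detector is to measure how much its members disagree on each sample. The sketch below assumes scores are already computed and stored as `scores_per_rm[k][i]` (RM `k`, sample `i`); the function names and the threshold are illustrative choices, not part of any library.

```python
from statistics import pstdev
from typing import List, Sequence


def ensemble_disagreement(scores_per_rm: Sequence[Sequence[float]]) -> List[float]:
    """Per-sample standard deviation of reward across ensemble members."""
    num_samples = len(scores_per_rm[0])
    return [
        pstdev([rm_scores[i] for rm_scores in scores_per_rm])
        for i in range(num_samples)
    ]


def flag_shifted_samples(
    scores_per_rm: Sequence[Sequence[float]], threshold: float
) -> List[int]:
    """Indices of samples where the RMs disagree by more than `threshold`."""
    disagreement = ensemble_disagreement(scores_per_rm)
    return [i for i, d in enumerate(disagreement) if d > threshold]
```

Flagged samples are natural candidates for the next round of human preference collection.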
Reward Hacking
Definition: Policy learns to exploit flaws in the reward model without improving actual quality.
Examples:
- Generating verbose but low-content responses (RM prefers length)
- Using rare tokens the RM hasn’t seen (the RM is uncertain on such inputs and may assign spuriously high reward)
- Repeating patterns RM can’t detect (e.g., subtle repetition)
Detection:
- Human eval diverges from RM: RM says great, humans say bad
- RM ensemble disagrees: RMs give wildly different scores
- KL explodes: Policy strays far from reference
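The first detection signal can be tracked with a simple agreement rate: on pairs that humans have labeled, how often does the RM score the human-preferred response higher? The sketch below assumes each pair is given as `(rm_score_of_preferred, rm_score_of_rejected)`; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def rm_human_agreement(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of human-labeled pairs where the RM ranks the preferred response higher."""
    if not pairs:
        raise ValueError("need at least one labeled pair")
    agree = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return agree / len(pairs)
```

A value that falls across policy checkpoints while the mean RM score keeps rising is the "RM says great, humans say bad" pattern described above.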
Mitigation:
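- KL penalty: Subtract \(\beta \, \mathrm{KL}(\pi \,\|\, \pi_{\text{ref}})\) from the reward so that drifting from the reference policy is penalized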
- RM ensemble: Only trust reward if all RMs agree
- Human-in-the-loop: Regular human audits
- Adversarial RM training: Collect hacked examples, retrain RM
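"Only trust reward if all RMs agree" can be made concrete in several ways. The sketch below is one assumed interpretation: use the ensemble mean when the scores lie within `max_spread` of each other, and otherwise fall back to the most pessimistic member. Both the rule and the parameter name are illustrative.

```python
from typing import Sequence


def conservative_reward(rm_scores: Sequence[float], max_spread: float) -> float:
    """Ensemble mean if the RMs agree, else the minimum score (pessimistic fallback)."""
    spread = max(rm_scores) - min(rm_scores)
    if spread <= max_spread:
        return sum(rm_scores) / len(rm_scores)
    return min(rm_scores)
```

The pessimistic fallback means a response has to fool every RM at once to collect a high reward.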
KL Regularization
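Objective: \(\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D}} \left[ \mathbb{E}_{y \sim \pi(\cdot \mid x)} \left[ r_\phi(x, y) \right] - \beta \, \mathrm{KL}\left( \pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \right) \right]\)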
Intuition: Optimize reward but don’t stray too far from reference policy.
Tuning \(\beta\):
- Too small → reward hacking
- Too large → policy doesn’t improve
- Typical range: 0.01–0.1 (tune empirically)
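In practice the KL term is often estimated from the log-probabilities of the sampled tokens under the policy and the reference model. The sketch below assumes a sequence-level penalty, reward minus beta times the summed per-token log-ratio; the function and argument names are hypothetical.

```python
from typing import Sequence


def kl_penalized_reward(
    rm_score: float,
    logprobs_policy: Sequence[float],
    logprobs_ref: Sequence[float],
    beta: float,
) -> float:
    """RM score minus beta times a sampled KL estimate.

    The estimate is sum_t [log pi(y_t) - log pi_ref(y_t)] over the sampled tokens.
    """
    if len(logprobs_policy) != len(logprobs_ref):
        raise ValueError("log-prob sequences must have the same length")
    kl_estimate = sum(lp - lr for lp, lr in zip(logprobs_policy, logprobs_ref))
    return rm_score - beta * kl_estimate
```

With `beta = 0` the function returns the raw RM score, which corresponds to the "too small" failure mode above.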
Adaptive KL: Use KL budget that adjusts based on training progress.
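A minimal sketch of an adaptive scheme is a proportional controller of the kind used in some RLHF codebases: raise beta when the observed KL exceeds a target and lower it when KL falls below. The class name, the clip range of 0.2, and the `horizon` parameter are assumptions chosen for illustration.

```python
class AdaptiveKLController:
    """Adjusts beta so that the observed KL tracks `target_kl`."""

    def __init__(self, init_beta: float, target_kl: float, horizon: int) -> None:
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon -> slower adjustments

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error, clipped so a single noisy measurement cannot swing beta.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta
```

For example, with `init_beta=0.1`, `target_kl=6.0`, and `horizon=100`, observing a KL of 12.0 over 10 steps moves beta up by 2 percent.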
Reward Model Updates
When to update RM:
- ✅ After every N iterations (e.g., every 5 PPO rounds)
- ✅ When KL divergence crosses threshold (policy drifted too far)
- ✅ When human eval shows RM miscalibration
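The three triggers above can be combined into a single check. This is a sketch under assumed inputs: `kl_from_ref` is the current policy-to-reference KL, and `rm_human_agreement` is the fraction of human-labeled pairs the RM ranks correctly; all names and thresholds are hypothetical.

```python
def should_update_rm(
    iteration: int,
    update_every: int,
    kl_from_ref: float,
    kl_threshold: float,
    rm_human_agreement: float,
    min_agreement: float,
) -> bool:
    """True if any of the periodic, drift, or miscalibration triggers fires."""
    periodic = iteration > 0 and iteration % update_every == 0
    drifted = kl_from_ref > kl_threshold
    miscalibrated = rm_human_agreement < min_agreement
    return periodic or drifted or miscalibrated
```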
How to update RM:
- Collect new preference data from current policy outputs
- Fine-tune the RM (don’t reset!) on a mixed dataset (old + new)
- Validate RM on held-out preferences
- If RM quality degrades, roll back or use an ensemble
Risk: The RM overfits to the current policy, which amplifies hacking.
Solution: Keep diverse preference dataset spanning many policy checkpoints.
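The update procedure can be sketched as follows, assuming the caller supplies a `finetune` function and a `validate` function that returns held-out preference accuracy. Those two callables, the `old_fraction` mixing ratio, and the function names are hypothetical.

```python
import random
from typing import Any, Callable, List


def mix_preferences(
    old: List[Any], new: List[Any], old_fraction: float, rng: random.Random
) -> List[Any]:
    """All new data plus enough sampled old data to make up `old_fraction` of the mix."""
    if not 0.0 <= old_fraction < 1.0:
        raise ValueError("old_fraction must be in [0, 1)")
    n_old = int(len(new) * old_fraction / (1.0 - old_fraction))
    n_old = min(n_old, len(old))
    return list(new) + rng.sample(old, n_old)


def update_rm_with_rollback(
    rm: Any,
    old: List[Any],
    new: List[Any],
    finetune: Callable[[Any, List[Any]], Any],
    validate: Callable[[Any], float],
    old_fraction: float = 0.5,
    seed: int = 0,
) -> Any:
    """Fine-tune on mixed data; keep the previous RM if held-out accuracy drops."""
    rng = random.Random(seed)
    baseline = validate(rm)
    candidate = finetune(rm, mix_preferences(old, new, old_fraction, rng))
    return candidate if validate(candidate) >= baseline else rm
```

Sampling old data from many earlier checkpoints, rather than only the most recent one, is what keeps the mixed dataset diverse.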
Stability Knobs
Learning Rate Schedules
- Start with high LR for fast initial progress
- Decay LR as policy stabilizes
- Use warmup for RM updates
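One common shape that combines these ideas is linear warmup followed by cosine decay. The sketch below is an assumed schedule, not one prescribed by this node; the step counts and learning rates are illustrative parameters.

```python
import math


def lr_at_step(
    step: int,
    base_lr: float,
    warmup_steps: int,
    total_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to `base_lr`, then cosine decay to `min_lr`."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For RM updates, the step counter could be restarted at each update so every fine-tuning round begins with a short warmup.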
KL Budget Management
- Start with low KL budget (conservative)
- Gradually increase as RM gets more data
- Monitor KL per-iteration, not just cumulative
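A sketch of a budget that starts conservative and relaxes over time, together with a per-iteration check. It assumes `per_iteration_kl[i]` is the KL measured for iteration `i` alone; the linear ramp and all names are illustrative assumptions.

```python
from typing import List, Sequence


def kl_budget(
    iteration: int, start_budget: float, end_budget: float, ramp_iterations: int
) -> float:
    """Budget that grows linearly from `start_budget` to `end_budget`, then stays flat."""
    frac = min(1.0, iteration / ramp_iterations)
    return start_budget + frac * (end_budget - start_budget)


def over_budget_iterations(
    per_iteration_kl: Sequence[float],
    start_budget: float,
    end_budget: float,
    ramp_iterations: int,
) -> List[int]:
    """Iterations whose own KL exceeded the budget in force at that time."""
    return [
        i
        for i, kl in enumerate(per_iteration_kl)
        if kl > kl_budget(i, start_budget, end_budget, ramp_iterations)
    ]
```

Checking each iteration against its own budget catches a single large jump that a cumulative total would smooth over.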
RM Freezing Periods
- Freeze RM for N iterations to let policy catch up
- Update RM only when policy plateaus
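"Plateau" needs an operational definition. One assumed option is to compare the mean reward over the most recent window with the window before it; the window size and improvement threshold below are hypothetical tuning parameters.

```python
from statistics import mean
from typing import Sequence


def has_plateaued(
    reward_history: Sequence[float], window: int, min_improvement: float
) -> bool:
    """True if the latest window's mean reward improved by less than `min_improvement`."""
    if len(reward_history) < 2 * window:
        return False  # not enough history to judge
    recent = mean(reward_history[-window:])
    previous = mean(reward_history[-2 * window : -window])
    return recent - previous < min_improvement
```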
Checkpointing & Rollback
- Save policy checkpoints every iteration
- If reward hacking is detected, roll back to the last good checkpoint
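A minimal in-memory sketch of the checkpoint-and-rollback idea. A real system would write checkpoints to disk; here `state` is any copyable Python object, and the `is_good` flag is assumed to come from the hacking detectors discussed earlier. The class is hypothetical.

```python
import copy
from typing import Any, List, Tuple


class CheckpointManager:
    """Stores one checkpoint per iteration and restores the latest good one."""

    def __init__(self) -> None:
        self._checkpoints: List[Tuple[int, Any, bool]] = []

    def save(self, iteration: int, state: Any, is_good: bool) -> None:
        # Deep-copy so later training cannot mutate the saved state.
        self._checkpoints.append((iteration, copy.deepcopy(state), is_good))

    def rollback(self) -> Tuple[int, Any]:
        """Return (iteration, state) of the most recent checkpoint marked good."""
        for iteration, state, is_good in reversed(self._checkpoints):
            if is_good:
                return iteration, copy.deepcopy(state)
        raise RuntimeError("no good checkpoint available")
```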
LLM Learning Dynamics
SFT → PPO → DPO progression:
- SFT (Supervised Fine-Tuning): Learn from demonstrations
  - Fast initial progress
  - Upper bounded by demo quality
- PPO (Online RL): Optimize the policy against a reward model
  - Can surpass demos
  - Requires careful tuning
- DPO (Direct Preference Optimization): Offline preference learning
  - Simpler, no separate reward model to train
  - Less prone to hacking
Key insight: Different algorithms shine at different stages of training.
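To make the contrast with PPO concrete, here is a sketch of the per-pair DPO loss computed directly from sequence log-probabilities, with no separate reward model. The inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and the reference model; the function name is hypothetical.

```python
import math


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float,
) -> float:
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))) for one pair."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch for numerical stability.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference more than it raises the rejected one's.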
Key Resources
📚 Essential Papers
Understanding LLM Learning Dynamics (arXiv:2407.10490)
https://arxiv.org/abs/2407.10490
Deep analysis of SFT, PPO, and DPO learning dynamics for LLMs. Essential for understanding when each algorithm works best and how they interact. Must-read for RLHF practitioners.
The RLHF Book
https://rlhfbook.com/
Chapters on online training, distribution shift, and reward hacking with practical advice.
Learning Path
Phase 1: Understand the Dynamics (6 hours)
- Read arXiv:2407.10490 (LLM learning dynamics)
- Study RLHF Book chapters on online loops
- Understand failure modes (reward hacking, distribution shift)
Phase 2: Implement (10 hours)
- Build online RL loop: PPO + iterative RM updates
- Implement KL regularization with tunable \(\beta\)
- Add monitoring: KL divergence, RM calibration, human eval
- Simulate reward hacking and test mitigation strategies
Phase 3: Stability Engineering (4 hours)
- Experiment with different RM update frequencies
- Tune KL budget over training
- Implement RM ensemble for hacking detection
- Build rollback system for catastrophic failures
Common Pitfalls
❌ Never updating RM: Policy will eventually hack a fixed RM.
❌ Updating RM too often: RM overfits to current policy.
❌ No KL regularization: Nothing stops the policy from drifting into regions where the RM is unreliable, so reward hacking becomes very likely.
❌ Ignoring human eval: RM scores become meaningless without ground truth.
❌ No rollback plan: When training goes off the rails, you’re stuck.
Next Steps
- Search / Test-Time Compute: Use RMs at inference time for best-of-N sampling
- RL for Structured Outputs: Apply online RL to constrained generation (layouts, CDFs)
Assessment Criteria
✅ You understand this node when you can:
- Implement a full online RL loop with iterative RM updates
- Detect and mitigate reward hacking
- Tune KL regularization based on training dynamics
- Monitor distribution shift and RM calibration
- Explain the SFT → PPO → DPO progression
- Build robust checkpointing and rollback systems