Online Loops & Stability

Overview

Master the art of iterative online RL training. Understand distribution shift, reward hacking, and the stability knobs needed to keep training on track when both policy and reward model evolve.

Goal: Build robust online RL systems that don’t collapse or hack rewards.

Online RL: Data Collection + Training Loop
Distribution Shift: Policy vs Reward Model
Reward Hacking: Detection & Mitigation
KL Regularization: Reference Policy Anchoring
Reward Model Updates: When & How
Stability Knobs: LR schedules, KL budgets, RM freezing
LLM Learning Dynamics: SFT → PPO → DPO

Online RL Loop

Classic RL: Fixed environment, train policy.

RLHF/Online RL: Environment (reward model) and policy both evolve.

Loop:

Collect data with current policy $\pi_\theta$
Train reward model $r_\phi$ on new preferences
Run PPO to optimize $\pi_\theta$ against $r_\phi$
Repeat
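
The four steps above can be sketched as a plain driver loop. This is a minimal skeleton, not a definitive implementation: `collect`, `label_preferences`, `train_rm`, and `run_ppo` are hypothetical callables standing in for whatever sampling, labeling, and training code a real system uses.

```python
from typing import Any, Callable, List, Tuple


def online_rl_loop(
    policy: Any,
    reward_model: Any,
    collect: Callable[[Any], Any],            # hypothetical: sample outputs from the policy
    label_preferences: Callable[[Any], Any],  # hypothetical: turn samples into preference data
    train_rm: Callable[[Any, Any], Any],      # hypothetical: update the RM on preferences
    run_ppo: Callable[[Any, Any], Any],       # hypothetical: optimize the policy against the RM
    num_rounds: int,
) -> Tuple[Any, Any, List[Tuple[int, Any, Any]]]:
    """Run collect -> train RM -> PPO for num_rounds rounds."""
    history = []
    for round_idx in range(num_rounds):
        samples = collect(policy)                      # 1. collect data with current policy
        prefs = label_preferences(samples)
        reward_model = train_rm(reward_model, prefs)   # 2. train RM on new preferences
        policy = run_ppo(policy, reward_model)         # 3. optimize policy against RM
        history.append((round_idx, policy, reward_model))
    return policy, reward_model, history               # 4. repeat happens via the for loop
```

Keeping the per-round history makes it possible to inspect or restore earlier policy and RM states later.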

Challenge: How to keep this stable when both $\pi$ and $r$ are moving targets?

Distribution Shift

Problem: Reward model trained on data from $\pi_{\text{old}}$, but we’re using it to evaluate $\pi_{\text{new}}$.

Consequence: Reward model becomes miscalibrated on new policy outputs.

Symptoms:

Reward scores inflate without quality improvement
Policy generates adversarial examples that fool RM
Human raters disagree with RM scores

Solutions:

KL regularization: Keep policy close to reference $\pi_{\text{ref}}$
Iterative RM updates: Collect new preferences on $\pi_{\text{new}}$ outputs
Ensemble RMs: Use multiple RMs to detect disagreement
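
Ensemble disagreement can be measured in many ways. The sketch below assumes each RM returns a scalar score for the same output and uses the population standard deviation as the disagreement signal. The function name and the `max_std` threshold are illustrative assumptions, not values from any particular system.

```python
from statistics import mean, pstdev
from typing import Sequence, Tuple


def ensemble_score(scores: Sequence[float], max_std: float) -> Tuple[float, float, bool]:
    """Return (mean score, disagreement, trusted) for one output.

    scores:  one scalar per reward model in the ensemble
    max_std: hypothetical threshold above which the score is not trusted
    """
    mu = mean(scores)
    sigma = pstdev(scores)  # spread across RMs, used as the disagreement signal
    return mu, sigma, sigma <= max_std
```

For example, scores of `[1.0, 1.0, 1.0]` give zero disagreement, while `[0.0, 4.0]` give a spread of 2.0 and would be flagged with `max_std=0.5`.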

Reward Hacking

Definition: Policy learns to exploit flaws in the reward model without improving actual quality.

Examples:

Generating verbose but low-content responses (RM prefers length)
Using rare tokens the RM hasn’t seen (the RM is unreliable on unfamiliar inputs and may assign spuriously high reward)
Repeating patterns RM can’t detect (e.g., subtle repetition)

Detection:

Human eval diverges from RM: RM says great, humans say bad
RM ensemble disagrees: RMs give wildly different scores
KL explodes: Policy strays far from reference
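
The first detection signal can be tracked as a pairwise agreement rate. This is a rough sketch that assumes a small audit set where humans compared two outputs and the RM scored both. The tuple layout and function name are hypothetical.

```python
from typing import Sequence, Tuple


def rm_human_agreement(pairs: Sequence[Tuple[float, float, bool]]) -> float:
    """Fraction of audited pairs where the RM ranks outputs the way humans did.

    Each pair is (rm_score_a, rm_score_b, human_prefers_a).
    RM ties count as the RM preferring b in this simple version.
    """
    if not pairs:
        raise ValueError("need at least one labeled pair")
    agree = sum(
        1 for score_a, score_b, human_prefers_a in pairs
        if (score_a > score_b) == human_prefers_a
    )
    return agree / len(pairs)
```

A falling agreement rate while mean RM reward keeps rising is the "RM says great, humans say bad" pattern described above.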

Mitigation:

KL penalty: $\text{reward} = r_\phi(\text{output}) - \beta \cdot \text{KL}(\pi || \pi_{\text{ref}})$
RM ensemble: Only trust reward if all RMs agree
Human-in-the-loop: Regular human audits
Adversarial RM training: Collect hacked examples, retrain RM
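
The KL penalty in the first mitigation can be sketched per sample. The version below assumes per-token log-probabilities of the sampled response are available under both the policy and the reference. Summing the log-ratios over the sampled tokens gives a single-sample estimate of the sequence-level KL. The function name is illustrative.

```python
from typing import Sequence, Tuple


def kl_penalized_reward(
    rm_score: float,
    policy_logprobs: Sequence[float],  # log pi_theta(token | prefix) for each sampled token
    ref_logprobs: Sequence[float],     # log pi_ref(token | prefix) for the same tokens
    beta: float,
) -> Tuple[float, float]:
    """Return (shaped reward, KL estimate) for one sampled response."""
    if len(policy_logprobs) != len(ref_logprobs):
        raise ValueError("log-prob sequences must have the same length")
    # Single-sample estimate of KL(pi_theta || pi_ref) on this response.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate, kl_estimate
```

Because this is a sample-based estimate, individual values can be negative even though the true KL is not. Averaging over a batch is the usual way to get a stable number.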

KL Regularization

Objective:

$$\max_\theta \; \mathbb{E}_{x,\, y \sim \pi_\theta} \left[ r_\phi(x, y) - \beta \cdot \text{KL}\left(\pi_\theta(y|x) \,||\, \pi_{\text{ref}}(y|x)\right) \right]$$

Intuition: Optimize reward but don’t stray too far from reference policy.

Tuning $\beta$:

Too small → reward hacking
Too large → policy doesn’t improve
Typical range: 0.01–0.1 (tune empirically)

Adaptive KL: Instead of fixing $\beta$, set a target KL (a budget) and adjust $\beta$ during training so the measured KL stays near it.
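
One common way to adapt the coefficient is a clipped proportional controller in the style of Ziegler et al. (2019). The sketch below is a plain-Python version under that assumption. The class name, the clip range of 0.2, and the `horizon` parameter follow that scheme but should be treated as tunable choices rather than fixed constants.

```python
class AdaptiveKLController:
    """Raise beta when measured KL exceeds the target, lower it when below."""

    def __init__(self, init_beta: float, target_kl: float, horizon: int):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adaptation

    def update(self, observed_kl: float, n_steps: int) -> float:
        # Relative error, clipped so a single noisy batch cannot swing beta too far.
        error = observed_kl / self.target_kl - 1.0
        error = max(-0.2, min(0.2, error))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta
```

Usage: call `update` once per PPO batch with the batch-mean KL and the batch size, then use `controller.beta` as $\beta$ for the next batch.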

Reward Model Updates

When to update RM:

✅ After every N iterations (e.g., every 5 PPO rounds)
✅ When KL divergence crosses threshold (policy drifted too far)
✅ When human eval shows RM miscalibration
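
The three triggers can be combined into one check. In this sketch the default thresholds (`every_n=5`, `kl_threshold=10.0`, `min_agreement=0.65`) are placeholder assumptions to be tuned per project, and `human_agreement` is assumed to be an RM-vs-human agreement rate in [0, 1].

```python
from typing import List


def should_update_rm(
    iteration: int,
    last_update_iter: int,
    kl_from_ref: float,
    human_agreement: float,
    *,
    every_n: int = 5,             # placeholder: scheduled update interval
    kl_threshold: float = 10.0,   # placeholder: drift limit
    min_agreement: float = 0.65,  # placeholder: calibration floor
) -> List[str]:
    """Return the list of reasons to update the RM (empty list means keep it)."""
    reasons = []
    if iteration - last_update_iter >= every_n:
        reasons.append("scheduled")
    if kl_from_ref > kl_threshold:
        reasons.append("kl_drift")
    if human_agreement < min_agreement:
        reasons.append("miscalibration")
    return reasons
```

Returning the reasons rather than a bare boolean makes it easier to log why each RM update happened.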

How to update RM:

Collect new preference data from current policy outputs
Finetune RM (don’t reset!) on mixed dataset (old + new)
Validate RM on held-out preferences
If RM quality degrades, roll back to the previous RM or fall back to an ensemble
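
The mixing and validation steps above might look like the following. This is a sketch under stated assumptions: preference examples are opaque items in lists, `old_fraction` is the share of old data in the mix, and `evaluate` is a hypothetical callable that returns held-out preference accuracy for a given RM.

```python
import random
from typing import Any, Callable, List, Sequence


def build_mixed_dataset(
    old_prefs: Sequence[Any],
    new_prefs: Sequence[Any],
    old_fraction: float,
    seed: int = 0,
) -> List[Any]:
    """Keep all new preferences and add old ones up to old_fraction of the mix."""
    if not 0.0 <= old_fraction < 1.0:
        raise ValueError("old_fraction must be in [0, 1)")
    rng = random.Random(seed)
    n_old = int(len(new_prefs) * old_fraction / (1.0 - old_fraction))
    n_old = min(n_old, len(old_prefs))  # cannot sample more old data than exists
    mixed = list(new_prefs) + rng.sample(list(old_prefs), n_old)
    rng.shuffle(mixed)
    return mixed


def accept_or_rollback(
    old_rm: Any,
    candidate_rm: Any,
    evaluate: Callable[[Any], float],  # hypothetical: held-out preference accuracy
    tolerance: float = 0.0,
) -> Any:
    """Keep the finetuned RM only if held-out accuracy did not degrade."""
    if evaluate(candidate_rm) + tolerance >= evaluate(old_rm):
        return candidate_rm
    return old_rm
```

Evaluating both RMs on the same held-out set is what makes the comparison meaningful. That set should itself include preferences from several policy checkpoints.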

Risk: An RM that overfits to the current policy’s outputs can amplify reward hacking instead of correcting it.

Solution: Keep diverse preference dataset spanning many policy checkpoints.

Stability Knobs

Learning Rate Schedules

Start with high LR for fast initial progress
Decay LR as policy stabilizes
Use warmup for RM updates
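
A single schedule function can cover both cases: no warmup with decay for the policy, and a short warmup for each RM finetuning run. The sketch below uses linear warmup followed by cosine decay, which is one common choice and not the only one. All parameter names are illustrative.

```python
import math


def lr_at_step(
    step: int,
    base_lr: float,
    warmup_steps: int,
    total_steps: int,
    min_lr: float = 0.0,
) -> float:
    """Linear warmup to base_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)  # hold at min_lr after total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

With `warmup_steps=0` the schedule starts at `base_lr` and only decays, which matches the "start high, then decay" advice for the policy.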

KL Budget Management

Start with low KL budget (conservative)
Gradually increase as RM gets more data
Monitor KL per-iteration, not just cumulative
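
A growing KL budget and a per-iteration check could be sketched as below. The linear ramp, the function names, and the reading of "per-iteration" as "KL measured at each iteration and compared with that iteration's budget" are all assumptions made for illustration.

```python
from typing import List, Sequence


def kl_budget(iteration: int, start_budget: float, end_budget: float, ramp_iters: int) -> float:
    """Linearly raise the allowed KL from start_budget to end_budget."""
    frac = min(1.0, iteration / ramp_iters)
    return start_budget + frac * (end_budget - start_budget)


def budget_violations(per_iter_kl: Sequence[float], budgets: Sequence[float]) -> List[int]:
    """Return the indices of iterations whose measured KL exceeded the budget."""
    return [i for i, (kl, budget) in enumerate(zip(per_iter_kl, budgets)) if kl > budget]
```

Checking each iteration separately catches a single large jump that a running average or cumulative total would smooth over.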

RM Freezing Periods

Freeze RM for N iterations to let policy catch up
Update RM only when policy plateaus

Checkpointing & Rollback

Save policy checkpoints every iteration
If reward hacking is detected, roll back to the last good checkpoint
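
A minimal in-memory version of this checkpoint-and-rollback logic is sketched below. It assumes the policy state is a copyable Python object and that some external check decides which iterations are "good". A real system would write checkpoints to disk. The class and method names are hypothetical.

```python
import copy
from typing import Any, List, Tuple


class CheckpointStore:
    """Keeps per-iteration policy snapshots with a 'good' flag for each."""

    def __init__(self) -> None:
        self._entries: List[Tuple[int, Any, bool]] = []  # (iteration, state, is_good)

    def save(self, iteration: int, state: Any, is_good: bool = True) -> None:
        # Deep copy so later training does not mutate the stored snapshot.
        self._entries.append((iteration, copy.deepcopy(state), is_good))

    def mark_bad(self, iteration: int) -> None:
        """Flag a checkpoint as bad, e.g. after reward hacking is detected."""
        self._entries = [
            (it, state, good and it != iteration) for it, state, good in self._entries
        ]

    def rollback(self) -> Tuple[int, Any]:
        """Return (iteration, state) of the most recent good checkpoint."""
        for iteration, state, is_good in reversed(self._entries):
            if is_good:
                return iteration, copy.deepcopy(state)
        raise LookupError("no good checkpoint available")
```

Since hacking is often noticed several iterations late, being able to mark earlier checkpoints as bad after the fact matters as much as saving them.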

LLM Learning Dynamics

SFT → PPO → DPO progression:

SFT (Supervised Fine-Tuning): Learn from demonstrations
- Fast initial progress
- Upper bounded by demo quality
PPO (Online RL): Optimize the policy against a reward model
- Can surpass demos
- Requires careful tuning
DPO (Direct Preference Optimization): Offline preference learning
- Simpler, no separate reward model to train
- No learned RM for the policy to exploit online, though it can still overfit the preference data
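
To make the "no separate reward model" point concrete, here is a sketch of the per-pair DPO loss from Rafailov et al. (2023), written in plain Python. It assumes sequence-level log-probabilities of the chosen and rejected responses are already computed under the policy and the reference. The function and argument names are illustrative.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * margin)) for one preference pair."""
    # How much more the policy favors chosen over rejected, relative to the reference.
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))
```

The reference policy appears directly in the loss, so it plays the anchoring role that the KL penalty plays in PPO.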

Key insight: Different algorithms shine at different stages of training.

Key Resources

📚 Essential Papers

Understanding LLM Learning Dynamics (arXiv:2407.10490)
https://arxiv.org/abs/2407.10490

Deep analysis of SFT, PPO, and DPO learning dynamics for LLMs. Essential for understanding when each algorithm works best and how they interact. Must-read for RLHF practitioners.

The RLHF Book
https://rlhfbook.com/

Chapters on online training, distribution shift, and reward hacking with practical advice.

Learning Path

Phase 1: Understand the Dynamics (6 hours)

Read arXiv:2407.10490 (LLM learning dynamics)
Study RLHF Book chapters on online loops
Understand failure modes (reward hacking, distribution shift)

Phase 2: Implement (10 hours)

Build online RL loop: PPO + iterative RM updates
Implement KL regularization with tunable $\beta$
Add monitoring: KL divergence, RM calibration, human eval
Simulate reward hacking and test mitigation strategies

Phase 3: Stability Engineering (4 hours)

Experiment with different RM update frequencies
Tune KL budget over training
Implement RM ensemble for hacking detection
Build rollback system for catastrophic failures

Common Pitfalls

❌ Never updating RM: Policy will eventually hack a fixed RM.

❌ Updating RM too often: RM overfits to current policy.

❌ No KL regularization: Nothing stops the policy from drifting into regions where the RM is unreliable, so reward hacking becomes very likely.

❌ Ignoring human eval: RM scores become meaningless without ground truth.

❌ No rollback plan: When training goes off the rails, you’re stuck.

Next Steps

- Search / Test-Time Compute: Use RMs at inference time for best-of-N sampling
- RL for Structured Outputs: Apply online RL to constrained generation (layouts, CDFs)

Assessment Criteria

✅ You understand this node when you can:

Implement a full online RL loop with iterative RM updates
Detect and mitigate reward hacking
Tune KL regularization based on training dynamics
Monitor distribution shift and RM calibration
Explain SFT → PPO → DPO progression
Build robust checkpointing and rollback systems