Overview
Your living playbook for ultra-scale distributed training. Distill practical heuristics for 5D parallelism, overlap strategies, gradient bucketing, and the ZeRO + Tensor Parallelism interplay.
This is the “scale bible” ΓÇö a constantly updated reference as you encounter real bottlenecks.
Key Concepts
📋 Concepts
The 5D Parallelism Stack
Modern large-scale training combines multiple parallelism strategies:
- Data Parallelism (DP): Replicate model across workers
- Tensor Parallelism (TP): Shard individual layers across GPUs
- Pipeline Parallelism (PP): Split model layers across stages
- Sequence Parallelism (SP): Shard sequence dimension for long-context models
- Context Parallelism (CP): Ring-style attention for ultra-long sequences
The art: Knowing which dimensions to parallelize for your model + hardware.
Overlap Strategies
Compute-Communication Overlap
Key insight: Don’t wait for communication to finish before computing.
- Gradient bucketing: AllReduce gradients in buckets as they’re computed
- Pipeline bubbles: Fill bubbles with microbatches
- Prefetching: Load next batch while current batch is computing
ZeRO + Tensor Parallelism Interplay
Naive approach: Apply ZeRO-3 + TP independently leads to excessive communication
Better approach:
- Use TP within nodes (high bandwidth)
- Use ZeRO across nodes (lower bandwidth)
- Hybrid sharding:
HYBRID_SHARDin PyTorch FSDP
Gradient Bucketing
Problem: Waiting for all gradients before AllReduce wastes time.
Solution: Bucket gradients and start AllReduce as soon as each bucket is ready.
Tuning: Bucket size trades off:
- Small buckets: more overlap, more kernel launches
- Large buckets: less overlap, fewer kernel launches
Heuristic: 25MB buckets is a good default for most models.
Memory Budgeting
For a transformer with N parameters:
- Parameters: N
- Gradients: N
- Optimizer states (AdamW): 2N (momentum + variance)
- Activations: Depends on batch size, sequence length, hidden size
ZeRO-3 savings: (N + N + 2N) / num_gpus = 4N / num_gpus
Activation checkpointing: Trade compute for memory (recompute activations in backward pass)
Communication-Compute Ratio
Good scaling: Communication time much less than Compute time
Rule of thumb: Aim for greater than 10:1 compute-to-communication ratio.
How to achieve:
- Increase batch size (more compute per communication)
- Use faster interconnect (NVLink better than PCIe, InfiniBand better than Ethernet)
- Reduce communication frequency (gradient accumulation)
Key Resources
📚 Essential Playbooks
Nanotron Ultra-Scale Playbook
https://huggingface.co/spaces/nanotron/ultrascale-playbook
The definitive reference for ultra-scale training heuristics. Covers 5D parallelism interplay, overlap strategies, and practical recipes for scaling to thousands of GPUs. This is your bible.
Smol Training Playbook (complementary)
https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook
Learning Path
Phase 1: Understand the Dimensions (10 hours)
- Read Nanotron Ultra-Scale Playbook cover-to-cover
- Map each parallelism type to a concrete use case
- Sketch communication patterns for hybrid TP+ZeRO
Phase 2: Hands-On Profiling (12 hours)
- Profile a 7B model with different parallelism configs
- Measure communication-compute ratio
- Experiment with gradient bucketing sizes
- Compare memory usage: ZeRO-1 vs ZeRO-2 vs ZeRO-3
Phase 3: Build Your Heuristics Library (8 hours)
- Document your own scaling rules for your hardware
- Create a decision tree: “Given model size X and N GPUs, use…”
- Benchmark and record actual throughput numbers
Practical Heuristics (Living Document)
Model Size Decision Tree
Under 1B params:
- Pure DP (simplest, fastest)
- ZeRO-1 if memory tight
1B - 13B params:
- DP + ZeRO-2
- Consider TP=2 within nodes if very memory constrained
13B - 70B params:
- DP + ZeRO-3 or FSDP
- TP=2 or TP=4 within nodes
- PP=2 if model still doesn’t fit
Over 70B params:
- Full 5D parallelism
- TP=4 or TP=8 within nodes
- PP=4+ across nodes
- ZeRO-3 or hybrid sharding
- Consider SP/CP for long-context variants
Overlap Checklist
✅ Gradient bucketing enabled (25MB buckets)
✅ Data prefetching in dataloader
✅ Pipeline parallelism with microbatches (8-16 microbatches)
✅ Activation checkpointing for memory-bound models
✅ Mixed precision (bf16) to reduce communication volume
Common Pitfalls
❌ Over-parallelizing: More parallelism ≠ faster. Communication overhead can dominate.
Γ¥î Ignoring hardware topology: Don’t use TP across nodes (slow interconnect).
❌ Skipping profiling: Measure before optimizing. Your intuition will be wrong.
Γ¥î Cargo-culting configs: What works for LLaMA 70B won’t work for a video DiT.
Next Steps
Next Steps:
- Long-Context Parallelism: Deep dive into SP/CP for 1M+ token contexts
- Inference Serving: Apply parallelism thinking to latency-constrained serving
Maintenance
This is a living document. As you encounter new bottlenecks or discover better heuristics, update this page with:
- Empirical throughput numbers
- Hardware-specific tuning
- Model-specific quirks (e.g., DiT vs Transformer parallelism)