Ultra-Scale Heuristics Distillation

Overview

Your living playbook for ultra-scale distributed training. Distill practical heuristics for 5D parallelism, overlap strategies, gradient bucketing, and the ZeRO + Tensor Parallelism interplay.

This is the “scale bible” ΓÇö a constantly updated reference as you encounter real bottlenecks.

5D Parallelism: DP + TP + PP + SP + CP Overlap: Compute, Communication, I/O Gradient Bucketing & AllReduce Optimization ZeRO + Tensor Parallelism Interplay Memory Budgeting: Activation, Gradients, Optimizer States Communication-Compute Ratio Analysis Empirical Scaling Laws for Parallelism

The 5D Parallelism Stack

Modern large-scale training combines multiple parallelism strategies:

Data Parallelism (DP): Replicate model across workers
Tensor Parallelism (TP): Shard individual layers across GPUs
Pipeline Parallelism (PP): Split model layers across stages
Sequence Parallelism (SP): Shard sequence dimension for long-context models
Context Parallelism (CP): Ring-style attention for ultra-long sequences

The art: Knowing which dimensions to parallelize for your model + hardware.

Overlap Strategies

Compute-Communication Overlap

Key insight: Don’t wait for communication to finish before computing.

Gradient bucketing: AllReduce gradients in buckets as they’re computed
Pipeline bubbles: Fill bubbles with microbatches
Prefetching: Load next batch while current batch is computing

ZeRO + Tensor Parallelism Interplay

Naive approach: Apply ZeRO-3 + TP independently leads to excessive communication

Better approach:

Use TP within nodes (high bandwidth)
Use ZeRO across nodes (lower bandwidth)
Hybrid sharding: HYBRID_SHARD in PyTorch FSDP

Gradient Bucketing

Problem: Waiting for all gradients before AllReduce wastes time.

Solution: Bucket gradients and start AllReduce as soon as each bucket is ready.

Tuning: Bucket size trades off:

Small buckets: more overlap, more kernel launches
Large buckets: less overlap, fewer kernel launches

Heuristic: 25MB buckets is a good default for most models.

Memory Budgeting

For a transformer with N parameters:

Parameters: N
Gradients: N
Optimizer states (AdamW): 2N (momentum + variance)
Activations: Depends on batch size, sequence length, hidden size

ZeRO-3 savings: (N + N + 2N) / num_gpus = 4N / num_gpus

Activation checkpointing: Trade compute for memory (recompute activations in backward pass)

Communication-Compute Ratio

Good scaling: Communication time much less than Compute time

Rule of thumb: Aim for greater than 10:1 compute-to-communication ratio.

How to achieve:

Increase batch size (more compute per communication)
Use faster interconnect (NVLink better than PCIe, InfiniBand better than Ethernet)
Reduce communication frequency (gradient accumulation)

Key Resources

≡ƒôÜ Essential Playbooks

Nanotron Ultra-Scale Playbook
https://huggingface.co/spaces/nanotron/ultrascale-playbook

The definitive reference for ultra-scale training heuristics. Covers 5D parallelism interplay, overlap strategies, and practical recipes for scaling to thousands of GPUs. This is your bible.

Smol Training Playbook (complementary)
https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook

Learning Path

Phase 1: Understand the Dimensions (10 hours)

Read Nanotron Ultra-Scale Playbook cover-to-cover
Map each parallelism type to a concrete use case
Sketch communication patterns for hybrid TP+ZeRO

Phase 2: Hands-On Profiling (12 hours)

Profile a 7B model with different parallelism configs
Measure communication-compute ratio
Experiment with gradient bucketing sizes
Compare memory usage: ZeRO-1 vs ZeRO-2 vs ZeRO-3

Phase 3: Build Your Heuristics Library (8 hours)

Document your own scaling rules for your hardware
Create a decision tree: “Given model size X and N GPUs, use…”
Benchmark and record actual throughput numbers

Practical Heuristics (Living Document)

Model Size Decision Tree

Under 1B params:

Pure DP (simplest, fastest)
ZeRO-1 if memory tight

1B - 13B params:

DP + ZeRO-2
Consider TP=2 within nodes if very memory constrained

13B - 70B params:

DP + ZeRO-3 or FSDP
TP=2 or TP=4 within nodes
PP=2 if model still doesn’t fit

Over 70B params:

Full 5D parallelism
TP=4 or TP=8 within nodes
PP=4+ across nodes
ZeRO-3 or hybrid sharding
Consider SP/CP for long-context variants

Overlap Checklist

Γ£à Gradient bucketing enabled (25MB buckets)
Γ£à Data prefetching in dataloader
Γ£à Pipeline parallelism with microbatches (8-16 microbatches)
Γ£à Activation checkpointing for memory-bound models
Γ£à Mixed precision (bf16) to reduce communication volume

Common Pitfalls

Γ¥î Over-parallelizing: More parallelism Γëá faster. Communication overhead can dominate.

Γ¥î Ignoring hardware topology: Don’t use TP across nodes (slow interconnect).

Γ¥î Skipping profiling: Measure before optimizing. Your intuition will be wrong.

Γ¥î Cargo-culting configs: What works for LLaMA 70B won’t work for a video DiT.

Next Steps

Next Steps:

Long-Context Parallelism: Deep dive into SP/CP for 1M+ token contexts
Inference Serving: Apply parallelism thinking to latency-constrained serving

Maintenance

This is a living document. As you encounter new bottlenecks or discover better heuristics, update this page with:

Empirical throughput numbers
Hardware-specific tuning
Model-specific quirks (e.g., DiT vs Transformer parallelism)