Sai's GPU Brrr Reading List

I’ve found myself reading articles, blogs, papers and lecture notes across {ML, SWE, Maths} as a form of passive learning, then turning the useful bits into implementation later.

This page is just the GPU/performance side of that habit. How do we make the GPU go brr? The answer:

  1. Understand the expensive bit of hardware
  2. Measure and profile
  3. Attack wasted movement, launches and idle time; keep everything close together
  4. Fuse it back together

Act 1: The GPU Hierarchy

Three hierarchies get taught together because they interact constantly: execution hierarchy, hardware hierarchy and memory hierarchy.

  • Execution: grid, blocks, warps, threads.
  • Hardware: GPU, SMs, tensor cores / CUDA cores.
  • Memory: registers, shared memory / SRAM, L2, HBM.

Small mental note: the grid is the whole kernel launch, spread across the GPU chip; each block runs on a single SM (Streaming Multiprocessor, loosely comparable to a CPU core). Within the memory hierarchy, registers are closest/fastest, then shared memory, then L2, then global memory/HBM.
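To make the execution hierarchy concrete, here is a small pure-Python sketch of how a 1D launch maps a (block, thread) pair to a global index, a warp and a lane. The warp size of 32 matches current NVIDIA GPUs; the helper names `locate_thread` and `blocks_needed` are made up for illustration and are not part of any CUDA API.

```python
WARP_SIZE = 32  # threads per warp on current NVIDIA GPUs


def locate_thread(block_idx: int, thread_idx: int, block_dim: int) -> dict:
    """Map a (block, thread) pair in a 1D launch to its global position."""
    if not 0 <= thread_idx < block_dim:
        raise ValueError("thread_idx must be in [0, block_dim)")
    return {
        # Same arithmetic as blockIdx.x * blockDim.x + threadIdx.x in a kernel.
        "global_idx": block_idx * block_dim + thread_idx,
        # Threads in a block are grouped into warps of 32 consecutive threads.
        "warp_in_block": thread_idx // WARP_SIZE,
        "lane": thread_idx % WARP_SIZE,
    }


def blocks_needed(n_elements: int, block_dim: int) -> int:
    """Ceil-divide so every element gets a thread (the grid size)."""
    return (n_elements + block_dim - 1) // block_dim
```

For example, thread 70 of block 2 with 256 threads per block handles global element 582 and sits in warp 2, lane 6 of its block; covering 1000 elements at that block size takes 4 blocks.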

Core idea: compute is cheap; moving data is expensive. Most of the resources below are different ways of moving fewer bytes, moving smaller values, reusing bytes more often, or keeping the GPU busy while bytes are moving.

Resources

Act 1.5: The GPU Is a Moving Target

There is not one GPU. Each generation moves the numbers: HBM bandwidth and capacity, tensor-core throughput, supported precision formats and async data-movement machinery.

That matters because the best optimization depends on the hardware. Ampere, Hopper and Blackwell do not have the same precision formats, memory hierarchy details, or kernel sweet spots.

Resources

Act 2: Performance Models and Rooflines

Before touching a kernel, identify which resource is limiting it:

  • Compute: math units are saturated.
  • Memory: math units are waiting for bytes.
  • Overhead: launches/setup/synchronization dominate.

The roofline model is the main picture: low arithmetic intensity lives on the memory-bandwidth diagonal; high arithmetic intensity hits the compute ceiling. A lot of deep learning optimization is pushing work up and right by increasing reuse or avoiding materialization.
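A rough sketch of the roofline arithmetic, to show where a kernel lands. The peak throughput and bandwidth below are round hypothetical numbers, not any specific GPU, and the matmul intensity assumes each matrix touches HBM exactly once, which is the best case.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline: the lower of the compute ceiling and the memory diagonal."""
    return min(peak_flops, intensity * bandwidth)


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOP/byte) where the two roofs meet."""
    return peak_flops / bandwidth


def elementwise_add_intensity(bytes_per_elem: int = 4) -> float:
    # c = a + b: 1 FLOP per element, 2 values read and 1 written.
    return 1 / (3 * bytes_per_elem)


def square_matmul_intensity(n: int, bytes_per_elem: int = 2) -> float:
    # C = A @ B with n x n matrices: 2*n^3 FLOPs over 3*n^2 values moved.
    return (2 * n**3) / (3 * n**2 * bytes_per_elem)


PEAK = 100e12  # hypothetical: 100 TFLOP/s
BW = 1e12      # hypothetical: 1 TB/s of HBM bandwidth

add_bound = attainable_flops(elementwise_add_intensity(), PEAK, BW)
matmul_bound = attainable_flops(square_matmul_intensity(4096), PEAK, BW)
```

With these made-up numbers the ridge point is 100 FLOP/byte. The fp32 add sits at 1/12 FLOP/byte, far down the memory diagonal, while the large fp16 matmul is well past the ridge and capped by the compute ceiling.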

Resources

Act 3: Profiling, torch.compile and Fusion

The first optimization baseline is measurement. After that, the first automatic optimization is often compilation/fusion: capture the graph, remove Python overhead, fuse operations, and avoid HBM round-trips for intermediates.
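Why fusion helps can be sketched with a byte count. This assumes a chain of unary elementwise ops (say scale, then bias, then activation), that an unfused op reads its input from HBM and writes its output back, and that a fused kernel keeps intermediates in registers. The function names are illustrative only.

```python
def hbm_bytes_unfused(n_elements: int, n_ops: int, bytes_per_elem: int = 2) -> int:
    """Each op is its own kernel: one HBM read and one HBM write per element."""
    return n_ops * 2 * n_elements * bytes_per_elem


def hbm_bytes_fused(n_elements: int, bytes_per_elem: int = 2) -> int:
    """One kernel: read the input once, write the final output once."""
    return 2 * n_elements * bytes_per_elem


n = 1 << 20  # about one million fp16 elements
saving = hbm_bytes_unfused(n, n_ops=3) / hbm_bytes_fused(n)
```

Under these assumptions a chain of k memory-bound ops moves k times the bytes of the fused version, so fusing three ops cuts HBM traffic to a third, before counting the saved launches.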

The catch: graph capture is shape-sensitive. Variable sequence length, video resolution, frame count and batch shape can turn “free speedup” into repeated recompilation unless shapes are bucketed or handled deliberately.
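One simple way to handle shapes deliberately is to round every sequence length up to a small set of buckets, so the compiler only ever sees a handful of shapes. The sketch below is pure Python; the bucket sizes and the names `bucket_length` and `pad_to_bucket` are assumptions for illustration, and real code would also need an attention mask for the padded positions.

```python
import bisect

BUCKETS = (128, 256, 512, 1024, 2048)  # hypothetical bucket sizes


def bucket_length(seq_len: int, buckets=BUCKETS) -> int:
    """Return the smallest bucket that fits seq_len."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence of length {seq_len} exceeds the largest bucket")
    return buckets[i]


def pad_to_bucket(tokens: list, pad_id: int = 0, buckets=BUCKETS) -> list:
    """Right-pad a token list so its length is one of the bucket sizes."""
    target = bucket_length(len(tokens), buckets)
    return tokens + [pad_id] * (target - len(tokens))
```

Lengths 100 through 130 collapse into just two shapes (128 and 256), so a shape-specialized compiled graph would be built at most twice rather than once per distinct length. The trade-off is wasted compute on padding, which is why the bucket spacing matters.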

Resources

Act 4: Custom Kernels and FlashAttention

Do not materialize large intermediates in HBM if they can be tiled, streamed, fused or consumed immediately. Do as much as you can in on-chip SRAM (shared memory and registers).

Matmul is the warmup example: naive global-memory reads, then coalescing, shared-memory tiling, register tiling, and eventually more hardware-specific tricks like async copies and warp specialization.
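The loop structure behind shared-memory tiling can be sketched on the CPU. This is plain Python over nested lists, not a kernel: the `a_tile` / `b_tile` slices stand in for tiles staged in shared memory, and each (i0, j0) pair stands in for one thread block. The tile size of 2 is arbitrary.

```python
def matmul_naive(a, b):
    """Reference: every output element re-reads a full row and column."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Blocked matmul: load a tile of A and B once, reuse it for a whole output tile."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # one output tile per "thread block"
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):    # march along the shared K dimension
                # Stand-in for staging tiles in shared memory.
                a_tile = [row[p0:p0 + tile] for row in a[i0:i0 + tile]]
                b_tile = [row[j0:j0 + tile] for row in b[p0:p0 + tile]]
                for i, a_row in enumerate(a_tile):
                    for j in range(len(b_tile[0])):
                        acc = 0
                        for p, a_val in enumerate(a_row):
                            acc += a_val * b_tile[p][j]
                        c[i0 + i][j0 + j] += acc
    return c
```

The point is reuse: each staged value of A and B is used for up to `tile` multiply-adds before being discarded, which is what raises arithmetic intensity relative to the naive version.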

FlashAttention is the attention version of the same idea: never build the full N × N score matrix in HBM. Tile the computation, keep the online softmax state, and stream blocks through fast memory.
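The online softmax state is the part that makes streaming possible, so here is a minimal sketch of it for a single query row. To keep it short the values are scalars (head dimension 1) and everything is plain Python; a real kernel does the same rescaling per tile with vector accumulators. The function names are illustrative.

```python
import math


def attention_row_naive(scores, values):
    """Softmax over the full score row, then a weighted sum of values."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def attention_row_streaming(scores, values, block=4):
    """Same result, but only one block of scores is ever held at a time."""
    running_max = float("-inf")
    running_sum = 0.0   # softmax denominator, relative to running_max
    acc = 0.0           # weighted sum of values, relative to running_max
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        new_max = max(running_max, max(s_blk))
        # Rescale old state to the new max (exp(-inf) is 0.0 on the first block).
        scale = math.exp(running_max - new_max)
        running_sum *= scale
        acc *= scale
        for s, v in zip(s_blk, v_blk):
            w = math.exp(s - new_max)
            running_sum += w
            acc += w * v
        running_max = new_max
    return acc / running_sum
```

Only three numbers of state survive between blocks, which is why the full score row (and hence the full score matrix) never needs to exist in memory.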

Kernel Resources

FlashAttention Resources

Act 5: Multi-GPU Training and Communication

Once one GPU is not enough, the memory hierarchy extends across devices. Moving data between GPUs is another expensive rung, so the same “move less / overlap more” principle returns.

Core vocabulary:

  • Data parallelism: replicate model, split data, all-reduce gradients.
  • Tensor parallelism: split layer math across devices.
  • Pipeline parallelism: split depth into stages; watch for bubbles.
  • FSDP / ZeRO: shard parameters, gradients and optimizer states.
  • Collectives: all-reduce, all-gather, reduce-scatter.
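The three collectives are related: an all-reduce gives the same result as a reduce-scatter followed by an all-gather. The sketch below simulates ranks as Python lists and models only the semantics, not a ring topology, bandwidth or overlap; it is not how NCCL is implemented.

```python
def reduce_scatter(buffers):
    """Rank r ends up owning the element-wise sum of everyone's r-th chunk."""
    world = len(buffers)
    n = len(buffers[0])
    if n % world != 0:
        raise ValueError("buffer length must be divisible by the number of ranks")
    chunk = n // world
    return [
        [sum(buf[r * chunk + i] for buf in buffers) for i in range(chunk)]
        for r in range(world)
    ]


def all_gather(shards):
    """Every rank ends up with the concatenation of all shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


def all_reduce(buffers):
    """Every rank ends up with the full element-wise sum."""
    return all_gather(reduce_scatter(buffers))


grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four gradient values each
```

Here `reduce_scatter(grads)` leaves rank 0 with `[11, 22]` and rank 1 with `[33, 44]`, and `all_reduce(grads)` leaves both ranks with `[11, 22, 33, 44]`. The split matters for FSDP / ZeRO, which keep state sharded and lean on reduce-scatter for gradients and all-gather for parameters instead of a single all-reduce.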

Resources

Act 6: Inference and Serving

Serving has a different performance shape from training. You are juggling requests, the KV cache, the prefill/decode split, batching and scheduling.

  • Prefill: lots of prompt tokens in parallel, often compute-heavy.
  • Decode: one token at a time, repeatedly reading KV cache, often memory-bound.
  • PagedAttention: treat KV cache like paged virtual memory to reduce fragmentation and increase concurrency.
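The paging idea can be sketched as a tiny block allocator: the KV cache is carved into fixed-size blocks, and each sequence keeps a block table mapping its logical positions to physical blocks. This is a toy model under assumed names (`PagedKVCache`, `append_token`), not the data structure of any particular serving engine.

```python
class PagedKVCache:
    """Toy block allocator: tracks where each token's KV entry would live."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_block, offset)."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length == len(table) * self.block_size:  # no block yet, or last one full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1
        return table[length // self.block_size], length % self.block_size

    def free(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)
```

Because blocks are allocated on demand and need not be contiguous, a sequence wastes at most one partly filled block instead of a reservation sized for the maximum length, which is where the extra concurrency comes from.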

Resources

Act 7: Quantization, Smaller Bytes and the Right Kernel

If a workload is memory-bound and you cannot move fewer values, move smaller values. Quantization reduces memory footprint and bandwidth demand, but it changes the accuracy and kernel-choice story.

The right kernel depends on the workload. Dequant-then-compute kernels can be good for memory-bound decode; native low-precision GEMM can be better for compute-heavy prefill on new hardware.
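As a concrete picture of "smaller values", here is a minimal sketch of symmetric absmax int8 quantization with a single scale per tensor. Real schemes usually use per-channel or per-group scales and handle outliers; this is the simplest variant, in plain Python with illustrative function names.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats; this is the 'dequant' in dequant-then-compute."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each restored value is within half a quantization step (`scale / 2`) of the original, and the stored tensor takes 1 byte per value instead of 2 for fp16, so a memory-bound decode step reads roughly half the weight bytes.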

Resources

Act 8: Monokernels and Killing Boundaries

If kernel boundaries, launch overhead and stragglers waste time, one extreme answer is to fuse much more aggressively. A monokernel/megakernel tries to keep the GPU busy by removing boundaries, loading ahead, and avoiding idle bubbles.

Resources

Compiler and Runtime Side Quests

Not everything is a Transformer. Tree inference and compiler/runtime work are also part of the performance map when the theme is “make model execution faster.”

People and Feeds

People whose posts tend to feed this GPU/performance list.

  • Simon Boehm - performance, inference and compiler-flavoured ML systems.
  • Horace He - performance models and PyTorch internals.
  • Tri Dao - attention, sequence models and systems-aware algorithms.
  • Tim Dettmers - quantization and efficient training/inference.
  • Mark Saroufim - PyTorch, kernels and production ML systems.
  • Jino Rohit - distributed systems, Torch and collective communication notes.
  • Ali Taha - model performance and video performance threads.
  • Thien Tran - GPU kernels and systems notes.
  • Daniel Han Chen - Unsloth / model efficiency.