GPU Brrr Reading List
I’ve found myself reading articles, blogs, papers and lecture notes as a form of passive learning, then turning the useful bits into implementation later.
This page is just the GPU/performance side of that habit. How do we make the GPU go brr? The answer?
- Understand the expensive bit of hardware
- Measure and profile
- Attack wasted movement, launches and idle time; keep everything close together
- Fuse it back together.
Act 1: The GPU Hierarchy
Three hierarchies get taught together because they interact constantly: execution hierarchy, hardware hierarchy and memory hierarchy.
- Execution: grid, blocks, warps, threads.
- Hardware: GPU, SMs, tensor cores / CUDA cores.
- Memory: registers, shared memory / SRAM, L2, HBM.
Small mental note: each block runs on a single SM (Streaming Multiprocessor, roughly the GPU equivalent of a CPU core), and the grid is spread across the whole GPU. Within the memory hierarchy, registers are closest/fastest, then shared memory, then L2, then global memory/HBM.
Core idea: compute is cheap; moving data is expensive. Most of the resources below are different ways of moving fewer bytes, moving smaller bytes, reusing bytes more often, or keeping the GPU busy while bytes are moving.
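To make the execution hierarchy concrete, here is a minimal Python sketch of the index arithmetic a 1-D CUDA kernel does on the device. The function names are made up for illustration; the only hardware fact assumed is the 32-thread warp on current NVIDIA GPUs.

```python
WARP_SIZE = 32  # warp width on current NVIDIA GPUs


def global_thread_index(block_idx: int, block_dim: int, thread_idx: int) -> int:
    """1-D equivalent of CUDA's blockIdx.x * blockDim.x + threadIdx.x."""
    return block_idx * block_dim + thread_idx


def warp_and_lane(thread_idx: int) -> tuple[int, int]:
    """Warp number inside the block, and the thread's lane inside that warp."""
    return thread_idx // WARP_SIZE, thread_idx % WARP_SIZE


def blocks_needed(n_elements: int, block_dim: int) -> int:
    """Grid size so every element gets a thread (ceiling division)."""
    return (n_elements + block_dim - 1) // block_dim


if __name__ == "__main__":
    # Thread 5 of block 2, with 256 threads per block, handles element 517.
    print(global_thread_index(block_idx=2, block_dim=256, thread_idx=5))
    # 1000 elements at 256 threads per block need a grid of 4 blocks.
    print(blocks_needed(1000, 256))
```

The last block is only partly full, which is why real kernels guard with an `if idx < n` check.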
Resources
- GPU Execution Model - Modern GPU Programming For MLSys - concise overview of the GPU execution model.
- Modern GPU Programming for ML Systems - full course/book-style path for GPU programming in ML systems.
- Give Me 30 min, I’ll Make CUDA Click Forever - CUDA mental model video.
- CUDA MODE lectures - CUDA/GPU programming lectures.
- CUDA MODE lecture 001 materials - lecture materials from the replay dump.
- Stanford CS149 Lecture 1: Why Parallelism? Why Efficiency? - parallel-computing context.
Act 1.5: The GPU Is a Moving Target
There is not one GPU. Each generation moves the numbers: HBM bandwidth and capacity, tensor-core throughput, supported precision formats and async data-movement machinery.
That matters because the best optimization depends on the hardware. Ampere, Hopper and Blackwell do not have the same precision formats, memory hierarchy details, or kernel sweet spots.
Resources
- Modern GPU Programming for ML Systems - useful for the generation-by-generation hardware context.
- Some matrix multiplication engines are not as accurate as we thought - hardware/runtime numerics are part of performance engineering.
- AI Systems Performance Engineering - book reference for AI performance engineering.
Act 2: Performance Models and Rooflines
Before touching a kernel, identify which resource is limiting it:
- Compute: math units are saturated.
- Memory: math units are waiting for bytes.
- Overhead: launches/setup/synchronization dominate.
The roofline model is the main picture: low arithmetic intensity lives on the memory-bandwidth diagonal; high arithmetic intensity hits the compute ceiling. A lot of deep learning optimization is pushing work up and right by increasing reuse or avoiding materialization.
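A rough sketch of the roofline arithmetic, in Python. The peak numbers are placeholders for a hypothetical GPU, not the spec of any real part, and the byte counts assume every operand crosses HBM exactly once.

```python
# Hypothetical hardware limits, chosen only to make the arithmetic easy.
PEAK_FLOPS = 100e12      # 100 TFLOP/s
PEAK_BANDWIDTH = 1e12    # 1 TB/s of HBM bandwidth


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def attainable_flops(intensity: float, peak_flops: float, peak_bandwidth: float) -> float:
    """Roofline: the lower of the compute ceiling and the bandwidth diagonal."""
    return min(peak_flops, intensity * peak_bandwidth)


def bound(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Classify a kernel by which side of the ridge point it sits on."""
    ridge = peak_flops / peak_bandwidth  # FLOP/byte where the two roofs meet
    return "memory" if intensity < ridge else "compute"


if __name__ == "__main__":
    # fp32 vector add: 1 FLOP per element, 12 bytes per element (2 reads, 1 write).
    add_ai = arithmetic_intensity(1, 12)
    # fp16 n x n matmul: 2n^3 FLOPs over 3 matrices of n^2 two-byte values.
    n = 4096
    mm_ai = arithmetic_intensity(2 * n**3, 3 * n * n * 2)
    for name, ai in [("vector add", add_ai), ("matmul", mm_ai)]:
        print(name, round(ai, 2), bound(ai, PEAK_FLOPS, PEAK_BANDWIDTH))
```

Under these assumed numbers the ridge point is 100 FLOP/byte, so the vector add sits far down the bandwidth diagonal while the large matmul is well past the ridge.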
Resources
- Making Deep Learning Go Brrrr From First Principles - roofline model, arithmetic intensity, and the base mental model for “why isn’t this faster?”
- How To Scale Your Model - the JAX scaling book; systems view of LLMs, rooflines, parallelism and large-scale training economics.
- Transformer Inference Arithmetic - forward-pass and KV-cache arithmetic for inference.
- Machine Learning Systems - broad ML systems book/reference.
- A Hitchhiker’s Guide to ML Training Infrastructure - broad overview of training infrastructure and hardware acceleration.
Act 3: Profiling, torch.compile and Fusion
The first optimization baseline is measurement. After that, the first automatic optimization is often compilation/fusion: capture the graph, remove Python overhead, fuse operations, and avoid HBM round-trips for intermediates.
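As a back-of-envelope illustration of why fusion helps, this Python sketch counts HBM traffic for a chain of unary pointwise ops. It is a simplified model under stated assumptions: each unfused op reads its whole input from HBM and writes its whole output back, and caches are ignored. The function names are illustrative only.

```python
def hbm_bytes_unfused(n_elements: int, n_ops: int, bytes_per_element: int = 4) -> int:
    """Each pointwise op does one full read and one full write of the tensor."""
    return n_ops * 2 * n_elements * bytes_per_element


def hbm_bytes_fused(n_elements: int, bytes_per_element: int = 4) -> int:
    """A fused kernel reads the input once and writes the final result once;
    intermediates stay in registers."""
    return 2 * n_elements * bytes_per_element


if __name__ == "__main__":
    n = 1_000_000
    unfused = hbm_bytes_unfused(n, n_ops=3)
    fused = hbm_bytes_fused(n)
    print(unfused, fused, unfused / fused)
```

For a memory-bound chain, the traffic ratio is roughly the speedup you can hope for, which is why fusing three pointwise ops can approach 3x.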
The catch: graph capture is shape-sensitive. Variable sequence length, video resolution, frame count and batch shape can turn “free speedup” into repeated recompilation unless shapes are bucketed or handled deliberately.
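One way to handle shapes deliberately is bucketing: round every sequence length up to one of a few fixed sizes so the compiler only ever sees a handful of shapes. A minimal sketch in plain Python; the bucket edges, `pad_id` and function names are assumptions for illustration, not part of any library API.

```python
import bisect

# Hypothetical bucket edges; pick these from your real length distribution.
DEFAULT_BUCKETS = (128, 256, 512, 1024, 2048)


def bucket_length(seq_len: int, buckets=DEFAULT_BUCKETS) -> int:
    """Round seq_len up to the smallest bucket that fits it."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence length {seq_len} exceeds largest bucket {buckets[-1]}")
    return buckets[i]


def pad_to_bucket(tokens: list, pad_id: int = 0, buckets=DEFAULT_BUCKETS) -> list:
    """Right-pad a token list so its length is exactly a bucket size."""
    target = bucket_length(len(tokens), buckets)
    return tokens + [pad_id] * (target - len(tokens))


if __name__ == "__main__":
    # 2048 possible lengths collapse to 5 distinct shapes.
    shapes = {bucket_length(n) for n in range(1, 2049)}
    print(sorted(shapes))
    print(len(pad_to_bucket([7] * 300)))
```

The trade-off is wasted compute on padding, so narrower buckets cost more compilations and wider buckets cost more padded tokens.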
Resources
- Profiling in PyTorch, Part 1: A Beginner’s Guide to torch.profiler - getting useful traces out of PyTorch.
- Profiling in PyTorch, Part 2: From nn.Linear to a Fused MLP - profiling through to fusion.
- GPU MODE Lecture 1: How to profile CUDA kernels in PyTorch - Nsight/PyTorch profiling notes.
- Aritra on X: Profiling deep learning layers - thread on layer profiling.
- Making GPUs Actually Fast: A Deep Dive into Training Performance - Jane Street video on training performance.
- JINO-ROHIT/ml-systems-notes - notes around Torch, distributed systems and ML systems.
Act 4: Custom Kernels and FlashAttention
Do not materialize large intermediates in HBM if they can be tiled, streamed, fused or consumed immediately. Do as much as you can in on-chip SRAM (registers and shared memory).
Matmul is the warmup example: naive global-memory reads, then coalescing, shared-memory tiling, register tiling, and eventually more hardware-specific tricks like async copies and warp specialization.
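The tiling step can be sketched in plain Python. This is a CPU illustration of the loop structure only, not a GPU kernel: the comment marks where a real kernel would stage tiles in shared memory, and `global_loads` is a simplified count that assumes square matrices and a tile size that divides `n`.

```python
def matmul_naive(a, b):
    """Reference triple loop: every multiply re-reads both operands."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][kk] * b[kk][j] for kk in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same result, computed one (tile x tile) output block at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # On a GPU, the A and B tiles for this step would be loaded
                # into shared memory once and reused by every thread in the block.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c


def global_loads(n: int, tile: int) -> int:
    """Element loads from global memory for an n x n matmul:
    (n/tile)^3 tile steps, each loading two tile x tile tiles."""
    return 2 * n**3 // tile


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(a, b) == matmul_naive(a, b))
    print(global_loads(1024, 1) // global_loads(1024, 32))
```

The load count drops by a factor of the tile size, which is the reuse that shared-memory tiling buys.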
FlashAttention is the attention version of the same idea: never build the full score matrix in HBM. Tile the computation, keep the online softmax state, and stream blocks through fast memory.
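The online softmax state is easiest to see for a single query. The sketch below is plain Python, with scalar values instead of vectors to keep it short. It streams scores and values block by block, keeping only a running max, a running normalizer and a running weighted sum. It illustrates the rescaling trick only, not the FlashAttention kernel itself.

```python
import math


def attention_reference(scores, values):
    """softmax(scores) . values with the full weight vector materialized."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * v for e, v in zip(exps, values))


def attention_streamed(scores, values, block_size=4):
    """Same result, seeing only block_size scores at a time."""
    m = -math.inf   # running max of scores seen so far
    l = 0.0         # running softmax normalizer, relative to m
    acc = 0.0       # running weighted sum of values, relative to m
    for start in range(0, len(scores), block_size):
        s_blk = scores[start:start + block_size]
        v_blk = values[start:start + block_size]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescale old state to the new max
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        acc = acc * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return acc / l


if __name__ == "__main__":
    scores = [0.1, 2.0, -1.5, 3.2, 0.0, 5.1, -0.7, 1.3, 4.4]
    values = [1.0, -2.0, 0.5, 3.0, 4.0, -1.0, 2.5, 0.0, 1.5]
    print(abs(attention_streamed(scores, values) - attention_reference(scores, values)) < 1e-12)
```

The full row of softmax weights never exists at once, which is what lets the real kernel keep its working set in fast memory.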
Kernel Resources
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance - Simon Boehm’s CUDA matmul worklog.
- Fast matrix multiplication on CPU - useful contrast with GPU performance thinking.
- Triton vector addition tutorial - first small Triton kernel.
- Triton Puzzles - practice problems for Triton.
- Hugging Face Kernel Builder - practical workflow for custom kernels.
- Lecture 106: Hugging Face Kernels - video on Hugging Face kernels.
- My first Multi-GPU kernel: Writing All-to-all for AMD MI300X - multi-GPU all-to-all kernel writeup.
- Making video go BRRRR - video performance thread.
FlashAttention Resources
- FlashAttention paper - attention as a memory-bound problem.
- FlashAttention Triton implementation - source-level reference.
- Flash Attention from Scratch Part 1 - implementation-oriented explanation.
- flash-attention-jax - JAX implementation.
- Causal FlashAttention in JAX - specific causal attention file.
Act 5: Multi-GPU Training and Communication
Once one GPU is not enough, the memory hierarchy extends across devices. Moving data between GPUs is another expensive rung, so the same “move less / overlap more” principle returns.
Core vocabulary:
- Data parallelism: replicate model, split data, all-reduce gradients.
- Tensor parallelism: split layer math across devices.
- Pipeline parallelism: split depth into stages; watch for bubbles.
- FSDP / ZeRO: shard parameters, gradients and optimizer states.
- Collectives: all-reduce, all-gather, reduce-scatter.
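The three collectives are related: an all-reduce gives the same result as a reduce-scatter followed by an all-gather, which is the decomposition that sharded schemes like FSDP/ZeRO lean on. A small single-process simulation in Python, where each "rank" is just a list. This is a sketch of the semantics, assuming buffer length divisible by world size, and not how a communication library implements them.

```python
def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum of all buffers."""
    n = len(per_rank[0])
    total = [sum(buf[i] for buf in per_rank) for i in range(n)]
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank):
    """Rank r ends up with only shard r of the elementwise sum."""
    world = len(per_rank)
    shard = len(per_rank[0]) // world
    out = []
    for r in range(world):
        lo, hi = r * shard, (r + 1) * shard
        out.append([sum(buf[i] for buf in per_rank) for i in range(lo, hi)])
    return out


def all_gather(shards):
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


if __name__ == "__main__":
    grads = [[1, 2, 3, 4], [10, 20, 30, 40]]  # two ranks, four gradient values each
    print(all_reduce(grads))
    print(all_gather(reduce_scatter(grads)) == all_reduce(grads))
```

Between the two halves, each rank holds only its shard of the reduced gradients, and that window is where a sharded optimizer step can run.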
Resources
- Stanford CS336: Language Modeling from Scratch - broad spine for language-model systems.
- CS336 lecture 2 trace - PyTorch/einops trace from the course.
- CS336 Lecture 2: PyTorch/einops - video lecture.
- The Ultra-Scale Playbook - Hugging Face/Nanotron guide to large-scale training.
- Ultra-Scale Playbook: Gradient Accumulation - specific section from the replay.
- Ultra-Scale Playbook: Kernels - specific section from the replay.
- Smol Training Playbook - practical training setup and parallelism choices.
- Megatron-LM pipeline parallelism - pipeline bubbles and large-scale transformer training.
- Jino Rohit’s collective communication notes - communication primitives and distributed performance intuition.
- Transport Muon: Beating Muon in Speed and Performance in 1 Newton Step - optimizer/performance reading.
Act 6: Inference and Serving
Serving has a different performance shape from training. You are juggling requests, the KV cache, the prefill/decode split, batching and scheduling.
- Prefill: lots of prompt tokens in parallel, often compute-heavy.
- Decode: one token at a time, repeatedly reading KV cache, often memory-bound.
- PagedAttention: treat KV cache like paged virtual memory to reduce fragmentation and increase concurrency.
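Some quick arithmetic shows why the KV cache dominates decode and why paging helps. The model shape below (32 layers, 8 KV heads, head dimension 128, fp16) is a made-up configuration for illustration, and the page size of 16 tokens is likewise an assumption.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes of KV cache for one sequence.
    The factor 2 is one K and one V vector per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def pages_needed(seq_len, page_size=16):
    """Fixed-size pages required to hold seq_len tokens (ceiling division)."""
    return -(-seq_len // page_size)


def wasted_slots_contiguous(seq_len, max_len):
    """Token slots reserved but unused when max_len is pre-allocated per request."""
    return max_len - seq_len


def wasted_slots_paged(seq_len, page_size=16):
    """With paging, only the tail of the last page is unused."""
    return pages_needed(seq_len, page_size) * page_size - seq_len


if __name__ == "__main__":
    per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
    print(per_token // 1024, "KiB per token")
    print(kv_cache_bytes(32, 8, 128, seq_len=4096) // 1024**2, "MiB at 4096 tokens")
    print(wasted_slots_contiguous(100, max_len=2048), wasted_slots_paged(100))
```

Every decode step re-reads that whole cache to produce one token, which is the memory-bound pattern described above. Paging bounds the waste per request to less than one page.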
Resources
- Fast LLM Inference From Scratch - arithmetic and performance constraints for LLM inference.
- PagedAttention - KV-cache paging as virtual memory.
- Popping the GPU Bubble - pipelined decoding and idle compute.
- Local LLM Inference Optimization: The Complete Guide - practical local inference optimization guide.
Act 7: Quantization, Smaller Bytes and the Right Kernel
If a workload is memory-bound and you cannot move fewer values, move smaller values. Quantization reduces memory footprint and bandwidth demand, but it changes the accuracy and kernel-choice story.
The right kernel depends on the workload. Dequant-then-compute kernels can be good for memory-bound decode; native low-precision GEMM can be better for compute-heavy prefill on new hardware.
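For a feel of what "move smaller values" costs in accuracy, here is a minimal symmetric absmax int8 round-trip in plain Python. It is one simple per-tensor scheme picked for illustration; production schemes (per-channel or per-group scales, QAT, and so on) are more involved, and this is not the method used by any of the resources below.

```python
def quantize_absmax_int8(values):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero input: any scale works
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats, as a dequant-then-compute kernel would."""
    return [qi * scale for qi in q]


if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25, 0.0, 0.8]
    q, scale = quantize_absmax_int8(weights)
    restored = dequantize(q, scale)
    max_err = max(abs(w - r) for w, r in zip(weights, restored))
    # Stored size drops from 4 bytes (fp32) to 1 byte per value, plus one scale.
    print(q, max_err <= scale / 2 + 1e-12)
```

The rounding error per value is at most half a quantization step, so a single outlier that inflates the scale hurts every other value in the tensor.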
Resources
- Gemma 4 QAT - Unsloth collection - quantization-aware training model collection.
- Gemma 4 QAT docs - Unsloth documentation for QAT models.
- unsloth/gemma-4-26B-A4B-it-qat-GGUF - GGUF model repo.
- The Magic of LLM Distillation - distillation talk, useful background for compression.
- Tim Dettmers - quantization and efficient training/inference writing.
Act 8: Monokernels and Killing Boundaries
If kernel boundaries, launch overhead and stragglers waste time, one extreme answer is to fuse much more aggressively. A monokernel/megakernel tries to keep the GPU busy by removing boundaries, loading ahead, and avoiding idle bubbles.
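A deliberately crude Python model of the boundary cost. It assumes kernels run strictly one after another and that each pays a fixed, unhidden boundary cost. Real launches are asynchronous and can partly overlap, and the microsecond figures are invented, so treat the result as an upper bound on the effect rather than a prediction.

```python
def step_time_us(kernel_work_us, boundary_overhead_us):
    """Serial toy model: each kernel pays one boundary cost plus its own work."""
    return sum(work + boundary_overhead_us for work in kernel_work_us)


def fuse_all(kernel_work_us):
    """Collapse every kernel into one; total work is unchanged."""
    return [sum(kernel_work_us)]


if __name__ == "__main__":
    # Hypothetical decode step: 100 tiny kernels of 5 us each, 5 us lost per boundary.
    kernels = [5.0] * 100
    print(step_time_us(kernels, 5.0))            # many small kernels
    print(step_time_us(fuse_all(kernels), 5.0))  # one fused kernel
```

The smaller each kernel is relative to the boundary cost, the more of the step is overhead, which is why the approach is aimed at low-latency, small-batch workloads.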
Resources
- Designing a Monokernel - Hazy Research post on no-bubbles/monokernel design.
Compiler and Runtime Side Quests
Not everything is a Transformer. Tree inference and compiler/runtime work are also part of the performance map when the theme is “make model execution faster.”
- lleaves article - compiled LightGBM inference.
- Simon Boehm’s blog - CUDA, compiler and performance posts.
People and Feeds
People whose posts tend to feed this GPU/performance list.
- Simon Boehm - performance, inference and compiler-flavoured ML systems.
- Horace He - performance models and PyTorch internals.
- Tri Dao - attention, sequence models and systems-aware algorithms.
- Tim Dettmers - quantization and efficient training/inference.
- Mark Saroufim - PyTorch, kernels and production ML systems.
- Jino Rohit - distributed systems, Torch and collective communication notes.
- Ali Taha - model performance and video performance threads.
- Thien Tran - GPU kernels and systems notes.
- Daniel Han Chen - Unsloth / model efficiency.