Kernel optimization
Kernel optimization makes GPU kernels run faster by improving how they use compute, memory bandwidth, and on-chip resources. For LLM inference, this typically means reducing data movement between GPU memory and compute units, increasing hardware utilization, and mapping workloads more carefully onto the GPU.
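As a rough intuition for why reducing memory movement matters: many LLM inference kernels are memory-bound, so fusing two elementwise passes into one halves the traffic to GPU memory. The sketch below is a hypothetical back-of-envelope model (the function names and byte counts are illustrative assumptions, not measurements from any real kernel):

```python
# Illustrative byte-traffic model for kernel fusion (hypothetical numbers).
# Unfused: pass 1 reads x and writes tmp; pass 2 reads tmp and writes y.
# Fused: a single pass reads x and writes y, skipping the tmp round trip.

def bytes_moved_unfused(n: int, itemsize: int = 4) -> int:
    # read x + write tmp + read tmp + write y = 4 array traversals
    return 4 * n * itemsize

def bytes_moved_fused(n: int, itemsize: int = 4) -> int:
    # read x + write y = 2 array traversals
    return 2 * n * itemsize

n = 1_000_000  # elements, e.g. one activation tensor in fp32
print(bytes_moved_unfused(n) / bytes_moved_fused(n))  # → 2.0
```

For a memory-bound kernel, that 2x reduction in bytes moved translates almost directly into a 2x speedup, which is why fusion is one of the first optimizations applied in practice.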
📄️ Kernel optimization for LLM Inference
Kernel optimization for LLM inference improves GPU utilization and performance by writing or generating optimized kernels tailored to the compute patterns of LLMs.
📄️ GPU architecture fundamentals
Understand GPU architecture fundamentals for kernel optimization, including threads, warps, streaming multiprocessors, memory hierarchy, and tensor cores.
📄️ Choosing the right kernel optimization tool
Compare the main tools for kernel optimization in LLM inference, from cuBLAS and cuDNN to TVM, XLA, Triton, custom CUDA kernels, Mojo, and MAX.
📄️ FlashAttention
FlashAttention is a fast, memory-efficient attention algorithm for Transformers that accelerates LLM training and inference and helps achieve longer context windows.