Kernel optimization
Kernel optimization makes GPU kernels run faster by improving how they use compute, memory bandwidth, and on-chip resources. For LLM inference, this typically means reducing data movement between GPU memory and compute units, increasing hardware utilization, and mapping workloads more carefully onto the GPU.
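As a rough intuition for why reducing memory movement matters: many LLM inference kernels are memory-bound, so fusing two elementwise passes into one halves the traffic to GPU memory. The sketch below is a hypothetical back-of-envelope model (the function names and byte counts are illustrative assumptions, not measurements from any real kernel):

```python
# Illustrative byte-traffic model for kernel fusion (hypothetical numbers).
# Unfused: pass 1 reads x and writes tmp; pass 2 reads tmp and writes y.
# Fused: a single pass reads x and writes y, skipping the tmp round trip.

def bytes_moved_unfused(n: int, itemsize: int = 4) -> int:
    # read x + write tmp + read tmp + write y = 4 array traversals
    return 4 * n * itemsize

def bytes_moved_fused(n: int, itemsize: int = 4) -> int:
    # read x + write y = 2 array traversals
    return 2 * n * itemsize

n = 1_000_000  # elements, e.g. one activation tensor in fp32
print(bytes_moved_unfused(n) / bytes_moved_fused(n))  # → 2.0
```

For a memory-bound kernel, that 2x reduction in bytes moved translates almost directly into a 2x speedup, which is why fusion is one of the first optimizations applied in practice.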
📄️ Kernel optimization for LLM Inference
Kernel optimization for LLM inference improves GPU utilization and performance by writing or generating optimized kernels tailored to the compute patterns of LLMs.
📄️ GPU architecture fundamentals
Understand GPU architecture fundamentals for kernel optimization, including threads, warps, streaming multiprocessors, memory hierarchy, and tensor cores.
📄️ Choosing the right kernel optimization tool
Compare the main tools for kernel optimization in LLM inference, from cuBLAS and cuDNN to TVM, XLA, Triton, custom CUDA kernels, Mojo, and MAX.
📄️ FlashAttention
FlashAttention is a fast, memory-efficient attention algorithm for Transformers that accelerates LLM training and inference and helps achieve longer context windows.