CUDA Performance Tuning: A Practical Guide
A systematic learning path from zero to independently tuning AI kernels
Course Lecture Notes
01
Roofline Mental Model — 先判断瓶颈,再谈优化
02
Reading Speed of Light — 读懂硬件性能极限
03
Memory Coalescing — 访存合并与对齐
04
Shared Memory — 共享内存优化
05
Bank Conflict and Tiling — Bank 冲突与分块
06
Occupancy and Launch Config — 占用率与启动配置
07
Dual GPU Roofline Hands-On — 双卡 Roofline 实操
08
Reduction Optimization — 规约优化
Roofline Cheatsheet: Quick Reference Card