- CUDA Programming Guide and CUDA C++ Best Practices Guide
- How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog
- PyTorch Performance Tuning Guide
- Earlier version of this guide from NVIDIA
- Docs for caching memory allocation in PyTorch
- Overview of `timeit` for microbenchmarking - PyTorch Benchmark tutorial
- Links on floating-point precision in different libraries and environments: 1 2
- On threading in PyTorch