Demystifying Tensor Cores to Optimize Half-Precision Matrix Multiply
added on 2020/05/15 @ 16:34:33 | 626 views| category: programming
tags: #nvidia #turing #programming

Half-precision matrix multiply has played a key role in the training of deep learning models. The newly designed NVIDIA Tensor Cores offer native instructions for half-precision small matrix multiply, on top of which Half-precision General Matrix Multiply (HGEMM) routines are developed and exposed through high-level APIs. In this paper, we demystify, for the first time, how Tensor Cores on the NVIDIA Turing architecture work in great detail, including the instructions used, the registers and data layout required, and the throughput and latency of Tensor Core operations.
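For context, the usual high-level route to Tensor Cores is CUDA's WMMA API, where one warp cooperatively multiplies small half-precision tiles. Below is a minimal sketch (the kernel name and the 16x16x16 tile shape are illustrative, not taken from the paper), assuming a Volta/Turing-class GPU and compilation with something like `-arch=sm_75`:

```cuda
#include <mma.h>
using namespace nvcuda;

// One warp computes a single 16x16x16 tile product: C = A * B (FP16 inputs, FP32 accumulate).
// Launch with exactly 32 threads (one warp); a, b, c point to 16x16 row-/col-major tiles.
__global__ void wmma_tile_gemm(const half *a, const half *b, float *c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);          // zero the accumulator
    wmma::load_matrix_sync(a_frag, a, 16);      // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // Tensor Core matrix-multiply-accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}
```

The paper's focus is on what happens beneath this API: the actual machine instructions emitted, how fragments map onto registers across the warp, and the measured throughput and latency of the underlying operations.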