Double-precision floating-point arithmetic (FP64) has been the de facto standard for engineering and scientific simulations for several decades. Problem complexity and the sheer volume of data coming from various instruments and sensors motivate researchers to mix and match various approaches to optimize compute resources, including different levels of floating-point precision. In recent years, machine learning has motivated hardware support for half-precision floating-point arithmetic. A primary challenge in high-performance computing is to leverage reduced-precision and mixed-precision hardware. We show how the FP16/FP32 Tensor Cores on NVIDIA GPUs can be exploited to accelerate the solution of linear systems of equations Ax = b without sacrificing numerical stability. The techniques we employ include multiprecision LU factorization, the preconditioned generalized minimal residual algorithm (GMRES), and scaling and auto-adaptive rounding to avoid overflow. We also show how to efficiently handle systems with multiple right-hand sides. On the NVIDIA Quadro GV100 (Volta) GPU, we achieve a 4×–5× performance increase and 5× better energy efficiency versus the standard FP64 implementation while maintaining an FP64 level of numerical stability.

Consider the solution of the linear system of equations Ax = b (1.1), where A is a large, dense n × n non-singular matrix. This can be done at a speed close to the peak performance of current computer architectures, for example by using libraries such as NVIDIA cuSOLVER, MAGMA and MKL, which redesign and highly optimize the standard LAPACK algorithms for GPU and multi-core architectures. These solvers use direct methods in a fixed working-precision arithmetic, namely the IEEE standard double-precision 64-bit floating-point arithmetic (FP64) or single-precision 32-bit floating-point arithmetic (FP32). Recently, various machine learning and artificial intelligence neural network applications have increased the need for half-precision arithmetic, and vendors have started to accelerate it in hardware, in the form of either the IEEE FP16 format or the bfloat16 format (table 1). Currently, the NVIDIA V100 Tensor Cores (TCs) can execute FP16-TC at up to 120 teraFLOP s⁻¹, versus 7.5 teraFLOP s⁻¹ for FP64 and 15 teraFLOP s⁻¹ for FP32. Thus, developing algorithms that exploit the much higher performance of lower-precision arithmetic can have a significant impact on scientific high-performance computing (HPC).

This paper presents a class of mixed-precision algorithms, and an accompanying set of computational techniques, that we have developed to accelerate the solution of (1.1) in FP64, which is the de facto standard for scientific computing. We show that the new mixed-precision techniques can accelerate the solution by a significant factor using the faster lower precisions, while still retaining FP64 quality. The mixed-precision iterative refinement algorithm computes an LU factorization of A in low precision, uses the LU factors to compute an initial approximation x0, and then carries out an iterative refinement process in FP64 arithmetic.
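To make that last description concrete, here is a minimal NumPy/SciPy sketch of mixed-precision iterative refinement. It is illustrative only: FP32 stands in for the low precision (the paper's factorization runs in FP16 on Tensor Cores), the function name `ir_solve` and its parameters are our own, and no GPU code is involved.

```python
# Minimal sketch of mixed-precision iterative refinement (illustrative only).
# FP32 plays the role of the low precision; the refinement loop runs in FP64.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def ir_solve(A, b, tol=1e-12, max_iters=50):
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)

    # 1) Factorize A once in the lower precision.
    lu, piv = lu_factor(A64.astype(np.float32))

    # 2) Initial approximation x0 from the low-precision factors.
    x = lu_solve((lu, piv), b64.astype(np.float32)).astype(np.float64)

    # 3) Refinement loop in FP64, reusing the same factors for each correction.
    for _ in range(max_iters):
        r = b64 - A64 @ x  # residual computed in FP64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b64):
            break
        d = lu_solve((lu, piv), r.astype(np.float32)).astype(np.float64)
        x += d
    return x

# Example use on a well-conditioned (diagonally dominant) random system:
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500)) + 500 * np.eye(500)
b = rng.standard_normal(500)
x = ir_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))  # residual near FP64 roundoff
```

The design point this sketch tries to capture is that the expensive O(n³) factorization is done only once, in the fast low precision, while each refinement step costs only O(n²): a residual in FP64 plus triangular solves with the reused low-precision factors.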
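The abstract also mentions preconditioned GMRES. When the matrix is too ill conditioned for plain refinement to converge, a GMRES-based variant instead solves the FP64 system with GMRES preconditioned by the low-precision LU factors. The following is a hedged sketch under the same FP32-for-FP16 assumption; the names `ir_gmres_solve` and `apply_prec` are ours, and the GMRES convergence-tolerance keyword is omitted because its name differs across SciPy versions.

```python
# Sketch of a GMRES-based refinement variant: FP64 GMRES preconditioned
# by the low-precision LU factors (illustrative only).
import numpy as np
from scipy.linalg import lu_factor, lu_solve
from scipy.sparse.linalg import LinearOperator, gmres

def ir_gmres_solve(A, b, restart=30, maxiter=200):
    A64 = np.asarray(A, dtype=np.float64)
    b64 = np.asarray(b, dtype=np.float64)
    n = A64.shape[0]

    # Low-precision factorization, as before.
    lu, piv = lu_factor(A64.astype(np.float32))

    # Preconditioner approximating A^{-1}, applied via the FP32 factors
    # and promoted back to FP64 for GMRES.
    def apply_prec(v):
        return lu_solve((lu, piv), v.astype(np.float32)).astype(np.float64)

    M = LinearOperator((n, n), matvec=apply_prec)

    # GMRES itself runs in FP64; the low-precision preconditioner is what
    # keeps the iteration count small.
    x, info = gmres(A64, b64, M=M, restart=restart, maxiter=maxiter)
    return x, info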
```