One of the most time-consuming tasks in the procedures for the numerical study of PDEs is the solution to linear systems of equations. To that purpose, iterative solvers are viewed as a promising alternative to direct methods on high-performance computers since, in theory, they are almost perfectly parallelizable. Their main drawback is the need of finding a suitable preconditioner to accelerate convergence. The factorized sparse approximate inverse (FSAI), mainly in its adaptive form, has proven to be an effective parallel preconditioner for several problems. In the present work, we report about two novel ideas to dynamically compute, on graphics processing units (GPUs), the FSAI sparsity pattern, which is the main task in its setup. The first approach, borrowed from the CPU implementation, uses a global array as a nonzero indicator, whereas the second one relies on a merge-sort procedure of multiple arrays. We will show that the second approach requires significantly less memory and overcomes issues related to the limited global memory available on GPUs. Numerical tests prove that the GPU implementation of FSAI allows for an average speed-up of 7.5 over a parallel CPU implementation. Moreover, we will show that the preconditioner computation is still feasible using single precision arithmetic with a further 20% reduction of the setup cost. Finally, the strong scalability of the overall approach in shown in a multi-GPU setting.
A dynamic pattern factored sparse approximate inverse preconditioner on graphics processing units
Bernaschi M;Carrozzo M;
2019
Abstract
One of the most time-consuming tasks in the procedures for the numerical study of PDEs is the solution to linear systems of equations. To that purpose, iterative solvers are viewed as a promising alternative to direct methods on high-performance computers since, in theory, they are almost perfectly parallelizable. Their main drawback is the need of finding a suitable preconditioner to accelerate convergence. The factorized sparse approximate inverse (FSAI), mainly in its adaptive form, has proven to be an effective parallel preconditioner for several problems. In the present work, we report about two novel ideas to dynamically compute, on graphics processing units (GPUs), the FSAI sparsity pattern, which is the main task in its setup. The first approach, borrowed from the CPU implementation, uses a global array as a nonzero indicator, whereas the second one relies on a merge-sort procedure of multiple arrays. We will show that the second approach requires significantly less memory and overcomes issues related to the limited global memory available on GPUs. Numerical tests prove that the GPU implementation of FSAI allows for an average speed-up of 7.5 over a parallel CPU implementation. Moreover, we will show that the preconditioner computation is still feasible using single precision arithmetic with a further 20% reduction of the setup cost. Finally, the strong scalability of the overall approach in shown in a multi-GPU setting.I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.