CNR Institutional Research Information System

Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in data-driven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of Nvidia GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it applies overlap between data communication and computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.

Communication-reduced Conjugate Gradient Variants for GPU-accelerated Clusters

Bernaschi Massimo;Carrozzo Mauro;Celestini Alessandro;Piperno Giacomo;D'ambra Pasqua

2025

Abstract

Linear solvers are key components in any software platform for scientific and engineering computing. The solution of large and sparse linear systems lies at the core of physics-driven numerical simulations relying on partial differential equations (PDEs) and often represents a significant bottleneck in data-driven procedures, such as scientific machine learning. In this paper, we present an efficient implementation of the preconditioned s-step Conjugate Gradient (CG) method, originally proposed by Chronopoulos and Gear in 1989, for large clusters of Nvidia GPU-accelerated computing nodes. The method, often referred to as communication-reduced or communication-avoiding CG, reduces global synchronizations and data communication steps compared to the standard approach, enhancing strong and weak scalability on parallel computers. Our main contribution is the design of a parallel solver that fully exploits the aggregation of low-granularity operations inherent to the s-step CG method to leverage the high throughput of GPU accelerators. Additionally, it applies overlap between data communication and computation in the multi-GPU sparse matrix-vector product. Experiments on classic benchmark datasets, derived from the discretization of the Poisson PDE, demonstrate the potential of the method.

Scheda breve

Scheda completa

Scheda completa (DC)

	Anno
	
				2025
			
	Strutture organizzative
	
				Istituto Applicazioni del Calcolo ''Mauro Picone''
			
	Parole chiave
	
				communication-reduced algorithms
GPUs
linear solvers
s-step preconditioned Krylov methods
			
	Appare nelle tipologie:
	
				04.01 Contributo in Atti di convegno

File in questo prodotto:

File	Dimensione	Formato
2501.03743v1.pdf accesso aperto Tipologia: Documento in Pre-print Licenza: Creative commons Dimensione 1.35 MB Formato Adobe PDF Visualizza/Apri	1.35 MB	Adobe PDF	Visualizza/Apri

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/20.500.14243/562542

Citazioni

ND

1

ND

social impact