Partitioned Reduction for Heterogeneous Environments

Giordano A.; D'ambrosio D.
2024

Abstract

Performance work in HPC applications currently centers on the efficiency of MPI, the de facto message-passing library for exploiting parallelism. Features such as multithreading and the overlap of communication with computation are under continuous study as MPI adapts to new platforms and to a growing number of processing units, GPUs in particular. In this context, the MPI-4.0 standard recently introduced partitioned point-to-point communication primitives to increase the overlap of computation and communication. This paper introduces an extension to MPI that applies partitioned communication to MPI reduction primitives. Traditional reductions process the complete input vector only after the GPU computation has finished. In contrast, the proposed method uses message partitioning to perform the reduction incrementally: each partition of the input vector is processed as soon as it becomes available, so the reduction no longer has to wait for the GPU computation to complete. Our results show promising benefits, particularly for large message sizes. However, synchronization points remain potential bottlenecks and require careful analysis and optimization.
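The incremental idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's MPI extension and does not use MPI at all: a Python thread stands in for the GPU computation, a queue stands in for the per-partition "ready" notification, and the names `produce_partitions` and `partitioned_sum` are hypothetical. The reduction is a plain sum, chosen only for illustration.

```python
import queue
import threading


def produce_partitions(data, num_partitions, ready):
    """Stand-in for the GPU computation: finish one partition at a time
    and announce it as soon as it is done (hypothetical helper)."""
    size = len(data) // num_partitions
    for p in range(num_partitions):
        # Hypothetical per-element work: square each value of the partition.
        chunk = [x * x for x in data[p * size:(p + 1) * size]]
        ready.put((p, chunk))  # mark partition p as available
    ready.put(None)  # sentinel: no more partitions


def partitioned_sum(data, num_partitions):
    """Reduce each partition as it arrives instead of waiting for the
    whole vector. Returns the total and the order partitions were reduced."""
    if num_partitions <= 0 or len(data) % num_partitions != 0:
        raise ValueError("len(data) must be a positive multiple of num_partitions")
    ready = queue.Queue()
    producer = threading.Thread(
        target=produce_partitions, args=(data, num_partitions, ready)
    )
    producer.start()
    total = 0
    order = []
    while True:
        item = ready.get()  # blocks until the next partition is available
        if item is None:
            break
        p, chunk = item
        total += sum(chunk)  # fold this partition into the running result
        order.append(p)
    producer.join()
    return total, order


if __name__ == "__main__":
    total, order = partitioned_sum(list(range(8)), 4)
    print(total, order)
```

In this sketch the consumer starts reducing while later partitions are still being produced, which is the overlap the abstract refers to. The blocking `ready.get()` call plays the role of a synchronization point.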
Istituto di Calcolo e Reti ad Alte Prestazioni - ICAR
distributed computing
GPU programming
MPI
partitioned communication
Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/508184