D2.3 Third release of MAX software: Report on the release of documentation of the performance optimised parts

Stefano Baroni; Pietro Delugas; Stefano de Gironcoli; Andrea Ferretti; Paolo Giannozzi; Nicola Spallanzani
2022

Abstract

Ensuring efficient use of the upcoming pre-exascale and exascale architectures is one of the key targets of the MAX project. In this context, WP2 focuses on providing the MAX flagship codes with performance-portable computational kernels. Together with the activities reported in WP4, a trend analysis clearly shows that forthcoming HPC architectures (including those relevant for EuroHPC) will be largely heterogeneous and mostly dominated by GPU accelerators. We therefore focused our attention on the corresponding programming frameworks and challenges. Our work is driven by the specific needs of our codes and methods, and we employed a variety of software engineering approaches and programming models for GPU accelerators. Beyond the use of CUDA (and CUDA Fortran) to extend the CPU performance of the codes to NVIDIA-based GPU hardware, the directive-based programming models OpenACC and OpenMP were employed to exploit the capabilities of hardware across vendors. As a main outcome of these activities, all MAX flagship codes are now able to exploit the compute power of at least the most widespread CPU (Intel and AMD) and GPU (NVIDIA) architectures. These efforts are documented by performance figures obtained on some of the most recent supercomputers in Europe. We not only achieved high single-node performance and satisfactory utilisation of a single GPU per node, but also focused on parallelisation efficiency, demonstrating parallel scaling and efficiency over many nodes in multiple benchmarks and use cases. Throughout this deliverable, these achievements are demonstrated by numerical examples and benchmarks for each MAX flagship code. Overall, this shows a clear path to the use of a significant share of the computational resources available on current supercomputers, as well as on future (pre-)exascale machines. Many kernels and computationally relevant parts of the codes and libraries have been redesigned to deliver performance through portable frameworks. Thereby, we prepare for future architectures such as AMD GPUs (used, e.g., in the upcoming pre-exascale machine LUMI) or Intel GPUs (expected to be released soon). However, due to the lack of hardware and software support, results for these efforts are still sparse, and we expect to obtain convincing numbers only after these systems are put into production and their compilers and low-level software are released.
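The directive-based porting strategy described in the abstract can be illustrated with a minimal, self-contained sketch. The kernel below is not taken from any MAX flagship code; it simply shows the same axpy-style loop offloaded once with OpenACC and once with OpenMP target directives, the kind of annotation that keeps a single code base portable across GPU vendors. Compiler names and flags in the comments are indicative assumptions and depend on the toolchain.

/* Minimal sketch (illustration only, not MAX project code): one loop,
 * two directive-based offload variants.  Compile e.g. with `nvc -acc`
 * for OpenACC or `clang -fopenmp -fopenmp-targets=nvptx64` for OpenMP
 * offloading; without offload support the pragmas are ignored and the
 * loops run on the CPU. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

/* y <- a*x + y, OpenACC version */
void axpy_acc(double a, const double *x, double *y, int n)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

/* y <- a*x + y, OpenMP target-offload version */
void axpy_omp(double a, const double *x, double *y, int n)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] += a * x[i];
}

int main(void)
{
    double *x = malloc(N * sizeof *x);
    double *y = malloc(N * sizeof *y);
    for (int i = 0; i < N; ++i) { x[i] = 1.0; y[i] = 2.0; }

    axpy_acc(0.5, x, y, N);   /* offloaded if an OpenACC runtime is present */
    axpy_omp(0.5, x, y, N);   /* same kernel through OpenMP offloading */

    printf("y[0] = %f\n", y[0]);   /* expected: 3.0 after both updates */
    free(x); free(y);
    return 0;
}

In practice, the same loop body carries both sets of directives (or is wrapped once per model), so the choice of OpenACC or OpenMP offloading becomes a build-time decision rather than a code rewrite.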
Istituto Nanoscienze - NANO - Sede Secondaria Modena
Istituto Officina dei Materiali - IOM -
Interim project report
MaX
Files in this record:
D2.3 Third release of MaX software_report on the release of documentation of the performance optimised parts.pdf
Access: open access
Licence: Public domain
Size: 1.79 MB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/20.500.14243/515856