D4.3 Second report on code profiling and bottleneck identification
Pietro D. Delugas; Andrea Ferretti; Nicola Spallanzani
2020
Abstract
In the present deliverable, we report the progress made in benchmarking the MaX flagship codes, with reference to the test cases defined in deliverable D4.2. Importantly, part of the benchmarks run from M6 to M18 (May 2019 - May 2020) were already reported in D1.2, together with the release of the MaX codes in November 2019. In the present document we therefore report the newest data, mostly harvested in a benchmarking campaign held during spring 2020. Notably, in March 2020 Marconi100, a >30 PFlops cluster based on IBM Power9 processors and NVIDIA V100 cards, entered production at Cineca. This gave us the unique opportunity to test the GPU porting of the MaX codes at scale, especially during the setup and pre-production period of the machine. In this deliverable we report the early results of the benchmarks on this machine, which allowed us to explore code behaviour on an architecture strongly unbalanced towards the GPUs; this is all the more relevant in view of the expected architecture of the EuroHPC pre-exascale machines to be deployed in early 2021. The campaign allowed us to find new bottlenecks and to target new development work. A number of the problems identified early on have already been addressed, and we were eventually able to run massively parallel calculations with the MaX codes (e.g. a ~20 PFlops single run of Yambo on 600 of the 980 nodes of Marconi100, to name one).

Concerning Quantum ESPRESSO (QE), the GPU port was extensively checked, including on large-scale systems. The results are very promising and helped us to identify memory-footprint bottlenecks, especially during diagonalization, further stressing the need for GPU-aware distributed linear algebra primitives. The Car-Parrinello kernel of QE was also recently ported to GPUs and benchmarked at scale, with very interesting results.

Yambo was ported to Marconi100 and turned out to be in excellent shape as far as the GPU port is concerned, except for a performance loss due to the dipole kernel; the benchmark data allowed us to pinpoint it and to propose a solution. Even more than in the QE case, the inclusion of GPU-aware distributed linear algebra libraries aimed at controlling memory usage was found to be critical.

FLEUR continued its work on the JURECA cluster at Juelich, especially in the direction of improving the load balancing of the matrix setup. A new exploitation of the k-point parallelism for large unit cells is discussed and, finally, the issue of performance fluctuations is reported.

BigDFT reports on the developments enabling the execution of calculations inside AiiDA. In addition, it discusses in depth the results of the development of libconv, a standalone library for the calculation of convolution elements which, by using code generation with a metaprogramming approach, can target many different underlying computer architectures.

CP2K reports on the results of the adoption of the COSMA library. These results look quite good, in particular for cRPA calculations.

SIESTA reports on the substantial speedups that can be achieved by using recent GPU-enabled versions of the ELPA library (directly and through the ELSI solver interface library). The SIESTA section also shows that the PEXSI method (not based on diagonalization) still offers the best scaling and the best opportunities for massive parallelization.
Finally, we report a proof of concept of the utilisation of AiiDA as a benchmarking tool, discussing its pros and cons in comparison with JUBE, another popular tool for benchmarking and performance analysis.
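To illustrate the idea, the following is a minimal sketch, not taken from the deliverable, of how AiiDA could drive a scaling benchmark from Python: the same calculation is submitted at several node counts, and every run is stored in the provenance graph. The code label 'pw-gpu@marconi100' and the resource settings are assumptions made for this example, and the physical inputs of the test case are elided.

from aiida import load_profile
from aiida.engine import submit
from aiida.orm import load_code

load_profile()

# Hypothetical label for a GPU-enabled pw.x executable configured
# in the local AiiDA database for Marconi100.
code = load_code('pw-gpu@marconi100')

for num_nodes in (2, 4, 8, 16):
    builder = code.get_builder()
    # The benchmark inputs (structure, parameters, pseudopotentials,
    # k-points) would be attached to the builder at this point.
    builder.metadata.options.resources = {
        'num_machines': num_nodes,
        'num_mpiprocs_per_machine': 4,  # e.g. one MPI rank per GPU
    }
    builder.metadata.options.max_wallclock_seconds = 3600
    calc = submit(builder)
    print(f'Submitted {num_nodes}-node run as pk={calc.pk}')

Because each submission is recorded together with its inputs and outputs, timings can later be extracted with AiiDA's QueryBuilder instead of being parsed by hand from job directories; JUBE, by contrast, organises runs through its own configuration files and workspaces.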
File: D4.3 Second report on code profiling and bottleneck identification.pdf
Access: open access
Licence: public domain
Size: 1.51 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.