NEXTMUSE - Deliverable D4.3: Second intermediate report on efficient parallel SPH and I/O
Marrone S;
2011
Abstract
In the previous deliverable D4.2, a study of different parallelization strategies was presented. The report focused on the main hardware considerations needed to achieve good efficiency. The first one was memory access and cache behavior: data must be gathered as much as possible to avoid cache misses and to benefit from the prefetch algorithms embedded in some processors. It was also shown that care must be taken to store the particle data efficiently, so that a link between spatial neighborhood and cache-memory neighborhood is established. This approach, called particle data sorting, reduces cache misses when computing the interactions between particles, because the data of two spatially neighboring particles are also close in cache memory. Particle data sorting based on Peano-Hilbert curves showed good speed-ups. The second point was the use of shared memory multiprocessor (SMP) architectures through the OpenMP API. These developments were implemented in the two 2D codes at INSEAN and HydrOcean and produced very encouraging parallel results. Nevertheless, this approach is limited by the number of cores available on standard SMP machines and is meant to be extended by a hybrid OpenMP/MPI implementation targeting large-scale parallel machines and clusters. The third point was the implementation on distributed memory architectures using the MPI paradigm. Parallel results were presented for the Andritz/ECL code. They showed that special care must be paid to the (dichotomy-based) domain decomposition procedure, since it has to be triggered periodically during the simulation because of particle motion. They also illustrated the fact that efficiency collapses quickly when the number of particles per core is insufficient.

The SPH codes of the project members target different end-user applications and do not implement exactly the same numerical methods, but they can all be enhanced by the previously studied parallel strategies. As an example, a hybrid OpenMP/MPI implementation has been carried out in the INSEAN SPH code following the recommendations of deliverable D4.2; the good results obtained are shown in Section 2. In real test cases with millions of particles, the domain decomposition procedure can take several hours, and it turned out to be the main bottleneck on large-scale architectures. An enhancement is proposed in Section 3, based on an Orthogonal Recursive Bisection (ORB) method with particle exchange relying on a parallel sorting algorithm. This method suffers from poor efficiency when few processors are available, whereas the previously implemented method, an ORB based on a dichotomy algorithm, becomes weak when many processors are used. The two methods are therefore combined, providing good results as shown in Section 3. A detailed study of the hybrid OpenMP/MPI and plain MPI implementations shows that the MPI communications are not fully overlapped by computations, even though they are coded with non-blocking MPI functions. The codes ASPHODEL and SPH-Flow are similar regarding their communication patterns, and a common analysis is therefore performed in Section 4, where preliminary profiling results are shown. Two new approaches are proposed: a first one which is less intrusive, and a second one taking into account the communications of the travelling particles arising from the domain decomposition procedure. Their implementation should increase efficiency when the number of particles per core decreases or when the number of processors increases.

The last section deals with the question of saving and reading the results of the simulation on large-scale architectures. Efficiently writing several million particles from many processors is a technical challenge. To remain interoperable with the other parts of the project, the introduction of the HDF5 library has been studied, and its efficiency is compared with the previous writing procedures.
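As a concrete illustration of the particle data sorting idea summarized above, the sketch below reorders a particle array along a space-filling curve so that spatial neighbors end up close in memory. For brevity it uses a Morton (Z-order) key rather than the Peano-Hilbert index discussed in the report; the Particle layout, the cell size and the assumption of non-negative coordinates are hypothetical choices made only for this example.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Minimal particle record used only for this illustration.
struct Particle {
    double x, y;      // position (assumed non-negative here)
    double vx, vy;    // velocity
    uint64_t key;     // space-filling-curve key used for sorting
};

// Spread the lower 32 bits of v so that one zero bit separates each bit.
static uint64_t spreadBits(uint64_t v) {
    v &= 0xFFFFFFFFull;
    v = (v | (v << 16)) & 0x0000FFFF0000FFFFull;
    v = (v | (v << 8))  & 0x00FF00FF00FF00FFull;
    v = (v | (v << 4))  & 0x0F0F0F0F0F0F0F0Full;
    v = (v | (v << 2))  & 0x3333333333333333ull;
    v = (v | (v << 1))  & 0x5555555555555555ull;
    return v;
}

// 2D Morton key: interleave the bits of the integer cell coordinates.
static uint64_t mortonKey(uint32_t ix, uint32_t iy) {
    return spreadBits(ix) | (spreadBits(iy) << 1);
}

// Reorder particles so that spatially close particles are also close in memory,
// which reduces cache misses during the particle-interaction loops.
void sortParticlesBySpaceFillingCurve(std::vector<Particle>& particles,
                                      double cellSize) {
    for (auto& p : particles) {
        const uint32_t ix = static_cast<uint32_t>(p.x / cellSize);
        const uint32_t iy = static_cast<uint32_t>(p.y / cellSize);
        p.key = mortonKey(ix, iy);
    }
    std::sort(particles.begin(), particles.end(),
              [](const Particle& a, const Particle& b) { return a.key < b.key; });
}
```

A Peano-Hilbert index would replace `mortonKey` while the rest of the sorting machinery stays the same; the Hilbert ordering generally gives slightly better locality at the price of a more involved key computation.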
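The Orthogonal Recursive Bisection mentioned for Section 3 can be sketched as a recursive median split: at each level the particle set is cut at the median coordinate along the longest axis of its bounding box, until one sub-domain per processor remains. The serial, shared-memory formulation below is a simplified assumption for illustration only; the report's actual scheme is distributed, combines ORB with a dichotomy-based method and relies on a parallel sorting algorithm for the particle exchange.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Particle2D {
    double x, y;
};

// Recursively assign the particles in [first, last) to `numDomains` sub-domains
// by splitting at the median along the longest axis of the bounding box.
void orthogonalRecursiveBisection(std::vector<Particle2D>& particles,
                                  std::size_t first, std::size_t last,
                                  int numDomains, int firstDomainId,
                                  std::vector<int>& domainOf) {
    if (numDomains <= 1) {
        for (std::size_t i = first; i < last; ++i) domainOf[i] = firstDomainId;
        return;
    }
    if (first == last) return;

    // Bounding box of the current particle subset.
    double xmin = particles[first].x, xmax = xmin;
    double ymin = particles[first].y, ymax = ymin;
    for (std::size_t i = first; i < last; ++i) {
        xmin = std::min(xmin, particles[i].x); xmax = std::max(xmax, particles[i].x);
        ymin = std::min(ymin, particles[i].y); ymax = std::max(ymax, particles[i].y);
    }
    const bool splitAlongX = (xmax - xmin) >= (ymax - ymin);

    // Split the domain count, and the particle count proportionally (load balance).
    const int leftDomains = numDomains / 2;
    const std::size_t mid = first + ((last - first) * static_cast<std::size_t>(leftDomains))
                                        / static_cast<std::size_t>(numDomains);

    // Partial sort so that the `mid`-th particle sits at its median position
    // along the chosen axis; cheaper than a full sort.
    std::nth_element(particles.begin() + first, particles.begin() + mid,
                     particles.begin() + last,
                     [splitAlongX](const Particle2D& a, const Particle2D& b) {
                         return splitAlongX ? a.x < b.x : a.y < b.y;
                     });

    orthogonalRecursiveBisection(particles, first, mid, leftDomains,
                                 firstDomainId, domainOf);
    orthogonalRecursiveBisection(particles, mid, last, numDomains - leftDomains,
                                 firstDomainId + leftDomains, domainOf);
}
```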
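The communication/computation overlap analysed in Section 4 follows the usual non-blocking pattern: post the halo exchanges with MPI_Irecv/MPI_Isend, update the interior particles (which need no remote data) with OpenMP threads while the messages are in flight, and only then process the particles that interact with halo particles. All buffer names, counts and the per-particle kernels below are hypothetical; the sketch only illustrates the intended hybrid OpenMP/MPI structure.

```cpp
#include <cstddef>
#include <mpi.h>
#include <vector>

// Hypothetical per-particle kernels: interior particles need no remote data,
// boundary particles need the received halo buffers.
void computeInteriorInteractions(int particleIndex);
void computeBoundaryInteractions(int particleIndex,
                                 const std::vector<std::vector<double>>& halos);

void timeStepWithOverlap(int numInterior, int numBoundary,
                         const std::vector<int>& neighborRanks,
                         std::vector<std::vector<double>>& sendBuffers,
                         std::vector<std::vector<double>>& recvBuffers) {
    std::vector<MPI_Request> requests(2 * neighborRanks.size());

    // 1. Post non-blocking receives and sends for the halo particle data.
    for (std::size_t i = 0; i < neighborRanks.size(); ++i) {
        MPI_Irecv(recvBuffers[i].data(), static_cast<int>(recvBuffers[i].size()),
                  MPI_DOUBLE, neighborRanks[i], 0, MPI_COMM_WORLD, &requests[2 * i]);
        MPI_Isend(sendBuffers[i].data(), static_cast<int>(sendBuffers[i].size()),
                  MPI_DOUBLE, neighborRanks[i], 0, MPI_COMM_WORLD, &requests[2 * i + 1]);
    }

    // 2. Overlap: update interior particles while the messages are in flight;
    //    this loop is threaded with OpenMP in the hybrid version.
    #pragma omp parallel for schedule(dynamic)
    for (int p = 0; p < numInterior; ++p) {
        computeInteriorInteractions(p);
    }

    // 3. Wait for the halo exchange to complete, then treat the particles
    //    that interact with remote (halo) particles.
    MPI_Waitall(static_cast<int>(requests.size()), requests.data(),
                MPI_STATUSES_IGNORE);
    for (int p = 0; p < numBoundary; ++p) {
        computeBoundaryInteractions(p, recvBuffers);
    }
}
```

Note that many MPI implementations only progress messages inside MPI calls, so posting non-blocking requests does not by itself guarantee overlap; this is consistent with the report's observation that communications are not fully hidden even when non-blocking functions are used.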
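Finally, HDF5-based output of the kind discussed in the last section typically relies on the library's parallel (MPI-IO) driver, with each rank writing its own contiguous slab of a global particle dataset in a collective operation. The dataset name, the single double-precision field and the collective transfer mode below are assumptions of this sketch, not a description of the report's actual file layout.

```cpp
#include <hdf5.h>
#include <mpi.h>
#include <vector>

// Collectively write one double-precision value per particle (e.g. pressure)
// into a single global dataset; every rank owns localValues.size() particles
// and writes them at `offset` in the global array of `globalCount` entries.
void writeParticleField(MPI_Comm comm, const char* filename,
                        const std::vector<double>& localValues,
                        hsize_t offset, hsize_t globalCount) {
    // File access property list: switch HDF5 to its MPI-IO driver.
    hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
    H5Pset_fapl_mpio(fapl, comm, MPI_INFO_NULL);
    hid_t file = H5Fcreate(filename, H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
    H5Pclose(fapl);

    // One global 1D dataset shared by all ranks.
    hsize_t dims[1] = {globalCount};
    hid_t filespace = H5Screate_simple(1, dims, NULL);
    hid_t dset = H5Dcreate2(file, "pressure", H5T_NATIVE_DOUBLE, filespace,
                            H5P_DEFAULT, H5P_DEFAULT, H5P_DEFAULT);

    // Each rank selects its own contiguous slab of the global dataset.
    hsize_t localDims[1] = {static_cast<hsize_t>(localValues.size())};
    hid_t memspace = H5Screate_simple(1, localDims, NULL);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, &offset, NULL, localDims, NULL);

    // Collective transfer: all ranks participate in a single parallel write.
    hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);
    H5Dwrite(dset, H5T_NATIVE_DOUBLE, memspace, filespace, dxpl, localValues.data());

    H5Pclose(dxpl);
    H5Sclose(memspace);
    H5Sclose(filespace);
    H5Dclose(dset);
    H5Fclose(file);
}
```

In a typical use, each rank would obtain its `offset` from an exclusive prefix sum of the local particle counts (e.g. via MPI_Exscan) before calling such a routine, so that the per-rank slabs tile the global dataset without gaps.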


