High Performance Computing on Vector Systems-P8: In March 2005 about 40 scientists from Europe, Japan and the US came together for the second time to discuss ways to achieve sustained performance on supercomputers in the Teraflops range. The workshop, held at the High Performance Computing Center Stuttgart (HLRS), was the second of its kind; the first had been held in May 2004.

Simulations of Supernovae (K. Kifonidis et al.)

used in both the hydrodynamic and the neutrino transport parts of the code. Thus one needs to perform logically independent, lower-dimensional sub-integrations in order to solve a multi-dimensional problem. For instance, the N_θ × N_ν and N_r × N_ε × N_ν integrations resulting within the r and θ transport sweeps, respectively, can be performed in parallel with coarse granularity. The routines used to perform the lower-dimensional sub-integrations are then completely vectorized.

Figure 4 shows scaling results of the OpenMP code version on an SGI Altix 3700 Bx2 using Itanium2 CPUs with 6 MB L3 caches. The measurements are for the S and M setups of Table 1. The Thomas solver has been used to invert the Jacobians. The speedup is initially superlinear, while on 64 processors it is close to 60 (a parallel efficiency of roughly 94%), demonstrating the efficiency of the employed parallelization strategy. Note that static scheduling of the parallel sub-integrations has been applied because the Altix is a ccNUMA machine, which requires a minimization of remote memory references to achieve good scaling. Dynamic scheduling would not guarantee this, although it would actually be preferable from the algorithmic point of view because it gives optimal load balancing.

Table 1. Some typical setups with different resolutions.

Setup  N_r^hyd  N_r  N_θ  N_ε  N_ν
XS     400      234   32   17    3
S      400      234  126   17    3
M      400      234  256   17    3
L      800      468  512   17    3
XL     800      468  512   34    3

Fig. 4. Scaling of PROMETHEUS VERTEX on the SGI Altix 3700 Bx2.

Table 2.
First measurements of the OpenMP code version on a single compute node of an NEC SX-6 and an NEC SX-8. Times are given in seconds.

Measurements on the SX-6
Setup  NCPUs  avg. wallclock time/cycle [sec]  Speedup  MFLOPs
XS     1                                                  2708
XS     4                                                  9339
XS     8                                                 15844

Measurements on the SX-8
Setup  NCPUs  avg. wallclock time/cycle [sec]  Speedup  MFLOPs
XS     1                                                  4119
XS     4      .
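
The trade-off between static and dynamic scheduling mentioned for the Altix measurements can be illustrated with a small sketch. It is not taken from the PROMETHEUS/VERTEX code: the function names `static_chunks` and `load_imbalance` are hypothetical, and the block partition only imitates what a static OpenMP schedule without an explicit chunk size typically does. Each worker always receives the same contiguous block of sub-integrations, so its data can stay in local memory on a ccNUMA machine, at the price of load imbalance when the sub-integrations differ in cost.

```python
def static_chunks(n_tasks, n_workers):
    """Split task indices 0..n_tasks-1 into n_workers contiguous blocks.

    Block sizes differ by at most one, similar to a static schedule
    without an explicit chunk size (an assumption of this sketch).
    """
    base, extra = divmod(n_tasks, n_workers)
    chunks = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional task.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(start, start + size))
        start += size
    return chunks


def load_imbalance(costs, chunks):
    """Ratio of the heaviest worker's load to the mean load per worker."""
    loads = [sum(costs[i] for i in chunk) for chunk in chunks]
    return max(loads) / (sum(loads) / len(loads))


if __name__ == "__main__":
    # Hypothetical example: 96 independent sub-integrations, e.g. 32 angular
    # zones times 3 neutrino species (an interpretation of the XS row of
    # Table 1), distributed over 8 workers.
    n_tasks, n_workers = 96, 8
    chunks = static_chunks(n_tasks, n_workers)

    uniform = [1.0] * n_tasks
    # Made-up cost profile: the first 12 tasks are twice as expensive.
    skewed = [2.0 if i < 12 else 1.0 for i in range(n_tasks)]

    print("block sizes:", [len(c) for c in chunks])
    print("imbalance (uniform costs):", load_imbalance(uniform, chunks))
    print("imbalance (skewed costs): ", load_imbalance(skewed, chunks))
```

With uniform costs the imbalance ratio is 1, so the static partition loses nothing. With the skewed profile the first worker becomes the bottleneck, which is the situation where dynamic scheduling would balance the load better but give up the fixed worker-to-data mapping.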