High Performance Computing on Vector Systems-P7: In March 2005 about 40 scientists from Europe, Japan and the US came together for the second time to discuss ways to achieve sustained performance on supercomputers in the range of Teraflops. The workshop, held at the High Performance Computing Center Stuttgart (HLRS), was the second of its kind. The first one had been held in May 2004.

Atomistic Simulations (F. Gähler, K. Benkert)

The Opteron system also shows excellent performance, but only for the two larger system sizes. The small systems seem to suffer from the interconnect latency. The performance penalty saturates, however, at about 20%. We should also mention that these measurements were made with binaries compiled with gcc. We expect that using the PathScale or Intel compilers would result in a 5-10% improvement. Finally, the IBM Regatta system is the slowest of the four, but it also shows excellent scaling for all system sizes. For very small CPU numbers the performance was a bit erratic, which may be due to interference from other processes running on the same 32-CPU node.

Fig. 3. Scaling of IMD on the Itanium (top) and Xeon (bottom) systems, plotted against the number of CPUs

Fig. 4. Scaling of IMD on the Opteron (top) and IBM Regatta (bottom) systems, plotted against the number of CPUs

4 Classical Molecular Dynamics on the NEC SX

The algorithm for the force computation sketched in Sect. suffers from two problems when executed on vector computers. First, the innermost loop over interacting neighbor particles is usually too short. Second, the storage of the particle data in per-cell arrays leads to an extra level of indirect addressing. The latter problem could be solved in IMD by using a different memory layout for the vector version, in which the particle data is stored in single big arrays and not in per-cell arrays. The cells then contain only indices into the big particle list.
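The two memory layouts can be illustrated with a minimal C sketch. It is not taken from the IMD sources: the type names (scalar_cell, vector_cell), the global array atoms_pos, the capacities and the accessor macros POS_SCALAR and POS_VECTOR are all hypothetical and chosen only for illustration. In the first layout each cell owns its particle positions. In the second, positions live in one big array and a cell holds only indices into it.

```c
#include <assert.h>

#define MAX_PER_CELL 4   /* hypothetical capacity of one cell */
#define MAX_ATOMS    16  /* hypothetical capacity of the big particle list */

/* Per-cell layout: every cell carries its own particle data. */
typedef struct {
    int    n;                     /* number of particles in this cell */
    double pos[MAX_PER_CELL][3];  /* positions stored inside the cell */
} scalar_cell;

/* Single-big-array layout: all positions live in one global array ... */
static double atoms_pos[MAX_ATOMS][3];

/* ... and a cell stores only indices into that array. */
typedef struct {
    int n;                        /* number of particles in this cell */
    int ind[MAX_PER_CELL];        /* indices into atoms_pos */
} vector_cell;

/* Hypothetical accessors: same call shape, different storage behind it. */
#define POS_SCALAR(cell, i, d) ((cell)->pos[(i)][(d)])
#define POS_VECTOR(cell, i, d) (atoms_pos[(cell)->ind[(i)]][(d)])

/* Sum of x coordinates over all cells, per-cell layout. */
double sum_x_scalar(const scalar_cell *cells, int ncells)
{
    double s = 0.0;
    for (int c = 0; c < ncells; ++c)
        for (int i = 0; i < cells[c].n; ++i)
            s += POS_SCALAR(&cells[c], i, 0);
    return s;
}

/* Same loop structure, single-big-array layout. */
double sum_x_vector(const vector_cell *cells, int ncells)
{
    double s = 0.0;
    for (int c = 0; c < ncells; ++c)
        for (int i = 0; i < cells[c].n; ++i) {
            /* guard against an index outside the big particle list */
            assert(cells[c].ind[i] >= 0 && cells[c].ind[i] < MAX_ATOMS);
            s += POS_VECTOR(&cells[c], i, 0);
        }
    return s;
}
```

Both functions have the same loop structure and differ only in the accessor used, so the bulk of such code can stay identical across the two layouts.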
In order to keep as much code as possible in common between the vector and the scalar versions of IMD, all particle data is accessed via preprocessor macros. The main difference between the two versions of the code is consequently the use of two different sets of access macros. The problem of the short loops has to be solved